The MapReduce Programming Paradigm
MapReduce is a programming paradigm that was designed to allow parallel distributed processing of large sets of data, converting them to sets of tuples, and then combining and reducing those tuples into smaller sets of tuples. In layman’s terms, MapReduce was designed to take big data and use parallel distributed computing to turn big data into little- or regular-sized data.
Parallel distributed processing refers to a powerful framework where mass volumes of data are processed very quickly by distributing processing tasks across clusters of commodity servers. With respect to MapReduce, tuples refer to key-value pairs by which data is grouped, sorted, and processed.
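To make the idea of key-value tuples concrete, here's a minimal sketch in plain Python (the page-view records are hypothetical, and this is ordinary single-machine code, not a MapReduce framework):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical page-view records expressed as (key, value) tuples,
# where the key names a page and the value is a view count.
records = [("home", 3), ("about", 1), ("home", 2), ("about", 4)]

# MapReduce-style processing groups and sorts tuples by key;
# here we do the same with plain Python.
records.sort(key=itemgetter(0))
grouped = {key: [value for _, value in group]
           for key, group in groupby(records, key=itemgetter(0))}

print(grouped)  # {'about': [1, 4], 'home': [3, 2]}
```

Once the values for each key sit together like this, any per-key processing (summing, counting, filtering) becomes straightforward.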
MapReduce jobs run as a sequence of map and reduce operations across a distributed set of servers. In the map task, you break your data into key-value pairs, transforming and filtering it along the way. Then you assign the data to nodes for processing.
In the reduce task, you aggregate that data down into smaller datasets. Data from the reduce step is transformed into a standard key-value format, where the key acts as the record identifier and the value holds the data that the key identifies. The cluster's computing nodes process the map and reduce tasks that are defined by the user. This work is done in accordance with the following two steps:
Map the data.
The incoming data must first be converted into key-value pairs and divided into fragments, which are then assigned to map tasks. Each computing cluster (a group of nodes that are connected to one another and perform a shared computing task) is assigned a number of map tasks, which are subsequently distributed among its nodes.
As the map tasks process the key-value pairs, they generate intermediate key-value pairs. These intermediate pairs are sorted by key, and the sorted list is divided into a new set of fragments; the number of these new fragments always equals the number of reduce tasks.
Reduce the data.
Every reduce task has a fragment assigned to it. The reduce task simply processes the fragment and produces an output, which is also a key-value pair. Reduce tasks are also distributed among the different nodes of the cluster. After the task is completed, the final output is written onto a file system.
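The two steps above can be sketched in miniature with plain Python. This is a single-machine toy, not a real distributed framework, and the word-count job and document strings are purely illustrative:

```python
from collections import defaultdict

def map_task(document):
    """Map step: emit an intermediate (word, 1) pair for every word."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(intermediate_pairs):
    """Group intermediate pairs by key, as the framework would do
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce step: aggregate all values for one key into a single pair."""
    return (key, sum(values))

documents = ["big data big results", "big clusters"]

intermediate = [pair for doc in documents for pair in map_task(doc)]
grouped = shuffle(intermediate)
final_output = dict(reduce_task(key, values)
                    for key, values in grouped.items())

print(final_output)  # {'big': 3, 'data': 1, 'results': 1, 'clusters': 1}
```

In a real cluster, the map tasks, the shuffle, and the reduce tasks each run in parallel on different nodes; the logic per key, however, looks much like this.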
In short, map and reduce tasks let you quickly and efficiently boil down, and begin to make sense of, a huge volume, velocity, and variety of data: you tag your data with (key, value) pairs and then reduce those pairs into smaller sets of data through aggregation operations, which are operations that combine multiple values from a dataset into a single value. A diagram of the MapReduce architecture can be found here.
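Python's `functools.reduce` is a handy way to see what an aggregation operation does: it folds many values into one by applying a combining function pairwise. A small sketch (the values are hypothetical):

```python
from functools import reduce

# Hypothetical values collected for a single key during the reduce step.
values = [4, 8, 15, 16]

# Each aggregation operation combines many values into a single value.
total = reduce(lambda acc, v: acc + v, values)        # running sum
maximum = reduce(lambda acc, v: max(acc, v), values)  # running max

print(total, maximum)  # 43 16
```

Sums, counts, maximums, and averages are all aggregations of this kind, which is why data that groups naturally by key fits MapReduce so well.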
If your data doesn’t lend itself to being tagged and processed via keys, values, and aggregation, then MapReduce generally isn’t a good fit for your needs.
If you’re using MapReduce as part of a Hadoop solution, then the final output is written onto the Hadoop Distributed File System (HDFS). HDFS is a file system that includes clusters of commodity servers that are used to store big data. HDFS makes big data handling and storage financially feasible by distributing storage tasks across clusters of cheap commodity servers.