The MapReduce Application Flow in Hadoop

By Dirk deRoos

At its core, MapReduce is a programming model for processing data sets that are stored in a distributed manner across a Hadoop cluster’s slave nodes. The key concept here is divide and conquer. Specifically, you want to break a large data set into many smaller pieces and process them in parallel with the same algorithm.

With the Hadoop Distributed File System (HDFS), the files are already divided into bite-sized pieces: the data blocks that HDFS spreads across the cluster’s slave nodes. MapReduce is what you use to process all of those pieces.
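
To see that a file in HDFS really is stored as a set of blocks scattered across the slave nodes, you can ask the file system for its block locations yourself. The following is a minimal sketch; the /data/input.txt path and the ListBlocks class name are just placeholders, while the FileSystem and BlockLocation calls are standard Hadoop APIs.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input.txt");   // placeholder path to a file in HDFS
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation is one of the "bite-sized pieces": a block of the file,
        // along with the slave nodes that hold copies of it.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("offset=" + block.getOffset()
              + " length=" + block.getLength()
              + " hosts=" + String.join(",", block.getHosts()));
        }
      }
    }

Each line of output corresponds to one block that a mapper task could process in parallel with the others.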

MapReduce applications have multiple phases, as spelled out in this list (a code sketch tying the phases together follows the list):

  1. Determine the exact data sets to process from the data blocks. This involves calculating where the records to be processed are located within the data blocks.

  2. Run the specified algorithm against each record in the data set until all the records are processed.

    The individual instance of the application running against a block of data in a data set is known as a mapper task. (This is the mapping part of MapReduce.)

  3. Locally perform an interim reduction of the output of each mapper.

    (The outputs are provisionally combined, in other words; Hadoop calls this a combiner.) This phase is optional because, in some common cases, it isn’t desirable: if the operation can’t safely be applied to partial results (calculating an average is the classic example), combining the mapper outputs locally would change the final answer.

  4. Based on the partitioning requirements, group the applicable partitions of data from each mapper’s result sets so that all the records sharing a key end up at the same reducer.

  5. Boil down the result sets from the mappers into a single result set — the Reduce part of MapReduce.

    An individual instance of the application running against mapper output data is known as a reducer task. (As strange as it may seem, since “Reduce” is part of the MapReduce name, this phase can be optional; applications without a reducer are known as map-only jobs, which can be useful when there’s no need to combine the result sets from the map tasks.)
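
Here’s a minimal sketch of how these phases come together in code, using the classic word-count pattern. The class names (WordCount, TokenizerMapper, IntSumReducer) and the command-line arguments for the input and output paths are illustrative choices; the Mapper, Reducer, and Job classes are the standard ones from the org.apache.hadoop.mapreduce package.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper task (Steps 1 and 2): runs once per input split and emits (word, 1)
      // for every word in every record it is given.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer task (Step 5): receives all the counts for one word (Step 4 guarantees
      // they all arrive at the same reducer) and boils them down to a single total.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // Step 3: optional interim reduction on each mapper's output
        job.setReducerClass(IntSumReducer.class);   // Step 5: the Reduce part of MapReduce
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input files in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // directory for the reducer output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The setCombinerClass call is safe here because adding up partial counts doesn’t change the final totals; for an operation like averaging, you’d leave it out. And calling job.setNumReduceTasks(0) instead of setting a reducer class would turn this into a map-only job, as described in Step 5.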