
Distributed Processing with Hadoop MapReduce

By Dirk deRoos

Hadoop MapReduce applies a sequence of operations to distributed data sets. The data consists of key-value pairs, and the computations have only two phases: a map phase and a reduce phase. User-defined MapReduce jobs run on the compute nodes in the cluster.

Generally speaking, a MapReduce job runs as follows:

  1. During the Map phase, input data is split into a large number of fragments, each of which is assigned to a map task.

  2. These map tasks are distributed across the cluster.

  3. Each map task processes the key-value pairs from its assigned fragment and produces a set of intermediate key-value pairs.

  4. The intermediate data set is sorted by key, and the sorted data is partitioned into a number of fragments that matches the number of reduce tasks.

  5. During the Reduce phase, each reduce task processes the data fragment that was assigned to it and produces a set of output key-value pairs.

  6. These reduce tasks are also distributed across the cluster and write their output to HDFS when finished.
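The six steps above can be sketched as a toy, single-process simulation. This is not the Hadoop API; the function names (`map_task`, `reduce_task`, `run_job`) and the word-count job are illustrative assumptions chosen to make the flow of fragments, intermediate pairs, and partitions concrete.

```python
# A minimal sketch of the MapReduce steps above, using word count
# as the example job. All names here are illustrative, not Hadoop APIs.
from itertools import groupby
from operator import itemgetter

def map_task(fragment):
    # Step 3: each map task emits intermediate (word, 1) pairs
    # for every word in its assigned fragment.
    return [(word, 1) for line in fragment for word in line.split()]

def reduce_task(key, values):
    # Step 5: a reduce task combines all values for one key
    # into an output key-value pair.
    return (key, sum(values))

def run_job(input_lines, num_map_tasks=2, num_reduce_tasks=2):
    # Step 1: split the input into fragments, one per map task.
    fragments = [input_lines[i::num_map_tasks] for i in range(num_map_tasks)]
    # Steps 2-3: run each map task over its fragment (serially here;
    # on a real cluster these run in parallel across nodes).
    intermediate = [pair for frag in fragments for pair in map_task(frag)]
    # Step 4: sort the intermediate data by key, then partition it into
    # as many fragments as there are reduce tasks.
    intermediate.sort(key=itemgetter(0))
    partitions = [[] for _ in range(num_reduce_tasks)]
    for key, group in groupby(intermediate, key=itemgetter(0)):
        partitions[hash(key) % num_reduce_tasks].extend(group)
    # Steps 5-6: each reduce task processes its partition; a real job
    # would write these results to HDFS instead of returning a dict.
    output = {}
    for part in partitions:
        for key, group in groupby(part, key=itemgetter(0)):
            k, v = reduce_task(key, [count for _, count in group])
            output[k] = v
    return output

print(run_job(["the quick brown fox", "the lazy dog"]))
```

Because the intermediate data is sorted before partitioning, every occurrence of a given key lands in the same partition, so each reduce task sees all values for its keys; this is the property the real shuffle-and-sort phase guarantees.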

The Hadoop MapReduce framework in earlier (pre-version 2) Hadoop releases has a single master service called a JobTracker and several slave services called TaskTrackers, one per node in the cluster.

When you submit a MapReduce job to the JobTracker, the job is placed into a queue and then runs according to the scheduling rules defined by an administrator. As you might expect, the JobTracker manages the assignment of map and reduce tasks to the TaskTrackers.

With Hadoop 2, a new resource management system is in place called YARN (short for Yet Another Resource Negotiator). YARN provides generic scheduling and resource management services so that you can run more than just MapReduce applications on your Hadoop cluster. The JobTracker/TaskTracker architecture could only run MapReduce.

HDFS also has a master/slave architecture:

  • Master service: Called a NameNode, it controls access to data files.

  • Slave services: Called DataNodes, they’re distributed one per node in the cluster. DataNodes manage the storage that’s associated with the nodes on which they run, serving client read and write requests, among other tasks.