The Shuffle Phase of Hadoop’s MapReduce Application Flow
Between the end of the Map phase and the beginning of the Reduce phase is a handoff process, known as shuffle and sort. Here, data from the mapper tasks is prepared and moved to the nodes where the reducer tasks will be run. When a mapper task is complete, its results are sorted by key, partitioned if there are multiple reducers, and then written to disk.
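The sort-and-partition step for a single mapper's output can be sketched as follows. This is a minimal illustration, not Hadoop's actual code: the function names are made up for this example, and Python's built-in `hash` stands in for the hashing done by Hadoop's default HashPartitioner.

```python
def partition(key, num_reducers):
    # Stand-in for Hadoop's default HashPartitioner: a given key always
    # lands in the same partition, so all values for that key reach the
    # same reducer task.
    return hash(key) % num_reducers

def prepare_map_output(records, num_reducers):
    # records: the (key, value) pairs emitted by one mapper task.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in records:
        partitions[partition(key, num_reducers)].append((key, value))
    # Each partition is sorted by key before being written to local disk.
    return [sorted(part, key=lambda kv: kv[0]) for part in partitions]

partitions = prepare_map_output(
    [("banana", 1), ("apple", 1), ("banana", 1)], num_reducers=2)
```

With two reducers, every `("banana", ...)` pair ends up in the same partition, and each partition comes out sorted by key.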
You can see this concept in the following figure, which shows the MapReduce data processing flow and its interaction with the physical components of the Hadoop cluster. (One quick note: Data in memory is represented by white squares, and data stored to disk is represented by gray squares.)
To speed up the overall MapReduce process, data is moved to the reducer tasks' nodes as soon as it is available, which avoids a flood of network activity when the final mapper task finishes its work. This transfer happens while the mapper tasks are still running, with the copied output held in the memory of the waiting reducer tasks. (You can configure whether this early copying happens, as well as the number of threads involved.)
Keep in mind that even though a reducer task might already hold most of the mapper output, it cannot begin processing until every mapper task has finished.
To avoid scenarios where the performance of a MapReduce job is hampered by one straggling mapper task that’s running on a poorly performing slave node, the MapReduce framework uses a concept called speculative execution.
If some mapper tasks are running slower than what's considered reasonable, the Application Master spawns duplicate tasks (in Hadoop 1, the JobTracker does this). Whichever task finishes first, the duplicate or the original, has its results stored to disk, and the other task is killed. If you're monitoring your jobs closely and are wondering why there are more mapper tasks running than you expect, this is a likely reason.
The output from mapper tasks isn’t written to HDFS, but rather to local disk on the slave node where the mapper task was run. As such, it’s not replicated across the Hadoop cluster.
Aside from compressing the output, you can potentially boost performance by running a combiner task. This simple tactic, shown here, involves performing a local reduce of the output for individual mapper tasks.
In the majority of cases, no extra programming is needed, because you can tell the framework to use the reducer function as the combiner. If you write a separate combiner function instead, you need to ensure that its output takes the same form as the reducer function's output, since the reducer must be able to accept it as input.
It's up to the MapReduce framework whether the combiner function is run once, multiple times, or never, so it's critical that the combiner's logic leaves the final results unaffected no matter how many times it runs. Running the combiner can yield a performance benefit by lessening the amount of intermediate data that would otherwise need to be transferred over the network, which in turn lowers the amount of processing the reducer tasks need to do. Bear in mind, though, that the combiner is an extra task: the performance gain may be negligible, and running it can even make the overall job slower. Your mileage may vary, so test this carefully.
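A combiner for the classic word-count job can be sketched like this. The function name is illustrative rather than a Hadoop API; the point is that summing counts is associative and commutative, so the framework can apply it zero, one, or many times without changing the final totals.

```python
from collections import defaultdict

def combine(pairs):
    # Local reduce: collapse the (word, count) pairs emitted by one
    # mapper task before they are sent over the network.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

mapper_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
once = combine(mapper_output)            # [("the", 3), ("cat", 1)]
twice = combine(combine(mapper_output))  # same totals: re-running is safe
```

Four intermediate pairs shrink to two before transfer, and applying the combiner a second time changes nothing, which is exactly the property the framework relies on.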
After all the mapper tasks' results are copied to the reducer tasks' nodes, these files are merged and sorted into a single key-ordered stream for each reducer.
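Because each mapper's output arrives already sorted by key, this final step is a merge of sorted runs rather than a full re-sort. A minimal sketch, with `heapq.merge` standing in for Hadoop's on-disk merge sort:

```python
import heapq

# Each mapper delivers a run of (key, value) pairs already sorted by key.
run_from_mapper_1 = [("apple", 2), ("cherry", 1)]
run_from_mapper_2 = [("apple", 1), ("banana", 3)]

# Merge the runs into one key-ordered stream; only one record per run
# needs to be held in memory at a time.
merged = list(heapq.merge(run_from_mapper_1, run_from_mapper_2,
                          key=lambda kv: kv[0]))
```

After the merge, all values for a given key sit next to each other, which is what lets the reducer process one key's complete group of values at a time.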