The Reduce Phase of Hadoop’s MapReduce Application Flow
The Reduce phase processes the keys and their individual lists of values, and what's normally returned to the client application is a set of key/value pairs. Here's the blow-by-blow so far: A large data set has been broken down into smaller pieces, called input splits, and individual instances of mapper tasks have processed each one of them.
In some cases, this single phase of processing is all that’s needed to generate the desired application output. For example, if you’re running a basic transformation operation on the data — converting all text to uppercase, for example, or extracting key frames from video files — the lone phase is all you need. (This is known as a map-only job, by the way.)
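As a framework-free sketch of the map-only idea (plain Java rather than the Hadoop APIs; the class and method names here are illustrative), each record is transformed independently, with no grouping or reduce step afterward. In a real Hadoop job you would request this behavior by setting the number of reduce tasks to zero.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of a map-only transformation: every record maps
// independently to one output record, so no shuffle or reducer is needed.
// (In Hadoop itself, a map-only job is configured with
// job.setNumReduceTasks(0).)
public class UppercaseMapOnly {

    // Simulates a mapper applied one by one to the records of an input split.
    public static List<String> map(List<String> records) {
        return records.stream()
                      .map(String::toUpperCase)
                      .collect(Collectors.toList());
    }
}
```

Because each record is processed in isolation, the mapper's output here is already the final answer, which is exactly what makes the reduce phase unnecessary.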
But in many other cases, the job is only half-done when the mapper tasks have written their output. The remaining task is boiling down all interim results to a single, unified answer.
Similar to the mapper task, which processes each record one by one, the reducer processes each key individually. Normally, the reducer returns a single key/value pair for every key it processes. However, these key/value pairs can be as large or as small as you need them to be.
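To make this concrete, here is a minimal sketch in plain Java (no Hadoop APIs; the class name and the choice of summing the values are illustrative assumptions). After the shuffle, each key arrives at a reducer with its complete list of values, and the reducer boils that list down to one output pair per key:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the reduce step: the input is each key paired
// with its full list of values (as delivered by the shuffle), and the
// output is one key/value pair per key -- here, the sum of the values.
public class SumReducer {

    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> results = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            results.put(entry.getKey(), sum); // one output pair per key
        }
        return results;
    }
}
```

In Hadoop proper, the same pattern appears as a `Reducer` subclass whose `reduce()` method is called once per key with an iterable of that key's values.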
When the reducer tasks are finished, each of them returns a results file and stores it in HDFS (Hadoop Distributed File System). HDFS then automatically replicates these result files.
Whereas the Resource Manager (or JobTracker, if you're using Hadoop 1) tries its best to assign resources to mapper tasks so that input splits are processed locally, there is no such strategy for reducer tasks. It is assumed that mapper task result sets need to be transferred over the network to be processed by the reducer tasks.
This is a reasonable implementation because, with hundreds or even thousands of mapper tasks, there would be no practical way for reducer tasks to have the same locality prioritization.