How to Write MapReduce Applications

By Dirk deRoos

The MapReduce API is written in Java, so MapReduce applications are primarily Java-based. The following list specifies the components of a MapReduce application that you can develop:

  • Driver (mandatory): This is the application shell that’s invoked from the client. It configures the MapReduce Job class (which you do not customize) and submits it to the Resource Manager (or JobTracker if you’re using Hadoop 1).

  • Mapper class (mandatory): The Mapper class you implement needs to define the formats of the key/value pairs you input and output as you process each record. This class has only a single method to implement, named map, which is where you code how each record will be processed and which key/value pair to output. To output key/value pairs from the mapper task, write them to an instance of the Context class.

  • Reducer class (optional): The reducer is optional for map-only applications, where the Reduce phase isn’t needed.

  • Combiner class (optional): A combiner can often be defined as a reducer, but in some cases it needs to be different. (Remember, for example, that a reducer may not be able to run multiple times on a data set without mutating the results.)

  • Partitioner class (optional): Customize the default partitioner to perform special tasks, such as a secondary sort on the values for each key, or to handle rare cases involving sparse data and imbalanced output files from the mapper tasks.

  • RecordReader and RecordWriter classes (optional): Hadoop has some standard data formats (for example, text files, sequence files, and databases), which are useful for many cases. For specially formatted data, implementing your own classes for reading and writing data can greatly simplify your mapper and reducer code.
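To make the division of labor among these classes concrete, here is a plain-Java sketch of the flow, using the classic word-count example: the mapper emits a key/value pair per word, a stand-in for the framework groups the pairs by key (the shuffle), and the reducer sums each group. This is illustrative only; a real application would extend Hadoop’s Mapper and Reducer classes and emit pairs through a Context object rather than return them from methods.

```java
import java.util.*;

// Plain-Java sketch of the MapReduce flow (no Hadoop dependencies):
// map each record to key/value pairs, group by key, then reduce.
public class WordCountSketch {

    // Mapper logic: one input record (a line) -> (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reducer logic: one key plus all its values -> one aggregated value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Stand-in for the framework: run the mapper over every record,
    // group the emitted pairs by key ("shuffle"), then run the reducer
    // once per key.
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> pair : map(record)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> results = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            results.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            run(Arrays.asList("the quick brown fox", "the lazy dog"));
        System.out.println(counts);
        // prints {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

Note that reduce here could also serve as a combiner, because summing partial sums gives the same answer as summing everything at once; an averaging reducer, by contrast, could not be reused that way.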

From within the driver, you can use the MapReduce API, which includes factory methods to create instances of all components in the preceding list. (In case you’re not a Java person, a factory method is a tool for creating objects.)
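In Hadoop 2, for example, Job.getInstance(Configuration) is the factory method a driver calls to create a configured Job object. The pattern itself can be sketched without Hadoop; the Job class below is a simplified stand-in for illustration, not the real org.apache.hadoop.mapreduce.Job:

```java
// Minimal sketch of the factory-method pattern the MapReduce API uses.
// "Job" here is a simplified stand-in, not Hadoop's actual Job class.
public class FactoryMethodSketch {

    static class Job {
        private final String name;

        // The constructor is private: callers must go through the
        // factory method instead of using "new Job(...)" directly.
        private Job(String name) { this.name = name; }

        // Factory method: a static method that creates and returns the
        // object, mirroring the shape of Hadoop's Job.getInstance(...).
        static Job getInstance(String name) {
            return new Job(name);
        }

        String getJobName() { return name; }
    }

    public static void main(String[] args) {
        Job job = Job.getInstance("word count");
        System.out.println(job.getJobName());  // prints word count
    }
}
```

Keeping the constructor private lets the factory method control how instances are built, which is why APIs like this expose getInstance rather than a public constructor.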

A generic API named Hadoop Streaming lets you use other programming languages (most commonly, C, Python, and Perl). Though this API enables organizations with non-Java skills to write MapReduce code, using it has some disadvantages.

Because of the additional abstraction layers that this streaming code needs to go through in order to function, there’s a performance penalty and increased memory usage. Also, with Hadoop Streaming you can code only the mapper and reducer functions; record readers and writers, as well as any partitioners, still need to be written in Java.
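Under Hadoop Streaming, the mapper and reducer are ordinary executables that read records as lines on standard input and write tab-separated key/value pairs to standard output. The line-handling half of that contract can be sketched in Java (a real streaming mapper would more typically be a Python or Perl script, and would read System.in directly; this version works on a list of lines so the logic is easy to follow):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a Hadoop Streaming mapper's contract: each input line is a
// record, and each output line is "key<TAB>value". A real streaming
// mapper would read from standard input and print to standard output;
// this version operates on a list of lines for clarity.
public class StreamingMapperSketch {

    static List<String> mapLines(List<String> inputLines) {
        List<String> output = new ArrayList<>();
        for (String line : inputLines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    output.add(word + "\t1");  // one key/value pair per line
                }
            }
        }
        return output;
    }

    public static void main(String[] args) {
        for (String out : mapLines(List.of("to be or not to be"))) {
            System.out.println(out);
        }
    }
}
```

Because the entire interface is lines of text on stdin and stdout, any record structure richer than delimited text has to be flattened into strings first, which is the root of the limitation described next.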

A direct consequence — and additional disadvantage — of being unable to customize record readers and writers is that Hadoop Streaming applications are well suited to handle only text-based data.