The Importance of MapReduce in Hadoop

By Dirk deRoos

For most of Hadoop’s history, MapReduce has been the only game in town when it comes to data processing. The availability of MapReduce has been the reason for Hadoop’s success and at the same time a major factor in limiting further adoption.

MapReduce enables skilled programmers to write distributed applications without having to worry about the underlying distributed computing infrastructure. This is a very big deal: Hadoop and the MapReduce framework handle all sorts of complexity that application developers don’t need to handle.

For example, you can transparently scale out the cluster by adding nodes, and both the data storage and data processing subsystems fail over automatically, all with zero impact on applications.

The other side of the coin here is that although MapReduce hides a tremendous amount of complexity, you can’t afford to forget what it is: an interface for parallel programming. This is an advanced skill — and a barrier to wider adoption. There simply aren’t yet many MapReduce programmers, and not everyone has the skill to master it.
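To make that interface concrete, here is a minimal single-process sketch of the programming model, using the classic word count problem. It is not Hadoop code: the names map_words, reduce_counts, and run_job are made up for illustration, and run_job merely stands in for the work that the real framework distributes across a cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all the counts collected for one word."""
    return word, sum(counts)


def run_job(records, mapper, reducer):
    """Hypothetical stand-in for the framework: map, group by key, reduce.

    A real cluster runs the map and reduce calls on many nodes in parallel;
    this sketch runs them one after another in a single process.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": gather values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


lines = ["the quick brown fox", "the lazy dog", "The fox"]
word_counts = run_job(lines, map_words, reduce_counts)
print(word_counts)
```

The programmer writes only the two small functions at the top. The hard part is recasting a problem as independent map calls followed by a per-key reduce, which is exactly the parallel programming skill described above.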

In Hadoop’s early days (Hadoop 1 and before), you could only run MapReduce applications on your clusters. In Hadoop 2, the YARN component changed all that by taking over resource management and scheduling from the MapReduce framework and by providing a generic interface that lets other kinds of applications run on a Hadoop cluster.

In short, this means MapReduce is now just one of many application frameworks you can use to develop and run applications on Hadoop. Though it’s certainly possible to run applications using other frameworks on Hadoop, that doesn’t mean we can forget about MapReduce.

At the time of writing, MapReduce is the only production-ready data processing framework available for Hadoop. Though other frameworks will eventually become available, MapReduce has almost a decade of maturity under its belt (with almost 4,000 JIRA issues completed, involving hundreds of developers, if you’re keeping track).

There’s no dispute: MapReduce is Hadoop’s most mature framework for data processing. In addition, a significant amount of MapReduce code is now in use that’s unlikely to go anywhere soon. Long story short: MapReduce is an important part of the Hadoop story.

The Apache Hive and Apache Pig projects are highly popular because they’re easier entry points for data processing on Hadoop. For many problems, especially the kinds that you can solve with SQL, Hive and Pig are excellent tools. But for a wider-ranging task such as statistical processing or text extraction, and especially for processing unstructured data, you need to use MapReduce.
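As a rough illustration of the text extraction case, the sketch below pulls a severity and an error code out of free-form log lines in the map step and counts each pair in the reduce step. Everything here is an assumption made for the example: the log format, the regular expression, and the names map_extract, reduce_sum, and run_job. It runs in a single process and does not use any Hadoop API.

```python
import re
from collections import defaultdict

# Assumed pattern: a severity word followed later by a code such as E1001.
ERROR_PATTERN = re.compile(r"\b(ERROR|WARN)\b.*?\b([A-Z]\d{4})\b")


def map_extract(line):
    """Map step: emit ((severity, code), 1) when a line contains both."""
    match = ERROR_PATTERN.search(line)
    if match:
        severity, code = match.groups()
        yield (severity, code), 1
    # Lines that do not match emit nothing, so they simply drop out.


def reduce_sum(key, values):
    """Reduce step: count the occurrences of one (severity, code) pair."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Hypothetical single-process stand-in for the framework."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


raw_lines = [
    "2014-03-01 12:00:01 ERROR disk failure on node7 code E1001",
    "user logged in from 10.0.0.5",
    "2014-03-01 12:05:44 WARN slow response (code W2040) from node3",
    "something odd happened... ERROR again, see E1001",
]
error_counts = run_job(raw_lines, map_extract, reduce_sum)
print(error_counts)
```

The input has no fixed columns or schema, so a SQL-style query has little to work with. The map function can apply whatever parsing logic the text calls for before any aggregation happens.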