Running Applications Before Hadoop 2 - dummies

By Dirk deRoos

Because many existing Hadoop deployments aren’t yet using Yet Another Resource Negotiator (YARN), take a quick look at how Hadoop managed its data processing before the days of Hadoop 2. Concentrate on the role that the JobTracker master daemon and the TaskTracker slave daemons played in handling MapReduce processing.

The whole point of employing distributed systems is to be able to deploy computing resources in a network of self-contained computers in a manner that’s fault-tolerant, easy, and inexpensive.

In a distributed system such as Hadoop, where you have a cluster of self-contained compute nodes all working in parallel, a great deal of complexity goes into ensuring that all the pieces work together. For this reason, these systems typically have distinct layers, each handling a different task in support of parallel data processing.

This concept, known as the separation of concerns, ensures that if you are, for example, the application programmer, you don’t need to worry about the specific details of, say, the failover of map tasks. In Hadoop 1, the system consists of these four distinct layers, as shown:

  • Distributed storage: The Hadoop Distributed File System (HDFS) is the storage layer where the data, interim results, and final result sets are stored.

  • Resource management: In addition to disk space, all slave nodes in the Hadoop cluster have CPU cycles, RAM, and network bandwidth. A system such as Hadoop needs to be able to parcel out these resources so that multiple applications and users can share the cluster in predictable and tunable ways. This job is done by the JobTracker daemon.

  • Processing framework: The MapReduce process flow defines the execution of all applications in Hadoop 1. The flow begins with the map phase; continues with aggregation through the shuffle, sort, and merge steps; and ends with the reduce phase. In Hadoop 1, this flow is also managed by the JobTracker daemon, with local execution managed by the TaskTracker daemons running on the slave nodes.

  • Application Programming Interface (API): Applications developed for Hadoop 1 needed to be coded using the MapReduce API. In Hadoop 1, the Hive and Pig projects provide programmers with easier interfaces for writing Hadoop applications, and under the hood, their code compiles down to MapReduce.

    [Figure: The four layers of Hadoop 1]
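To make the division of labor between the JobTracker and the TaskTrackers a little more concrete, here is a minimal sketch in plain Python. It is a toy model, not Hadoop code: the class names mirror the daemons, but every method name (submit, heartbeat, tracker_lost) and the single map_slots setting are assumptions made up for illustration. It shows only the general idea that one master hands tasks to slave nodes with free slots and requeues the work of a node that fails.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TaskTracker:
    """Toy stand-in for a TaskTracker daemon on one slave node."""
    name: str
    map_slots: int                          # hypothetical fixed slot count
    running: list = field(default_factory=list)

    def free_slots(self) -> int:
        return self.map_slots - len(self.running)


class JobTracker:
    """Toy stand-in for the single JobTracker master daemon."""

    def __init__(self, trackers):
        self.trackers = {t.name: t for t in trackers}
        self.pending = deque()              # tasks still waiting for a slot

    def submit(self, tasks):
        """Queue the tasks of a newly submitted job."""
        self.pending.extend(tasks)

    def heartbeat(self, tracker_name):
        """Fill the reporting tracker's free slots with pending tasks."""
        tracker = self.trackers[tracker_name]
        assigned = []
        while self.pending and tracker.free_slots() > 0:
            task = self.pending.popleft()
            tracker.running.append(task)
            assigned.append(task)
        return assigned

    def tracker_lost(self, tracker_name):
        """Requeue the tasks of a failed tracker so they can run elsewhere."""
        tracker = self.trackers.pop(tracker_name)
        self.pending.extend(tracker.running)
        tracker.running.clear()


if __name__ == "__main__":
    jt = JobTracker([TaskTracker("node1", 2), TaskTracker("node2", 2)])
    jt.submit(["map-0", "map-1", "map-2"])
    print(jt.heartbeat("node1"))            # node1 fills both of its slots
    print(jt.heartbeat("node2"))            # node2 takes the remaining task
    jt.tracker_lost("node1")                # node1 fails; its tasks are requeued
    print(jt.heartbeat("node2"))            # node2 picks up what fits
```

Notice that the application programmer never appears in this sketch: scheduling and failover stay inside the resource management layer, which is the separation of concerns described earlier.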

In the world of Hadoop 1 (which was the only world you had until quite recently), all data processing revolved around MapReduce.
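If the MapReduce flow is new to you, the following sketch walks through its phases in a single Python process, using the classic word-count example. It does not use the Hadoop MapReduce API, and the function names (map_phase, shuffle_and_sort, reduce_phase, run_job) are made up for illustration. On a real cluster, the map and reduce work would be spread across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group the values by key, then order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in order over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    for word, count in run_job(["Hadoop runs MapReduce", "MapReduce runs jobs"]):
        print(word, count)
```

The point of the sketch is the shape of the flow: the map step works on each line independently, the shuffle and sort step brings together everything that shares a key, and the reduce step sees one key at a time.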