Tracking JobTracker and TaskTracker in Hadoop 1 - dummies

Tracking JobTracker and TaskTracker in Hadoop 1

By Dirk deRoos

MapReduce processing in Hadoop 1 is handled by the JobTracker and TaskTracker daemons. The JobTracker maintains a view of all available processing resources in the Hadoop cluster and, as application requests come in, it schedules and deploys them to the TaskTracker nodes for execution.

As applications are running, the JobTracker receives status updates from the TaskTracker nodes to track their progress and, if necessary, coordinate the handling of any failures. The JobTracker needs to run on a master node in the Hadoop cluster as it coordinates the execution of all MapReduce applications in the cluster, so it’s a mission-critical service.

An instance of the TaskTracker daemon runs on every slave node in the Hadoop cluster, which means that each slave node has a service that ties it to the processing (TaskTracker) and the storage (DataNode), which enables Hadoop to be a distributed system.

As a slave process, the TaskTracker receives processing requests from the JobTracker. Its primary responsibility is to track the execution of MapReduce workloads happening locally on its slave node and to send status updates to the JobTracker.

TaskTrackers manage the processing resources on each slave node in the form of processing slots — the slots defined for map tasks and reduce tasks, to be exact. The total number of map and reduce slots indicates how many map and reduce tasks can be executed at one time on the slave node.

When it comes to tuning a Hadoop cluster, setting the optimal number of map and reduce slots is critical. The number of slots needs to be carefully configured based on available memory, disk, and CPU resources on each slave node. Memory is the most critical of these three resources from a performance perspective. As such, the total number of task slots needs to be balanced with the maximum amount of memory allocated to the Java heap size.

Keep in mind that every map and reduce task spawns its own Java virtual machine (JVM) and that the heap represents the amount of memory that’s allocated for each JVM. The ratio of map slots to reduce slots is also an important consideration.

For example, if you have too many map slots and not enough reduce slots for your workloads, map slots will tend to sit idle, while your jobs are waiting for reduce slots to become available.

Distinct sets of slots are defined for map tasks and reduce tasks because they use computing resources quite differently. Map tasks are assigned based on data locality, and they depend heavily on disk I/O and CPU. Reduce tasks are assigned based on availability, not on locality, and they depend heavily on network bandwidth because they need to receive output from map tasks.