
Slave Nodes in Hadoop Clusters

By Dirk deRoos

In a Hadoop universe, slave nodes are where Hadoop data is stored and where data processing takes place. The following services enable slave nodes to store and process data:

  • NodeManager: Coordinates the resources for an individual slave node and reports back to the Resource Manager.

  • ApplicationMaster: Tracks the progress of all the tasks running on the Hadoop cluster for a specific application. For each client application, the Resource Manager deploys an instance of the ApplicationMaster service in a container on a slave node. (Remember that any node running the NodeManager service is visible to the Resource Manager.)

  • Container: A collection of all the resources needed to run individual tasks for an application. When an application is running on the cluster, the Resource Manager schedules the tasks for the application to run as container services on the cluster’s slave nodes.

  • TaskTracker: Manages the individual map and reduce tasks executing on a slave node for Hadoop 1 clusters. In Hadoop 2, this service is obsolete and has been replaced by YARN services.

  • DataNode: An HDFS service that stores and retrieves data blocks on the slave node, under the direction of the NameNode (which itself keeps only the metadata about where each block lives).

  • RegionServer: Stores data for the HBase system. In Hadoop 2, HBase uses Hoya, which enables RegionServer instances to be run in containers.

[Figure: Services running on Hadoop 2 slave nodes]

Here, each slave node is always running a DataNode instance (which enables HDFS to store and retrieve data blocks on the slave node) and a NodeManager instance (which enables the Resource Manager to assign application tasks to the slave node for processing). The container processes are individual tasks for applications that are running on the cluster.
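To make the DataNode's role concrete, here is a minimal sketch of a Java HDFS client, assuming the Hadoop 2 client libraries are on the classpath and core-site.xml points at a running cluster; the file path is purely illustrative. The client asks the NameNode where to place each block and then streams the data to DataNodes on the slave nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockWriteSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath,
        // so the client talks to the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The NameNode decides which slave-node DataNodes hold each block;
        // the client streams the data directly to those DataNodes.
        Path file = new Path("/user/demo/sample.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Data written here ends up as HDFS blocks on slave nodes.");
        }

        // The configured block size (dfs.blocksize in Hadoop 2) governs how
        // the file is split across DataNodes.
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));
    }
}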

Each running application has a dedicated ApplicationMaster instance, which also runs in a container and tracks the execution of all the application's tasks on the cluster until the application is finished.
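The following sketch shows roughly how a client kicks off that process with the standard YARN client API: it asks the Resource Manager for a new application and describes the container that should host the ApplicationMaster. The application name and ApplicationMaster class here are hypothetical, and the snippet assumes the Hadoop 2 YARN client libraries and a reachable Resource Manager.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitAppSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml
        yarnClient.start();

        // Ask the Resource Manager for a new application id and context.
        ApplicationSubmissionContext appContext =
                yarnClient.createApplication().getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app");  // hypothetical name

        // Describe the container that will host the ApplicationMaster.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "java -Xmx256m com.example.DemoAppMaster"));  // hypothetical AM class
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(512, 1)); // 512 MB, 1 vcore for the AM

        // The Resource Manager picks a slave node running a NodeManager and
        // launches the ApplicationMaster there, inside a container.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
    }
}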

With HBase on Hadoop 2, the container model is still followed, as you can see:

[Figure: HBase services running in containers on Hadoop 2 slave nodes]

HBase on Hadoop 2 is initiated by the Hoya Application Master, which requests containers for the HMaster services. (You need multiple HMaster services for redundancy.) The Hoya Application Master also requests resources for RegionServers, which likewise run in special containers.
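This is not Hoya's actual source code, but the sketch below uses the standard YARN AMRMClient API to illustrate the kind of requests an ApplicationMaster like Hoya's makes: a couple of containers for HMaster-style processes and a larger set for RegionServer-style processes. The memory and core figures are made up, and the calls only genuinely work inside an ApplicationMaster container that the Resource Manager has launched.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RoleRequestSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register this ApplicationMaster with the Resource Manager.
        rmClient.registerApplicationMaster("", 0, "");

        // Two containers for HMaster-style processes (redundancy) ...
        for (int i = 0; i < 2; i++) {
            rmClient.addContainerRequest(new ContainerRequest(
                    Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));
        }
        // ... and four larger containers for RegionServer-style processes.
        for (int i = 0; i < 4; i++) {
            rmClient.addContainerRequest(new ContainerRequest(
                    Resource.newInstance(4096, 2), null, null, Priority.newInstance(1)));
        }
        // Allocated containers arrive asynchronously via allocate() heartbeats;
        // the ApplicationMaster then asks the NodeManagers to launch the processes.
    }
}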

The following figure shows the services deployed on Hadoop 1 slave nodes.

[Figure: Services deployed on Hadoop 1 slave nodes]

For Hadoop 1, each slave node is always running a DataNode instance (which enables HDFS to store and retrieve data blocks on the slave node) and a TaskTracker instance (which enables the JobTracker to assign map and reduce tasks to the slave node for processing).

Slave nodes have a fixed number of map slots and reduce slots for the execution of map and reduce tasks respectively. If your cluster is running HBase, a number of your slave nodes will need to run a RegionServer service. The more data you store in HBase, the more RegionServer instances you’ll need.
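If you want to see how many slots a Hadoop 1 slave node is configured for, a small sketch like the following reads the TaskTracker settings from mapred-site.xml, assuming that file is on the classpath; the stock default for both settings is 2.

import org.apache.hadoop.conf.Configuration;

public class SlotConfigSketch {
    public static void main(String[] args) {
        // mapred-site.xml must be on the classpath for these values to be picked up.
        Configuration conf = new Configuration();
        conf.addResource("mapred-site.xml");

        // Hadoop 1 TaskTracker slot settings; the default of 2 applies if unset.
        int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);

        System.out.println("Map slots per slave node:    " + mapSlots);
        System.out.println("Reduce slots per slave node: " + reduceSlots);
    }
}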

The hardware criteria for slave nodes are rather different from those for master nodes; in fact, they don't match the traditional hardware reference architectures for data servers. Much of the buzz surrounding Hadoop comes from its use of commodity hardware, but keep in mind that commodity hardware does not mean consumer-grade hardware.

Hadoop slave nodes still require enterprise-grade hardware, but at the lower end of the cost spectrum, especially for storage.