Edge Nodes in Hadoop Clusters
Edge nodes are the interface between the Hadoop cluster and the outside network. For this reason, they’re sometimes referred to as gateway nodes. Most commonly, edge nodes are used to run client applications and cluster administration tools.
They’re also often used as staging areas for data being transferred into the Hadoop cluster. As such, Oozie, Pig, Sqoop, and management tools such as Hue and Ambari run well there. The figure shows the processes you can run on Edge nodes.
Edge nodes are often overlooked in Hadoop hardware architecture discussions. This situation is unfortunate because edge nodes serve an important purpose in a Hadoop cluster, and they have hardware requirements that are different from master nodes and slave nodes.
In general, it’s a good idea to minimize deployments of administration tools on master nodes and slave nodes to ensure that critical Hadoop services like the NameNode have as little competition for resources as possible.
You should avoid placing a data transfer utility like Sqoop on anything but an edge node, as the high data transfer volumes could risk the ability of Hadoop services on the same node to communicate. The messages Hadoop services exchange are their lifeblood, so high latency means the whole node could be cut off from the cluster.
The figure shows two edge nodes, but for many Hadoop clusters a single edge node would suffice. Additional edge nodes are most commonly needed when the volume of data being transferred in or out of the cluster is too much for a single server to handle.
For edge nodes in a Hadoop cluster, use enterprise class storage. For edge nodes focused on administration tools and running client applications, use four 900GB SAS drives, along with a RAID HDD controller configured for RAID 1+0.
Edge nodes oriented to ingesting data obviously need much more storage space, so you can add drives to the edge node. In this case, use LFF SAS drives because much higher capacities are available, as compared to smaller form-factor SAS drives.
A general-purpose edge node would be well served by a processor configuration similar to one used for slave nodes — specifically, a dual-socket server with Ivy Bridge processors clocked at between 2 and 2.5GHz.
For most workloads on edge nodes, 48GB of RAM is sufficient.
To enable communication between the outside network and the Hadoop cluster, edge nodes need to be multi-homed into the private subnet of the Hadoop cluster as well as into the corporate network.
A multi-homed computer is one that has dedicated connections to multiple networks. This is a practical illustration of why edge nodes are perfectly suited for interaction with the world outside the Hadoop cluster. Keeping your Hadoop cluster in its own private subnet is an excellent practice, so these edge nodes serve as a controlled window inside the cluster.
For edge nodes that serve the purpose of running client applications or administration tools, two pairs of bonded 1GbE network connections are recommended: one pair to connect to the Hadoop cluster and another pair for the outside network.
Edge nodes oriented to handling high inbound and outbound data transfer rates will need two (or more) pairs of bonded 10GbE network connectors: one pair to connect to the Hadoop cluster and another pair for the outside network or specific data ingest sources.