Alternate Deployment Form Factors for Hadoop

By Dirk deRoos

Hadoop works best when it's installed on physical computers, where the processing has direct access to dedicated storage and networking. But alternative deployment form factors exist, and although they're less efficient than dedicated hardware, in certain cases they're worthwhile options.

Virtualized servers

A major trend in IT centers over the past decade is virtualization, where a large server hosts several "virtual machines" that look and act like individual machines. In place of dedicated hardware, an organization's entire set of applications and repositories is deployed on virtualized hardware.

This approach has many advantages: Centralizing IT simplifies maintenance, fewer CPU cycles go unused so hardware investments work harder, and the overall hardware footprint is smaller, all of which results in a lower total cost of ownership.

Organizations whose IT deployments are entirely virtualized sometimes mandate that every new application follow this model. Hadoop can be deployed this way, essentially as a virtual cluster with virtual master nodes and virtual slave nodes, but performance suffers, partly because storage in most virtualized environments is SAN-based rather than locally attached.

Hadoop is designed to work best when every available CPU core has fast access to independently spinning disks, so a bottleneck forms when all the map and reduce tasks start pulling their data across the limited network links between the CPUs and the SAN. And because virtual servers share the underlying physical resources with one another, isolation between them is limited, so Hadoop workloads can also be affected by neighboring activity.
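To gauge how much data locality your jobs actually achieve, you can inspect the standard MapReduce job counters. Here's a minimal sketch against the Hadoop 2.x Java API (the job ID argument is a placeholder for one of your own completed jobs):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;
import org.apache.hadoop.mapreduce.JobID;

// Prints how many map tasks read their input from a local disk versus
// over the network, using the built-in MapReduce job counters.
public class LocalityCheck {
    public static void main(String[] args) throws Exception {
        Cluster cluster = new Cluster(new Configuration());
        // Pass a completed job's ID, for example: job_1400000000000_0001
        Job job = cluster.getJob(JobID.forName(args[0]));
        if (job == null) {
            System.err.println("Job not found: " + args[0]);
            return;
        }
        Counters counters = job.getCounters();

        long total     = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
        long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
        long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();

        System.out.printf("Maps: %d total, %d data-local, %d rack-local%n",
                total, dataLocal, rackLocal);
    }
}

Keep in mind that these counters report locality as the scheduler sees it: on a SAN-backed virtual cluster, even a "data-local" map task still pulls its bytes across the storage network.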

When your virtual server’s performance is affected by another server’s workload, that’s actually known in IT circles as a “noisy neighbor” problem!

Virtualized environments can be quite useful, though, in some cases. For example, if your organization needs to perform a one-time exploratory analysis of a large data set, you can easily create a temporary cluster in your existing virtualized environment. Getting that approved is often faster than enduring the bureaucratic hassle of procuring new dedicated hardware.

As you experiment with Hadoop, you'll often run it on your laptop inside a virtual machine (VM). Hadoop is extremely slow in this kind of environment, but with small data sets it's a valuable learning and testing tool.
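In fact, for quick experiments you don't even need a VM: Hadoop ships with a local job runner that executes an entire job inside a single JVM against the local filesystem. The following sketch shows a classic word count pinned explicitly to that mode (the class name and the assumption that input and output paths arrive as command-line arguments are mine):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A classic word count configured for Hadoop's local job runner:
// no daemons, no HDFS — everything executes in one JVM on local files.
public class LaptopWordCount {

    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");          // local filesystem, not HDFS
        conf.set("mapreduce.framework.name", "local"); // single-JVM local job runner
        Job job = Job.getInstance(conf, "laptop word count");
        job.setJarByClass(LaptopWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Drop the two conf.set() lines and the same driver runs unchanged against a real cluster, which makes local mode a fast edit-compile-test loop before you submit for real.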

Cloud deployments

Cloud computing providers such as Amazon, Rackspace, and IBM SoftLayer offer a variation on the virtualized environment, and most major public cloud providers now have MapReduce or Hadoop offerings available for use. Again, performance is inferior to a cluster deployed on dedicated hardware, but it's improving.

Cloud providers are making Hadoop-optimized environments available where slave nodes have locally attached storage and dedicated networking. Also, hypervisors are becoming far more efficient, with reduced overhead and latency.
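To give a feel for the convenience, here's a sketch of renting a small Hadoop cluster from Amazon's Elastic MapReduce service via the AWS SDK for Java (version 1). Treat it as illustrative only: the release label, instance types, and IAM role names are assumptions you'd replace with values valid for your own account:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

// Requests a small, short-lived Hadoop cluster from Amazon EMR.
// Release label, instance types, and role names are examples only.
public class RentACluster {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("one-time-analysis")
                .withReleaseLabel("emr-5.36.0")                   // assumed EMR release
                .withApplications(new Application().withName("Hadoop"))
                .withServiceRole("EMR_DefaultRole")               // assumed IAM roles
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(4)                     // 1 master + 3 slaves
                        .withMasterInstanceType("m5.xlarge")      // assumed instance types
                        .withSlaveInstanceType("m5.xlarge")
                        .withKeepJobFlowAliveWhenNoSteps(false)); // shut down when idle

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Cluster requested: " + result.getJobFlowId());
    }
}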

Don’t consider a cloud solution for long-term applications, because the cost of renting cloud computing resources is significantly higher than that of owning and maintaining a comparable system. With a cloud provider, you’re paying for convenience and for being able to offload the overhead of provisioning hardware. However, the cloud is an ideal platform for testing, education, and one-time data processing tasks.
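A quick back-of-envelope calculation illustrates the long-term math. Every figure in this sketch is a hypothetical placeholder; substitute real quotes from your hardware vendor and cloud provider before drawing conclusions:

// Back-of-envelope break-even: when does renting a cluster cost more
// than owning one? All figures below are hypothetical placeholders.
public class BreakEven {
    public static void main(String[] args) {
        double hourlyRentalPerNode = 0.50;  // hypothetical cloud price, $/node-hour
        int nodes = 10;
        double purchasePerNode = 4_000.00;  // hypothetical hardware cost, $/node
        double monthlyOpsPerNode = 60.00;   // hypothetical power/cooling/admin, $/node-month

        double rentalPerMonth = hourlyRentalPerNode * nodes * 24 * 30;
        double ownedPerMonth  = monthlyOpsPerNode * nodes;
        double breakEvenMonths = (purchasePerNode * nodes)
                / (rentalPerMonth - ownedPerMonth);

        System.out.printf("Rental: $%.0f/month; owned: $%.0f/month plus $%.0f up front%n",
                rentalPerMonth, ownedPerMonth, purchasePerNode * nodes);
        System.out.printf("Owning pays off after about %.1f months of 24/7 use%n",
                breakEvenMonths);
    }
}

Under these made-up numbers, owning pays for itself after roughly 13 months of round-the-clock use, which is exactly why the cloud favors short-lived workloads.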

Aside from performance and cost, public cloud deployments raise regulatory considerations. If you have sensitive data that must be stored in-house or in-country, a public cloud deployment isn't an option. In cases like this, where you still want the convenience of a cloud-style deployment, a private cloud is a good option, if one is available.