Sizing Your Hadoop Cluster

By Dirk deRoos

Sizing any data processing system is as much a science as it is an art. With Hadoop, you consider the same information as you would with a relational database, for example. Most significantly, you need to know how much data you have, estimate its expected growth rates, and establish a retention policy (how long to keep the data).

The answers to these questions serve as your starting point, which is independent of any technology-related requirements.

After you determine how much data you need to store, you can start factoring in Hadoop-specific considerations. Suppose that you have a telecom company and you’ve established that you need 750 terabytes (TB) of storage space for its call detail record (CDR) log files.

You retain these records to obey government regulations, but you can also analyze them to see churn patterns and monitor network health, for example. To determine how much storage space you need and, as a result, how many racks and slave nodes you need, you carry out your calculations with these factors in mind:

  • Replication: The default replication factor for data in HDFS is 3. The 500 terabytes of CDR data for the telecom company in the example then turns into 1500 terabytes.

  • Swap space: Any analysis or processing of the data by MapReduce needs an additional 25 percent of space to store any interim and final result sets. (The telecom company now needs 1875 terabytes of storage space.)

  • Compression: The telecom company stores the CDRs in a compressed form, where the average compression ratio is expected to be 3:1. You now need 625 terabytes.

  • Number of slave nodes: Assuming that each slave node has twelve 3TB drives dedicated to HDFS, each slave node has 36 terabytes of raw HDFS storage available, so the company needs 18 slave nodes.

  • Number of racks: Because each slave node uses 2RU and the company in the example needs three master nodes (1RU apiece) and two ToR switches (1RU apiece), you need a total of 41RU. It’s 1RU less than the total capacity of a standard rack, so a single rack is sufficient for this deployment.

    Regardless, no room remains for growth in this cluster, so it’s prudent to buy a second rack (and two additional ToR switches) and divide the slave nodes between the two racks.

  • Testing: Maintaining a test cluster that’s a smaller scale representation of the production cluster is a standard practice. It doesn’t have to be huge, but you want at least five data nodes so that you get an accurate representation of Hadoop’s behavior. As with any test environment, it should be isolated on a different network from the production cluster.

  • Backup and disaster recovery: Like any production system, the telecom company will also need to consider backup and disaster recovery requirements. This company could go as far as to create a mirror cluster to ensure they have a hot standby for their entire system. This is obviously the most expensive option, but is appropriate for environments where constant uptime is critical.

    At the least expensive end of the spectrum (beyond not backing up the data at all), the telecom company could regularly backup all data (including the data itself, applications, configuration files, and metadata) being stored in their production cluster to tape. With tape, the data is not immediately accessible, but it will enable a disaster recovery effort in the case that the entire production Hadoop cluster fails.

As with your own personal computer, when the main hard disk drive fills with space, the system slows down considerably. Hadoop is no exception. Also, a hard drive performs better when it’s less than 85 to 90 percent full. With this information in mind, if performance is important to you, you should bump up the swap-space factor from 25 to 33 percent.