Hardware Requirements for HBase

By Dirk deRoos

HBase is a powerful and flexible technology, but that flexibility comes with a requirement for careful configuration and tuning. It’s time for some general guidelines for configuring HBase clusters. Your “mileage” may vary, depending on the specific compute requirements for your RegionServers (custom coprocessors, for example) and the other applications you may choose to co-locate on your cluster.

RegionServers

The first temptation to resist when configuring your RegionServers is plunking down lots of cash for high-end enterprise systems. Don’t do it! HBase is typically deployed on plain-vanilla commodity x86 servers.

Now, don’t take that statement as license to deploy the cheapest, lowest-quality servers. Yes, HBase is designed to recover from node failures, but your availability suffers during recovery periods, so hardware quality and redundancy do matter.

Redundant power supplies and redundant network interface cards are a good idea for production deployments. Typically, organizations choose two-socket machines with four to six cores each.

The second temptation to resist is configuring your server with the maximum storage and memory capacity. A common configuration would include from 6 to 12 terabytes (TB) of disk space and from 48 to 96 gigabytes (GB) of RAM. RAID controllers for the disks are unnecessary because HDFS provides data protection when disks fail.

HBase requires a read cache (the BlockCache) and a write cache (the MemStores), both allocated from the Java heap. Keep this statement in mind as you read about the HBase configuration variables, because a direct relationship exists between a RegionServer’s disk capacity and its Java heap. Check out an excellent discussion on HBase RegionServer memory sizing.
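Before digging into that discussion, here’s a rough sketch (in Python) of how a RegionServer heap divides among those caches, using the default fractions; the property names in the comments are the standard HBase settings, and the 32GB heap is just an example figure:

java_heap_gb = 32  # example heap size; see the sizing discussion below

# Default heap fractions for the two caches:
memstore_fraction = 0.4     # hbase.regionserver.global.memstore.size (write cache)
block_cache_fraction = 0.4  # hfile.block.cache.size (read cache)

print(f"MemStores (writes): {java_heap_gb * memstore_fraction:.1f}GB")
print(f"BlockCache (reads): {java_heap_gb * block_cache_fraction:.1f}GB")
print(f"Everything else:    {java_heap_gb * (1 - memstore_fraction - block_cache_fraction):.1f}GB")

# Recent HBase releases refuse to start if the two fractions together
# exceed 0.8, to leave headroom for the rest of the heap.
assert memstore_fraction + block_cache_fraction <= 0.8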

That article points out that you can estimate the ratio of raw disk space to Java heap with the following formula:

RegionSize / MemstoreSize × HDFSReplicationFactor × HeapFractionForMemstores

Using the default HBase configuration variables provides this ratio:

10GB / 128MB × 3 × 0.4 = 96, a ratio of 96MB disk space : 1MB Java heap space.

At that ratio, a RegionServer with 32GB of RAM allocated to the Java heap can support roughly 3TB of raw disk capacity.

What you end up with, then, is 1 terabyte of usable space per RegionServer, since the default HDFS replication factor is 3. This number is still impressive in terms of database storage per node, but not so impressive given that commodity servers can typically accommodate eight or more drives with a capacity of 2 to 4 terabytes apiece.
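Here’s a minimal sketch in Python that applies the formula to those defaults; the property names in the comments are the standard HBase and HDFS configuration keys where the defaults live:

GB = 1024  # work in megabytes throughout

region_size_mb = 10 * GB       # hbase.hregion.max.filesize (default: 10GB)
memstore_size_mb = 128         # hbase.hregion.memstore.flush.size (default: 128MB)
replication_factor = 3         # dfs.replication (HDFS default: 3)
heap_fraction_memstores = 0.4  # hbase.regionserver.global.memstore.size (default: 0.4)

# MB of raw disk served per MB of RegionServer Java heap
ratio = (region_size_mb / memstore_size_mb) * replication_factor * heap_fraction_memstores
print(f"disk-to-heap ratio: {ratio:.0f}:1")  # 96:1

java_heap_gb = 32
raw_disk_gb = java_heap_gb * ratio  # GB of heap times the MB:MB ratio gives GB of disk
print(f"raw disk per RegionServer:    {raw_disk_gb / GB:.0f}TB")                       # 3TB
print(f"usable disk per RegionServer: {raw_disk_gb / GB / replication_factor:.0f}TB")  # 1TB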

The overarching problem as of this writing is that current Java Virtual Machines (JVMs) struggle to provide efficient memory management (garbage collection, to be precise) with large heap spaces (greater than 32GB, for example).

Yes, there are garbage collection tuning parameters you can use, and you should check with your JVM vendor to ensure you have the latest options, but they won’t get you very far at this time.

The memory management issue will eventually be solved, but for now be aware that you may run into trouble if your HBase storage requirements are in the range of hundreds of terabytes to more than a petabyte. You can easily increase the region size to 20GB, which takes you to 6TB of raw capacity and 2TB of usable space per RegionServer.

You can make other tweaks (reducing the MemStore flush size for read-heavy workloads, for example), but you won’t make order-of-magnitude leaps in usable space until a JVM arrives that handles garbage collection efficiently with massive heaps. The sketch below runs the numbers for both tweaks.
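Again as a rough sketch under the same defaults; the 64MB flush size for a read-heavy cluster is an illustrative value, not a recommendation:

GB = 1024  # work in megabytes

def disk_to_heap_ratio(region_size_mb=10 * GB, memstore_size_mb=128,
                       replication=3, heap_fraction_memstores=0.4):
    """MB of raw disk served per MB of RegionServer Java heap."""
    return (region_size_mb / memstore_size_mb) * replication * heap_fraction_memstores

def report(label, ratio, heap_gb=32, replication=3):
    raw_tb = heap_gb * ratio / GB
    print(f"{label}: {ratio:.0f}:1, {raw_tb:.0f}TB raw, {raw_tb / replication:.0f}TB usable")

report("defaults (10GB regions, 128MB flush)", disk_to_heap_ratio())
report("20GB regions", disk_to_heap_ratio(region_size_mb=20 * GB))
report("64MB flush (read-heavy)", disk_to_heap_ratio(memstore_size_mb=64))
# Each tweak doubles the ratio (96:1 to 192:1) -- useful, but nowhere
# near an order-of-magnitude leap.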

You can find ways around the JVM garbage collection issue for RegionServers, but the solutions are new and, as of this writing, not yet part of the main HBase distribution.

Master servers

The MasterServer doesn’t consume system resources the way RegionServers do, but you should still provide for hardware redundancy, including RAID, so that a failed disk doesn’t take the master down. For good measure, also configure a backup MasterServer in the cluster. A common configuration is 4 CPU cores, between 8GB and 16GB of RAM, and 1 Gigabit Ethernet. If you co-locate MasterServers and Zookeeper nodes, 16GB of RAM is advisable.

Zookeeper

Like the MasterServer, Zookeeper doesn’t require a large hardware configuration, but it must never be blocked from (or forced to compete for) system resources. Zookeeper, the coordination service for an HBase cluster, sits in the data path for clients. If Zookeeper cannot do its job, time-outs occur, and the results can be catastrophic.

Zookeeper’s hardware requirements are the same as the MasterServer’s, except that a dedicated disk should be provided for the process. For small clusters, you can co-locate Zookeeper with the MasterServer, but remember that Zookeeper needs sufficient system resources available whenever it’s called on.