HBase Tuning Prerequisites

By Dirk deRoos

Any serious HBase installation requires some standard setup on your cluster and on your individual nodes. A few examples are provided here. First take a look at monitoring and management.

Tools to monitor your cluster

If you’ve had the privilege of engineering a system at some point in your career, you know you face the major challenge of coming up with a rigorous testing procedure to ensure that your system is ready for its production phase. If you don’t plan for testing and debugging right up front, you’ll likely miss your production deadlines or fail altogether.

The HBase and Hadoop committers made sure that you would have a rich metrics subsystem to draw on during the debug and test phase. You can find all the messy details in the Apache HBase online documentation, especially the sections dealing with HBase Backup and Replication.

The Cluster Replication feature is a key tool for debugging and tuning, and for running MapReduce against your tables without impacting performance on your primary cluster. Obviously, you’ll need it for disaster recovery as well.

Getting started with the Hadoop management tool set is surprisingly easy. HBase leverages the Java Management Extensions (JMX) technology for exposing key metrics. And with the Java Development Kit (JDK), you also get the JConsole tool, a free JMX client that you can use to view HBase metrics.

The HBase distribution we’ve been working with (0.94.7) enables access via JConsole by default, so in your standalone environment you simply select the HBase server that you want to monitor and JConsole then presents you with a graphical user interface for viewing key server metrics.

You can start the JConsole tool with the following command: $JAVA_HOME/bin/jconsole
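To get a feel for what a JMX client such as JConsole reads, here is a minimal Java sketch that queries the standard JVM memory MBean from inside the current process. It does not connect to HBase: the class name JmxHeapPeek is made up for illustration, and the HBase-specific MBeans you would browse in JConsole depend on your HBase version and are not shown.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Hypothetical example class: reads one metric through JMX, the same
// mechanism JConsole uses, but against this JVM rather than an HBase server.
public class JmxHeapPeek {

    /** Returns the number of heap bytes currently in use, read via the memory MBean. */
    static long usedHeapBytes() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Standard name of the JVM memory MBean.
            ObjectName memory = new ObjectName("java.lang:type=Memory");
            CompositeData usage = (CompositeData) server.getAttribute(memory, "HeapMemoryUsage");
            return (Long) usage.get("used");
        } catch (JMException e) {
            throw new IllegalStateException("Could not read the memory MBean", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Heap in use (bytes): " + usedHeapBytes());
    }
}
```

JConsole shows the same attribute under the Memory MBean, alongside whatever MBeans the monitored process registers.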

Additionally, you should familiarize yourself with these two other open source technologies for monitoring your HBase cluster:

  • Ganglia: Often used to provide monitoring graphs over time, Ganglia can help you spot problems that occur occasionally or only after days of operation.

  • Nagios: Nagios is useful if you’re an HBase administrator and you want to receive a page on your pager or an e-mail if, say, a RegionServer goes down or you have a garbage collection issue in your cluster.

If you’re leveraging HBase as part of a commercial product, be sure to check with your vendor for a tool to monitor and manage HBase.

Cluster setup

HBase typically deploys on a cluster, and you’ll need to make some adjustments on each of your servers to accommodate HBase components. A good first step is ensuring that the system clocks on each server in your cluster are in sync.

Out-of-sync system clocks on your servers can really confuse HBase, so check out the Network Time Protocol, or NTP for short. Running NTP on every server in your cluster will take care of most time synchronization issues.
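The idea of a skew check can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not how HBase or NTP measure time: the class name, the host names, and the 30-second tolerance are all assumptions chosen for the example, and the clock readings are supplied by hand rather than collected from real servers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: flags hosts whose reported clock is too far from a reference clock.
public class ClockSkewCheck {

    /** Returns the hosts whose clock differs from referenceMillis by more than maxSkewMillis. */
    static List<String> hostsOutOfSync(Map<String, Long> hostClockMillis,
                                       long referenceMillis,
                                       long maxSkewMillis) {
        List<String> offenders = new ArrayList<String>();
        for (Map.Entry<String, Long> entry : hostClockMillis.entrySet()) {
            long skew = Math.abs(entry.getValue() - referenceMillis);
            if (skew > maxSkewMillis) {
                offenders.add(entry.getKey());
            }
        }
        return offenders;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Made-up readings: one host is close to the reference, one is 45 seconds behind.
        Map<String, Long> clocks = new LinkedHashMap<String, Long>();
        clocks.put("rs1.example.com", now + 120L);
        clocks.put("rs2.example.com", now - 45000L);
        System.out.println("Out of sync: " + hostsOutOfSync(clocks, now, 30000L));
    }
}
```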

Furthermore, HBase is a unique application in certain respects because it stresses your system beyond the level that typical applications do. The truth is that HBase is going to be opening a lot of files; that’s just the nature of the beast.

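Given that fact, you need to ensure that your operating systems are configured to handle what is sure to be a far-from-typical file system load, which usually means raising the limit on open files for the user that runs HBase. Swapping on your Linux servers (moving memory pages between memory and disk, in other words) can also have very adverse effects on ZooKeeper, which depends on timely responses from each node.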
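One way to see how close a Java process is to its open-file limit is to ask the JVM itself. The sketch below assumes an Oracle or OpenJDK runtime on a Unix-like system, where the operating system MXBean implements com.sun.management.UnixOperatingSystemMXBean; on other runtimes it simply reports that the numbers are unavailable. The class name is made up, and the counts describe the JVM running this code, not an HBase daemon.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical example class: reports open and maximum file descriptors for this JVM.
public class FileDescriptorCheck {

    /** Returns {open, max} descriptor counts, or null when the platform does not expose them. */
    static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // Assumption: Oracle/OpenJDK on Unix exposes this vendor-specific interface.
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] counts = descriptorCounts();
        if (counts == null) {
            System.out.println("File descriptor counts are not available on this platform.");
        } else {
            System.out.println("Open descriptors: " + counts[0] + " of a maximum " + counts[1]);
        }
    }
}
```

The same two attributes appear in JConsole under the OperatingSystem MBean, so you can watch them for a running server without writing any code.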

Finally there’s the Java Virtual Machine (JVM) that ultimately runs on each of your nodes and executes the HBase processes. HBase also puts far-from-typical stress on the JVM. (For example, the MemStore cache, which heavily exercises the garbage collection system, is sure to be taxed to the max.)

When the MemStore is flushed to HFiles on HDFS, the Java heap it occupied is reclaimed. This can result in long garbage collection pauses if your JVM is not configured correctly.
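Garbage collection activity is also visible through the JVM's management beans. The following Java sketch, with a made-up class name, lists each collector in the current JVM with its collection count and accumulated collection time. Note the limitation: these figures are running totals, not individual pause lengths, so they only hint at whether collection is becoming expensive.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical example class: summarizes garbage collection totals for this JVM.
public class GcPauseReport {

    /** Sums the accumulated collection time in milliseconds across all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", total time (ms)=" + gc.getCollectionTime());
        }
        System.out.println("Total GC time (ms): " + totalGcMillis());
    }
}
```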

So for all of these reasons and more you should review these two sections of the Apache HBase online documentation:

  • General Configuration Requirements: Review Chapter 2 of the Apache HBase online documentation, especially section 2.5, titled “The Important Configurations”.

  • Java Virtual Machine: Determine which JVM you’re running and make sure that it has been tested for compatibility with HBase. The Apache HBase online documentation suggests Java 6 from Oracle because Java 7 hasn’t been fully tested.

    Another JVM is IBM’s J9. If you plan to use J9, review the IBM documentation for the latest command line options when starting your JVMs.
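If you are not sure which JVM a node is running, the JVM can tell you. This small Java sketch (the class name is illustrative) prints the vendor, VM name, and version from standard system properties; run it with the same java binary that starts your HBase processes.

```java
// Hypothetical example class: identifies the JVM that executes it.
public class JvmInfo {

    /** Returns the vendor, VM name, and Java version as one line of text. */
    static String describeJvm() {
        return System.getProperty("java.vm.vendor") + " | "
                + System.getProperty("java.vm.name") + " | "
                + System.getProperty("java.version");
    }

    public static void main(String[] args) {
        System.out.println(describeJvm());
    }
}
```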

Enabling compression

Compression boosts HBase performance by reducing overall disk input/output. Consider enabling compression unless your data doesn’t compress well (images, for example) or your RegionServers cannot handle the additional CPU load that compression and decompression require.
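To see why the kind of data matters, here is a rough Java sketch that gzips two buffers of equal size with the JDK's java.util.zip classes. It has nothing to do with HBase's own codec plumbing; the class name and the sample data are invented, and the random bytes merely stand in for content that is already compressed, such as most image formats.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

// Hypothetical example class: compares how well two kinds of data gzip.
public class GzipRatioDemo {

    /** Returns the size in bytes of the input after gzip compression. */
    static int gzippedSize(byte[] input) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            GZIPOutputStream gzip = new GZIPOutputStream(buffer);
            gzip.write(input);
            gzip.close(); // finishes the stream and writes the gzip trailer
            return buffer.size();
        } catch (IOException e) {
            throw new IllegalStateException("In-memory compression failed", e);
        }
    }

    /** Builds highly repetitive text, similar to rows that share field names. */
    static byte[] repetitiveRows(int rows) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            text.append("customer_name=Smith;");
        }
        return text.toString().getBytes(StandardCharsets.US_ASCII);
    }

    /** Builds pseudo-random bytes as a stand-in for already-compressed content. */
    static byte[] randomBytes(int length) {
        byte[] data = new byte[length];
        new Random(42).nextBytes(data);
        return data;
    }

    public static void main(String[] args) {
        byte[] text = repetitiveRows(1000);
        byte[] noise = randomBytes(text.length);
        System.out.println("Repetitive text: " + text.length + " -> " + gzippedSize(text) + " bytes");
        System.out.println("Random bytes:    " + noise.length + " -> " + gzippedSize(noise) + " bytes");
    }
}
```

The repetitive buffer shrinks dramatically while the random one does not shrink at all, which is the situation where compression costs CPU without saving disk I/O.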

Compression is set per column family from the HBase shell, and it’s disabled by default. The supported compression types are Gzip, LZO, and Snappy (with some other derivatives available and more on the way). Gzip generally achieves the best compression ratio, but LZO and Snappy are faster.

Keep in mind, though, that both the LZO and Snappy compression codecs must be installed separately; only Gzip works without further configuration steps. The following listing shows the steps you’d need to enable Gzip compression on the CustomerName column family of the Customer Contact Information table:

hbase(main):007:0> disable 'CustomerContactInfo'
hbase(main):010:0> alter 'CustomerContactInfo', { NAME => 'CustomerName', COMPRESSION => 'GZ' }
hbase(main):014:0> describe 'CustomerContactInfo'
 {NAME => 'CustomerName', REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false', COMPRESSION => 'GZ',…
hbase(main):017:0> enable 'CustomerContactInfo'