By Dirk deRoos

Compaction, the process by which HBase cleans up after itself, comes in two flavors: major and minor. Major compactions can be a big deal, but first you need to understand minor compactions.

Minor compactions combine a configurable number of smaller HFiles into one larger HFile. You can tune the number of HFiles to compact and the frequency of a minor compaction. Minor compactions are important because without them, reading a particular row can require many disk reads and cause slow overall performance.
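Conceptually, a minor compaction is a merge of several sorted files. The sketch below is a hypothetical Python model (not HBase's actual implementation): it selects the smallest HFiles up to a configurable limit and merge-sorts them into one, the way a real minor compaction folds several small HFiles into a larger one.

```python
import heapq

def minor_compact(hfiles, max_files=3):
    """Merge up to max_files of the smallest HFiles into one.

    Each hypothetical 'HFile' here is a list of (row_key, value)
    pairs already sorted by row key, as real HFiles are.
    """
    # Prefer the smallest files, as real compaction selection does
    candidates = sorted(hfiles, key=len)[:max_files]
    remaining = [f for f in hfiles if f not in candidates]
    # heapq.merge streams the sorted inputs into one sorted output
    merged = list(heapq.merge(*candidates))
    return remaining + [merged]

# Three small HFiles holding pieces of rows 00001 and 00002
hfiles = [
    [("00001", "John"), ("00002", "Jane")],
    [("00001", "Smith")],
    [("00001", "Timothy"), ("00002", "Doe")],
]
compacted = minor_compact(hfiles, max_files=3)
print(len(compacted))  # prints 1: a single merged HFile remains
```

After the merge, a read of row 00001 touches one file instead of three, which is exactly the read-amplification problem minor compactions address.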

The following figure illustrates how this concept works and can help you visualize how the table below might be persisted on HDFS.

Logical View of Customer Contact Information in HBase

Row Key   Column Family: {Column Qualifier: Version: Value}

00001     CustomerName: {'FN': 1383859182496: 'John',
                         'LN': 1383859182858: 'Smith',
                         'MN': 1383859183001: 'Timothy',
                         'MN': 1383859182915: 'T'}
          ContactInfo:  {'EA': 1383859183030: 'John.Smith@xyz.com',
                         'SA': 1383859183073: '1 Hadoop Lane, NY 11111'}

00002     CustomerName: {'FN': 1383859183103: 'Jane',
                         'LN': 1383859183163: 'Doe'}
          ContactInfo:  {'SA': 1383859185577: '7 HBase Ave, CA 22222'}

[Figure: Two MemStore flushes writing the CustomerName column family to two HFiles, and a single flush writing the ContactInfo column family to one HFile]

Notice how the CustomerName column family was written to HDFS over two MemStore flushes, while the data in the ContactInfo column family was persisted to disk with only one MemStore flush. This example is hypothetical, but it's a likely scenario depending on the timing of the writes.

Picture a service company that gains more and more customer contact information over time. The company may know a client's first and last name but not learn the middle name until hours or weeks later, during subsequent service requests. This scenario would result in parts of Row 00001 being persisted to HDFS in different HFiles.

Until the HBase system performs a minor compaction, reading from Row 00001 would require three disk reads to retrieve the relevant HFile content! Minor compactions seek to minimize system overhead while keeping the number of HFiles under control. HBase designers took special care to give the HBase administrator as much tuning control as possible to make any system impact “minor.”

As its name implies, a major compaction is a different story in terms of system impact. However, major compactions are quite important to the overall functionality of the HBase system. A major compaction seeks to combine all HFiles into one large HFile.

In addition, a major compaction does the cleanup work after a user deletes a record. When a user issues a Delete call, the HBase system doesn't remove the data immediately; instead, it places a tombstone marker on the affected key-value pairs so that they can be permanently removed during the next major compaction.
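The delete-marker mechanics can be sketched as follows. This is a simplified Python model, not HBase code (real HBase tombstones also respect timestamps and delete types, which this sketch ignores): a Delete appends a marker, and the major compaction drops both the marker and the cells it masks.

```python
def delete(store, row, qualifier):
    """A Delete call only appends a tombstone; nothing is removed yet."""
    store.append((row, qualifier, None, "DELETE"))

def major_compact(store):
    """Drop tombstones and the cells they mask, keeping everything else."""
    deleted = {(r, q) for (r, q, v, kind) in store if kind == "DELETE"}
    return [(r, q, v, kind) for (r, q, v, kind) in store
            if kind != "DELETE" and (r, q) not in deleted]

store = [("00001", "FN", "John", "PUT"), ("00001", "MN", "T", "PUT")]
delete(store, "00001", "MN")        # data still on disk, just masked
store = major_compact(store)        # now it's physically gone
print(store)  # prints [('00001', 'FN', 'John', 'PUT')]
```

Between the Delete call and the next major compaction, the deleted data still occupies disk space; only the rewrite into a new HFile reclaims it.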

Additionally, because major compactions combine all HFiles into one large HFile, the time is right for the system to review the versions of the data and compare them against the time to live (TTL) property. Values older than the TTL are purged.

Time to live refers to a setting in HBase that defines how long data, including its multiple versions, will remain in HBase before being purged.
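The TTL check during a major compaction can be illustrated with a small sketch (hypothetical Python, not HBase internals): any cell whose timestamp falls outside the TTL window relative to the current time is dropped from the rewritten HFile.

```python
import time

def purge_expired(cells, ttl_seconds, now_ms=None):
    """Keep only cells whose age is within the TTL.

    cells: list of (row_key, qualifier, timestamp_ms, value) tuples,
    a simplified stand-in for HBase cells.
    """
    if now_ms is None:
        now_ms = time.time() * 1000
    cutoff = now_ms - ttl_seconds * 1000
    return [c for c in cells if c[2] >= cutoff]

# Two versions of the middle-name qualifier from the example table
cells = [
    ("00001", "MN", 1383859183001, "Timothy"),
    ("00001", "MN", 1383859182915, "T"),
]
# With "now" fixed 100 ms after the newer write and a 0.1-second TTL,
# only the newer version survives the purge
fresh = purge_expired(cells, ttl_seconds=0.1, now_ms=1383859183101)
print(fresh)  # prints [('00001', 'MN', 1383859183001, 'Timothy')]
```

In real HBase, TTL is configured per column family, and expired cells simply aren't written into the compacted HFile.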

You may have guessed that a major compaction significantly affects system response time. Users who are trying to add, retrieve, or manipulate data during a major compaction may see poor system response times.

In addition, the HBase cluster may have to split regions at the same time that a major compaction is taking place and balance the regions across all RegionServers. This scenario would result in a significant amount of network traffic between RegionServers.

For these reasons, your HBase administrator needs to have a major compaction strategy for your deployment.
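One common starting point for such a strategy is to disable time-based automatic major compactions and trigger them manually (for example, from the HBase shell) during off-peak hours. The hbase-site.xml fragment below is illustrative; verify the property names and defaults against your HBase version's documentation before adopting it.

```xml
<!-- Illustrative settings; check them against your HBase version -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <!-- 0 disables automatic major compactions; run them manually
       (e.g., via the shell's major_compact command) off-peak -->
  <value>0</value>
</property>
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <!-- Number of HFiles in a store that triggers consideration
       of a minor compaction -->
  <value>3</value>
</property>
```

Disabling automatic major compactions shifts the responsibility to the administrator: if you never trigger them, tombstones and expired versions are never purged.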