Data Versions in the HBase Data Model

By Dirk deRoos

You can see a number between the column qualifier and value (‘FN’: 1383859182496:‘John,’ for example). That number is the version number for each value in the table. Values stored in HBase are time stamped by default, which means you have a way to identify different versions of your data right out of the box.

Logical View of Customer Contact Information in HBase
Row Key Column Family: {Column Qualifier:Version:Value}
00001 CustomerName: {‘FN’:
‘LN’: 1383859182858:‘Smith’,
‘MN’: 1383859183001:’Timothy’,
‘MN’: 1383859182915:’T’}
ContactInfo: {‘EA’:
’SA’: 1383859183073:’1 Hadoop Lane, NY
00002 CustomerName: {‘FN’:
‘LN’: 1383859183163:‘Doe’,
ContactInfo: {
’SA’: 1383859185577:’7 HBase Ave, CA

It’s possible to create a custom versioning scheme, but users typically go with a time stamp created using the current Unix time. (The Unix time or Unix epoch represents the number of milliseconds since midnight January 1, 1970 UTC.) The versioned data is stored in decreasing order, so that the most recent value is returned by default unless a query specifies a particular timestamp.

You can see that the fictional service company at first only had an initial for John Smith’s middle name but then later on they learned that the “T” stood for “Timothy.” The most recent value for the ‘MN’ column is stored first in the table.

You can set a limit on the amount of time that data can stay in HBase with a variable called time to live (TTL). You can also set a variable which controls the number of versions per value. This can be done per column family.