Row Keys in the HBase Data Model - dummies

By Dirk deRoos

HBase data stores consist of one or more tables, which are indexed by row keys. Data is stored in rows with columns, and the values stored in a row can have multiple versions. By default, versions are identified by timestamps.

Logical View of Customer Contact Information in HBase

Row Key   Column Family: {Column Qualifier:Version:Value}
00001     CustomerName: {‘FN’:
                         ‘LN’: 1383859182858:‘Smith’,
                         ‘MN’: 1383859183001:‘Timothy’,
                         ‘MN’: 1383859182915:‘T’}
          ContactInfo:  {‘EA’:
                         ‘SA’: 1383859183073:‘1 Hadoop Lane, NY
00002     CustomerName: {‘FN’:
                         ‘LN’: 1383859183163:‘Doe’}
          ContactInfo:  {‘SA’: 1383859185577:‘7 HBase Ave, CA
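
One way to picture the logical view above is as nested maps: row key, then column family, then column qualifier, then version. The sketch below is only an in-memory illustration in Python, not an HBase API. The `table` layout and the `latest` helper are assumptions made for this example, and only values that appear in the table are used.

```python
# Hypothetical in-memory model of the logical view (not an HBase API):
# row key -> column family -> column qualifier -> {timestamp: value}
table = {
    b"00001": {
        "CustomerName": {
            "LN": {1383859182858: "Smith"},
            # Two versions of the same cell, told apart by timestamp.
            "MN": {1383859183001: "Timothy", 1383859182915: "T"},
        },
    },
    b"00002": {
        "CustomerName": {
            "LN": {1383859183163: "Doe"},
        },
    },
}


def latest(table, row_key, family, qualifier):
    """Return the value with the highest timestamp for one cell."""
    versions = table[row_key][family][qualifier]
    newest_timestamp = max(versions)
    return versions[newest_timestamp]


print(latest(table, b"00001", "CustomerName", "MN"))  # Timothy
```

The middle name has two versions, and the one with the larger timestamp wins, which mirrors how a read that asks for a single version would see the newest value.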

For the sake of illustration, the table has two simple row keys: 00001 and 00002. Row keys are implemented as byte arrays, and are sorted in byte-lexicographical order, which simply means that the row keys are sorted, byte by byte, from left to right.
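
Python's `bytes` type compares the same way, byte by byte from left to right, so it makes a convenient stand-in for seeing the effect. The following is a minimal sketch, not HBase code. The key `00010` and the unpadded keys are hypothetical values added for illustration.

```python
# Zero-padded keys of equal length sort the way the numbers would.
row_keys = [b"00002", b"00010", b"00001"]
print(sorted(row_keys))  # [b'00001', b'00002', b'00010']

# Without padding, byte order and numeric order disagree:
# b"10" sorts before b"2" because the byte for "1" is less than the byte for "2".
unpadded = [b"2", b"10", b"1"]
print(sorted(unpadded))  # [b'1', b'10', b'2']
```

This is one reason fixed-width, zero-padded keys such as 00001 and 00002 are a common choice when keys carry numeric identifiers.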

If you think in terms of numeric values when designing row keys, then sorting is simple. Given two keys, if the first byte of Key 1 is less than the first byte of Key 2, Key 1 will always be stored before Key 2, no matter what follows in the sequence of bytes. If the first bytes are equal, the comparison moves on to the next byte, and so on.
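
The comparison rule can be written out explicitly. The function below is a sketch of a byte-lexicographic comparison. `compare_row_keys` is a hypothetical name and not part of any HBase client. Bytes are treated as unsigned values from 0 to 255, and when one key is a prefix of the other, the shorter key is assumed to sort first.

```python
def compare_row_keys(key1: bytes, key2: bytes) -> int:
    """Return -1 if key1 sorts before key2, 1 if after, 0 if equal."""
    # Walk both keys from left to right; the first differing byte decides.
    for byte1, byte2 in zip(key1, key2):
        if byte1 != byte2:
            return -1 if byte1 < byte2 else 1
    # Every shared position matched, so the shorter key sorts first.
    return (len(key1) > len(key2)) - (len(key1) < len(key2))


# The first byte decides, no matter how large the later bytes are.
print(compare_row_keys(bytes([0x01, 0xFF, 0xFF]), bytes([0x02, 0x00])))  # -1
# Bytes are compared as unsigned values, so 0x7F sorts before 0x80.
print(compare_row_keys(bytes([0x7F]), bytes([0x80])))  # -1
```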

However, it’s common to use printable (ASCII) characters rather than numeric values for row keys in HBase. If you do, you need to understand that the Java language represents characters using the Unicode Standard, so keys sort by the code value of each character rather than alphabetically or numerically. The following example illustrates this design consideration for Basic Latin (ASCII).

“Row-1” precedes “Row11”
“Row1” precedes “RowA”
“RowA” precedes “Rowa”
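
These orderings can be checked by encoding the keys to bytes and sorting them. The sketch below assumes the keys are encoded as UTF-8, in which each ASCII character becomes a single byte equal to its code value. The keys are the ones from the example, with a lowercase `Rowa` included to show that uppercase letters sort before lowercase ones.

```python
keys = ["Rowa", "RowA", "Row11", "Row1", "Row-1"]

# Encode each key to bytes, then sort byte by byte.
encoded = sorted(key.encode("utf-8") for key in keys)
print(encoded)  # [b'Row-1', b'Row1', b'Row11', b'RowA', b'Rowa']

# The fourth character decides most of these comparisons.
for character in "-1Aa":
    print(repr(character), hex(ord(character)))
# '-' 0x2d
# '1' 0x31
# 'A' 0x41
# 'a' 0x61
```

The hyphen has the lowest code value, followed by digits, then uppercase letters, then lowercase letters, which explains each line of the example.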

You may wonder why you would bother with this fine detail with respect to row keys. The reason for this special attention is that proper row key design is crucial to achieving good performance in HBase — not doing so means you won’t realize the full value of your HBase cluster. Sorted row keys can help you access your data faster.
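
To see why sorted keys can speed up access, consider looking up a contiguous range of rows. With keys kept in sorted order, the start and end of the range can be found with a binary search instead of checking every key. The sketch below only illustrates that idea on an in-memory list. It is not how HBase is implemented, and the `scan` helper, its inclusive start and exclusive stop, and the sample keys are assumptions made for this example.

```python
import bisect

# Hypothetical row keys, already in byte-lexicographic order.
sorted_keys = [b"00001", b"00002", b"00003", b"00010"]


def scan(sorted_keys, start, stop):
    """Return the keys k with start <= k < stop, using binary search."""
    low = bisect.bisect_left(sorted_keys, start)
    high = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[low:high]


print(scan(sorted_keys, b"00002", b"00010"))  # [b'00002', b'00003']
```

Rows that are read together benefit from having keys that sort next to each other, which is why the ordering rules above matter when you design row keys.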