Column Families in the HBase Data Model

By Dirk deRoos

In the HBase data model, columns are grouped into column families, which must be defined up front when the table is created. The columns in a column family are stored together on disk, which is why HBase is referred to as a column-oriented data store.

Logical View of Customer Contact Information in HBase

Row Key   Column Family: {Column Qualifier:Version:Value}
00001     CustomerName: {‘FN’: 1383859182496:‘John’,
                         ‘LN’: 1383859182858:‘Smith’,
                         ‘MN’: 1383859183001:‘Timothy’,
                         ‘MN’: 1383859182915:‘T’}
          ContactInfo:  {‘EA’: 1383859183030:‘John.Smith@xyz.com’,
                         ‘SA’: 1383859183073:‘1 Hadoop Lane, NY 11111’}
00002     CustomerName: {‘FN’: 1383859183103:‘Jane’,
                         ‘LN’: 1383859183163:‘Doe’}
          ContactInfo:  {‘SA’: 1383859185577:‘7 HBase Ave, CA 22222’}

The table shows two column families: CustomerName and ContactInfo. When creating a table in HBase, the developer or administrator must define one or more column families, and each column family name must consist of printable characters.
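To make the layout of the table concrete, here is a minimal sketch that models the logical view as nested Python dictionaries: row key, then column family, then column qualifier, then a map of version (timestamp) to value. This is an illustration only, not the HBase client API, and the helper name latest is made up for this example. It mirrors the way a read returns the most recent version of a cell by default.

```python
# Toy model of the logical view:
# row key -> column family -> column qualifier -> {version: value}.
# Plain dictionaries only; this is not the HBase client API.
table = {
    "00001": {
        "CustomerName": {
            "FN": {1383859182496: "John"},
            "LN": {1383859182858: "Smith"},
            "MN": {1383859183001: "Timothy", 1383859182915: "T"},
        },
        "ContactInfo": {
            "EA": {1383859183030: "John.Smith@xyz.com"},
            "SA": {1383859183073: "1 Hadoop Lane, NY 11111"},
        },
    },
    "00002": {
        "CustomerName": {
            "FN": {1383859183103: "Jane"},
            "LN": {1383859183163: "Doe"},
        },
        "ContactInfo": {
            "SA": {1383859185577: "7 HBase Ave, CA 22222"},
        },
    },
}


def latest(table, row_key, family, qualifier):
    """Return the value with the highest version (timestamp), or None if the cell is absent."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier)
    if not versions:
        return None
    return versions[max(versions)]


# The 'MN' cell of row 00001 has two versions; the newer timestamp wins.
print(latest(table, "00001", "CustomerName", "MN"))  # Timothy
# Row 00002 has no 'EA' cell, so nothing is stored for it at all.
print(latest(table, "00002", "ContactInfo", "EA"))   # None
```

Note that row 00002 simply has no entries for the middle name or e-mail address; in this model, a missing cell takes up no space rather than holding an empty value.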

Generally, column families remain fixed throughout the lifetime of an HBase table, but new column families can be added by using administrative commands. The official recommendation for the number of column families per table is three or fewer. (See the Apache HBase online documentation.)
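The idea that column families are part of the table definition, and change only through an explicit administrative step, can be sketched with a small in-memory class. Everything here is hypothetical: ToyTable, put, and add_family are invented names that stand in for a real table and its administrative commands, and the check for "printable characters" assumes printable, non-whitespace ASCII.

```python
import string

# Assumption for this sketch: "printable" means printable, non-whitespace ASCII.
_ALLOWED = set(string.printable) - set(string.whitespace)


class ToyTable:
    """Hypothetical in-memory stand-in for an HBase table (not the real client API)."""

    def __init__(self, name, families):
        if not families:
            raise ValueError("at least one column family is required")
        self.name = name
        self.families = set()
        self.rows = {}
        for family in families:
            self.add_family(family)

    def add_family(self, family):
        """Stand-in for an administrative command that adds a column family."""
        if not family or any(ch not in _ALLOWED for ch in family):
            raise ValueError(f"invalid column family name: {family!r}")
        self.families.add(family)

    def put(self, row_key, family, qualifier, value):
        """Store a value; the column family must already be defined on the table."""
        if family not in self.families:
            raise KeyError(f"column family {family!r} is not defined on {self.name!r}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


customers = ToyTable("customer", ["CustomerName", "ContactInfo"])
customers.put("00001", "CustomerName", "FN", "John")

try:
    # 'Orders' was never defined, so the write is rejected.
    customers.put("00001", "Orders", "Last", "42")
except KeyError as err:
    print("rejected:", err)

# After the explicit administrative step, the same write succeeds.
customers.add_family("Orders")
customers.put("00001", "Orders", "Last", "42")
```

The design point is that writes never create column families implicitly; only the table definition does.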

In addition, you should store data with similar access patterns in the same column family. You wouldn’t want a customer’s middle name stored in a separate column family from the first or last name, because you generally access all name data at the same time.

Because the data in each column family is stored together on disk, grouping data with similar access patterns in the same column family reduces overall disk access and increases performance.
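A rough way to see the effect is to group cells into one store per column family and count how many stores a read has to touch. The sketch below is a toy model under that assumption; build_stores and read are hypothetical helpers, and the MiddleName family exists only to show the layout you would want to avoid.

```python
from collections import defaultdict


def build_stores(cells):
    """Group (row_key, family, qualifier, value) cells into one store per column family."""
    stores = defaultdict(list)
    for row_key, family, qualifier, value in cells:
        stores[family].append((row_key, qualifier, value))
    return dict(stores)


def read(stores, row_key, wanted):
    """Fetch the wanted (family, qualifier) cells; also report which stores were touched."""
    touched = set()
    result = {}
    for family, qualifier in wanted:
        touched.add(family)
        for stored_key, stored_qualifier, value in stores.get(family, []):
            if stored_key == row_key and stored_qualifier == qualifier:
                result[(family, qualifier)] = value
    return result, touched


# Layout from the article: all name data lives in CustomerName.
grouped = build_stores([
    ("00001", "CustomerName", "FN", "John"),
    ("00001", "CustomerName", "MN", "Timothy"),
    ("00001", "CustomerName", "LN", "Smith"),
    ("00001", "ContactInfo", "EA", "John.Smith@xyz.com"),
])
name_values, name_touched = read(
    grouped, "00001",
    [("CustomerName", "FN"), ("CustomerName", "MN"), ("CustomerName", "LN")],
)

# Hypothetical layout to avoid: the middle name split into its own family.
split = build_stores([
    ("00001", "CustomerName", "FN", "John"),
    ("00001", "MiddleName", "MN", "Timothy"),
    ("00001", "CustomerName", "LN", "Smith"),
])
split_values, split_touched = read(
    split, "00001",
    [("CustomerName", "FN"), ("MiddleName", "MN"), ("CustomerName", "LN")],
)

print(len(name_touched), len(split_touched))  # 1 2
```

Both layouts return the same full name, but the split layout touches two stores instead of one for the same request.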