The Attributes of HBase

By Dirk deRoos

HBase (Hadoop Database) is a Java implementation of Google’s BigTable. Google defines BigTable as a “sparse, distributed, persistent multidimensional sorted map.” It’s a concise definition, but you’ll probably agree that it packs a lot in. To break it down, the following sections discuss each attribute in turn.

HBase is sparse

As you might have guessed, the BigTable distributed data storage system was designed to meet the demands of big data. Big data applications store lots of data, but the content of that data is also often variable. Imagine a traditional table in a company database storing customer contact information, as shown:

Traditional Customer Contact Information Table

| Customer ID | Last Name | First Name | Middle Name | E-mail Address | Street Address |
|---|---|---|---|---|---|
| 00001 | Smith | John | Timothy | | 1 Hadoop Lane, NY 11111 |
| 00002 | Doe | Jane | NULL | NULL | 7 HBase Ave, CA 22222 |

A company or individual may require a complete data record for each of its customers or constituents. A good example is your doctor, who needs all your contact information in order to provide you with proper care. Other companies or individuals may require only partial contact information or may need to learn that information over time.

For example, a customer service company may process phone calls or e-mail messages for service requests. Clients may or may not choose to give service companies all their contact information. However, with each interaction over time, companies may learn more about their clients that will enable them to provide better service — by issuing proactive service alerts, for example.

In this context, sparse means that fields in rows can be empty or NULL but that doesn’t bring HBase to a screeching halt. HBase can handle the fact that you don’t (yet) know Jane Doe’s middle name and e-mail address, for example.

Here’s another example: a database for storing satellite images. It turns out that Google uses BigTable technology to store satellite imagery of the earth. In almost every case, whenever imagery is stored, metadata is also stored with it.

The metadata may include the street address of the image or only the latitude and longitude if the image is captured from the wilderness. The metadata is variable in content so some fields will be NULL — and that’s OK.

In both examples, the data sets that are collected can be extremely large — especially in the second example. Imagery databases are almost always measured in terabytes or sometimes in petabytes.

HBase is designed for storing big data, but it’s also designed for storing sparse data records at no cost: a field that has no value is simply not written, so it takes up no storage. This matters a great deal when you’re using big data applications! Reserving space for a few NULL fields in each of a million rows is wasteful, but try to imagine the waste over a quadrillion rows!

Thankfully, this was a key consideration for Google designers and the HBase community. Sparse data is supported with no waste of costly storage space.
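To make the storage argument concrete, here is a minimal Python sketch. It is not HBase code; it only contrasts a fixed-schema layout, which reserves a slot for every field, with a sparse layout that records only the cells that hold a value. The names `FIELDS`, `fixed_rows`, `to_sparse`, and `count_cells` are hypothetical, and John Smith’s e-mail address is a made-up placeholder.

```python
# Illustrative model only: plain Python structures, not the HBase API.
FIELDS = ["last_name", "first_name", "middle_name", "email", "street_address"]

# Fixed-schema view: every row carries a slot for every field, even if empty.
fixed_rows = {
    "00001": ["Smith", "John", "Timothy",
              "john@example.com",  # placeholder value, assumed for illustration
              "1 Hadoop Lane, NY 11111"],
    "00002": ["Doe", "Jane", None, None, "7 HBase Ave, CA 22222"],
}


def to_sparse(rows):
    """Keep only the cells that actually hold a value."""
    return {
        row_id: {
            field: value
            for field, value in zip(FIELDS, values)
            if value is not None
        }
        for row_id, values in rows.items()
    }


def count_cells(sparse_rows):
    """Number of cells recorded in the sparse layout."""
    return sum(len(cells) for cells in sparse_rows.values())


sparse_rows = to_sparse(fixed_rows)
fixed_slots = len(fixed_rows) * len(FIELDS)  # 2 rows x 5 fields
stored_cells = count_cells(sparse_rows)      # only the non-empty cells
print(fixed_slots, stored_cells)
```

In this toy model, Jane Doe’s row simply has no entries for her middle name and e-mail address, so nothing is reserved for them.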

And it doesn’t stop there. Consider the power of a schema-less data store. The customer contact table shown earlier is a classic example of the alternative. When companies design these tables, they know up front what they want to store. In other words, the schema is fixed; it’s defined even before the first byte of information is stored in the table.

Now what if, over time, a new field is needed for a customer? How about a Twitter handle or a new mobile phone number? You’re seemingly stuck with a schema that no longer works for you.

Well, HBase solves this challenge as well — you can not only skip fields at no cost when you don’t have the data, but also dynamically add fields (or columns in the HBase vernacular) over time without having to redesign the schema or disrupt operations.

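So you can think of HBase as a schema-less data store; that is, it’s fluid: you can add to, subtract from, or modify the set of columns as you go along. (Strictly speaking, column families are declared when a table is created; it’s the columns within a family that you can add freely.)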
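As a rough illustration of adding a column on the fly, the following Python sketch models a table as a dictionary of rows, where each row is itself a dictionary of columns. This is an assumption-laden toy, not the HBase client API: the `put` helper and the Twitter handle value are hypothetical, and column families are ignored for simplicity.

```python
# Illustrative model only: a table as {row key: {column name: value}}.
table = {
    "00001": {"last_name": "Smith", "first_name": "John"},
    "00002": {"last_name": "Doe", "first_name": "Jane"},
}


def put(table, row_key, column, value):
    """Write one cell, creating the row or the column on first use."""
    table.setdefault(row_key, {})[column] = value


# A new kind of field shows up later; only the row that has it changes.
put(table, "00002", "twitter_handle", "@janedoe")  # hypothetical value

columns_per_row = {row_key: sorted(cells) for row_key, cells in table.items()}
print(columns_per_row)
```

No table-wide change is needed here: the first row is untouched and never learns that a `twitter_handle` column exists.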

HBase is distributed and persistent

BigTable is a distributed and persistent data store. Persistent simply means that the data you store in BigTable (and HBase, for that matter) will persist or remain after your program or session ends. That’s pretty straightforward — persistent means that it persists — but you should spend a little more time thinking about how the data is persisted.

In its BigTable paper, Google described the distributed file system known as Google File System or GFS. It turns out that, just as HBase is an open source implementation of BigTable, HDFS is an open source implementation of GFS.

By default, HBase leverages HDFS to persist its data to disk storage. Though other distributed data stores can be used with HBase, the vast majority of HBase installations leverage HDFS. This makes perfect sense given that HBase is the “Hadoop Database”; hey, it’s built into the name, for goodness’ sake.

HDFS is a key enabling technology not only for Hadoop but also for HBase. By storing data in HDFS, HBase offers reliability, availability, seamless scalability, and high performance, all on cost-effective distributed servers!

HBase has a multidimensional sorted map

Starting from the basics, a map (also known as an associative array) is an abstract collection of key-value pairs, where the key is unique. This definition is crucial to your understanding of HBase because the HBase data model is often described in different ways — often incompletely as a column-oriented store.

HBase is, at bottom, a key-value data store where each key is unique, meaning it appears at most once in the HBase data store. Additionally, the map is sorted and multidimensional. The keys are stored in HBase and sorted in byte-lexicographical order. The map is multidimensional because a value is addressed by more than a row key: the full key also includes the column (family and qualifier) and a version, so each cell can hold multiple versions of its value. By default, data versions are identified by a timestamp.
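The idea of a sorted map with versioned values can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how HBase is implemented: the key is taken to be a (row key, column, timestamp) tuple, the helpers `put`, `get_latest`, and `scan_row_keys` are hypothetical names, and the `info:email` column and the e-mail values are invented for illustration.

```python
# Illustrative model only: the map key is (row key, column, timestamp).
cells = {}


def put(row_key: bytes, column: bytes, timestamp: int, value: bytes) -> None:
    """Store one version of one cell."""
    cells[(row_key, column, timestamp)] = value


def get_latest(row_key: bytes, column: bytes):
    """Return the value with the highest timestamp, or None if absent."""
    versions = [
        (ts, value)
        for (r, c, ts), value in cells.items()
        if r == row_key and c == column
    ]
    return max(versions)[1] if versions else None


def scan_row_keys():
    """Row keys in byte-lexicographical order, as a scan would visit them."""
    return sorted({r for (r, _, _) in cells})


put(b"row2", b"info:email", 1, b"old@example.com")
put(b"row2", b"info:email", 2, b"new@example.com")
put(b"row10", b"info:email", 1, b"other@example.com")

print(get_latest(b"row2", b"info:email"))
print(scan_row_keys())
```

Note that byte-lexicographical order is not numeric order: `row10` sorts before `row2` because the byte for `1` is smaller than the byte for `2`. Zero-padded keys such as the customer IDs `00001` and `00002` avoid that surprise.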