Columnar Data in NoSQL

By Adam Fowler

Column stores in NoSQL are similar at first appearance to traditional relational DBMS. The concepts of rows and columns are still there. You also define column families before loading data into the database, meaning that the structure of data must be known in advance.

However, column stores organize data differently than relational databases do. Instead of storing data in a row for fast access, data is organized for fast column operations. This column‐centric view makes column stores ideal for running aggregate functions or for looking up records that match multiple columns.

Aggregate functions are data combinations or analysis functions. They can be as simple as counting the number of results, summing them, or calculating their mean average. They could be more complex, though — for example, returning a complex value describing an overarching range of time.

Column stores are also sometimes referred to as Big Tables or Big Table clones, reflecting their common ancestor, Google’s Bigtable.

Perhaps the key difference between column stores and a traditional RDBMS is that, in a column store, each record (think row in an RDBMS) doesn’t require a single value per column. Instead, it’s possible to model column families. A single record may consist of an ID field, a column family for “customer” information, and another column family for “order item” information.

Each one of these column families consists of several fields. One of these column families may have multiple “rows” in its own right. Order item information, for example, has multiple rows — one for each line item. These rows will contain data such as item ID, quantity, and unit price.

A key benefit of a column store over an RDBMS is that column stores don’t require fields to always be present and don’t require a blank padding null value like an RDBMS does. This feature prevents the sparse data problem, preserving disk space. An example of a variable and sparse data set is shown here.

image0.jpg

The great thing about column stores is that you can retrieve all related information using a single record ID, rather than using the complex Structured Query Language (SQL) join as in an RDBMS. Doing so does require a little upfront modeling and data analysis, though.

In the example shown, you can retrieve all order information by selecting a single column store row, which means the developer doesn’t need to be aware of the exact complex join syntax of a query in a column store, unlike they would have to be using complex SQL joins in an RDBMS.

So, for complex and variable relational data structures, a column store may be more efficient in storage and less error prone in development than its RDBMS ancestors.

Note that, in the item column family, each item’s ID is represented within the key, and the value is the quantity ordered. This setup allows for fast lookup of all orders containing this item ID.

If you know the data fields involved up front and need to quickly retrieve related data together as a single record, then consider a column store.