Transitioning from an RDBMS model to HBase
If you’re facing the design phase for your application and you believe that HBase would be a good fit, then designing your row keys and schema to fit the HBase data model and architecture is the right approach. However, sometimes it makes sense to move a database originally designed for an RDBMS to HBase.
A common scenario where this approach makes sense is a MySQL database instance that has reached its limits of scalability. Techniques exist for horizontally scaling a MySQL instance (sharding, in other words) but this process is usually cumbersome and problematic because MySQL simply was not originally designed for sharding.
Transitioning from the relational model to the HBase model is a relatively new discipline. However, certain established patterns of thought are emerging and have coalesced into three key principles to follow when approaching a transition. These principles are denormalization, duplication, and intelligent keys (DDI).
Denormalization: The relational database model depends on a) a normalized database schema and b) joins between tables to respond to SQL operations. Database normalization is a technique which guards against data loss, redundancy, and other anomalies as data is updated and retrieved.
There are a number of rules the experts follow to arrive at a normalized database schema (and database normalization is a whole study itself), but the process usually involves dividing larger tables into smaller tables and defining relationships between them. Database denormalization is the opposite of normalization, where smaller, more specific tables are joined into larger, more general tables.
This is a common pattern when transitioning to HBase because joins are not provided across tables, and joins can be slow since they involve costly disk operations. Guarding against the update and retrieval anomalies is now the job of your HBase client application, since the protections afforded you by normalization are null and void.
Duplication: As you denormalize your database schema, you will likely end up duplicating the data because it can help you avoid costly read operations across multiple tables. Don’t be concerned about the extra storage (within reason of course); you can use the automatic scalability of HBase to your advantage.
Be aware, though, that extra work will be required by your client application to duplicate the data and remember that natively HBase only provides row level atomic operations not cross row (with the exception described in the HBASE-5229 JIRA) or cross table.
Intelligent Keys: Because the data stored in HBase is ordered by row key, and the row key is the only native index provided by the system, careful intelligent design of the row key can make a huge difference. For example, your row key could be a combination of a service order number and the customer’s ID number that placed the service order.
This row key design would allow you to look up data related to the service order or look up data related to the customer using the same row key in the same table. This technique will be faster for some queries and avoid costly table joins.
To clarify these particular patterns of thought, take a Customer Contact Information table and place it within the context of a typical service order database. The figure shows you what a normalized service order database schema might look like.
Following the rules of RDBMS normalization, set up the sample Customer Contact Information table so that it is separate from the service order table in order to avoid losing customer data when service orders are closed and possibly deleted. Take the same approach for the Products table, which means that new products can be added to the fictional company database independently of service orders.
By relying on RDBMS join operations, this schema supports queries that reveal the number of service orders that are opened against a particular product along with the customer’s location where the product is in use.
That’s all fine and dandy, but it’s a schema you’d use with RDBM. How do you transition this schema to an HBase schema? The next figure illustrates a possible HBase scheme — one that follows the DDI design pattern.
The Customer Contact Information table has been denormalized by including the customer name and contact info in place of the foreign keys used previously. Also, the data is duplicated by keeping the Customer Contact Information table as is. Now joins across the Service Order table and Customer Contact Information table are not necessary.
Additionally, an intelligent row key design has been employed that combines the product number with the customer number to form the service order number (A100|00001, for example). Using this intelligent key, the service order table can provide vital reports about product deficiencies and customers who are currently experiencing product issues.
All these queries can all be supported by HBase in a row level atomic fashion for the application. Because you know that HBase orders row keys and sorts them in a lexicographical fashion, your application can make certain educated guesses about data locality when issuing scans for reporting. (All A* series product numbers will be stored together, for example.)
The service order database represented by the HBase schema is a relatively simple example, but it illustrates how HBase can, in certain cases, intersect with the RDBMS world and provide significant value. If the fictional company has terabytes or even petabytes of service call data to store, HBase would make a huge difference in terms of cost, reliability, performance, and scale.
You can, of course, design your service order HBase schema in several different ways. Admittedly, the design all depends on the queries that must be supported, but you have the ability to transition some relational databases to very powerful HBase applications for production use as long as you work from a solid understanding of HBase architecture and the DDI design pattern.
This example has assumed that queries were performed by a Java application leveraging the HBase client APIs, or perhaps via another language using Apache Thrift. This application model may fit the requirements just fine and provide useful performance and customization options for the fictional service company.