IBM Big SQL and Hadoop

By Dirk deRoos

IBM has a long history of working with SQL and database technology. In keeping with this history, IBM’s solution for SQL on Hadoop leverages components from its relational database technologies that are ported to run on Hadoop.

If you’re at all familiar with IBM’s naming for its Big Data products and features, you can easily guess what the company named its SQL-on-Hadoop solution: Big SQL. The goal of Big SQL is to provide a SQL interface on Hadoop that gives users as much as possible of what they’re used to from SQL interfaces for relational databases.

This means extensive query syntax support, fast performance that doesn’t require users to monkey with their queries, and the ability to control data security.

The figure shows a partial deployment of BigInsights, IBM’s Hadoop distribution, running Big SQL.


Here, you can see a subset of the master nodes and data nodes behind the BigInsights firewall. One of the master nodes is running the Big SQL server, which includes IBM’s SQL compiler and optimizer. Also included on this master node is a catalog, where metadata and statistics about any cataloged data in HDFS are stored for use by the compiler/optimizer.

Subsections of queries are sent to the data nodes where the requested data is stored, and there the Big SQL Runtime (which is IBM’s SQL runtime) executes the workload. Rather than run mapper and reducer processes and persist files with intermediate result sets, Big SQL uses continuously running daemons that pass messages to one another.
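To make the idea of sending query subsections to the data nodes more concrete, here is a toy Python sketch of the pattern. It is not IBM's code: the node names, the sample rows, and the run_fragment and run_query functions are all made up for illustration, and threads stand in for the long-running daemons. The point is only that each node filters and pre-aggregates its own local data and hands a small partial result back in memory, with no intermediate files written along the way.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "data node" holds only its own slice of a sales table.
NODE_DATA = {
    "node1": [{"region": "east", "amount": 120}, {"region": "west", "amount": 80}],
    "node2": [{"region": "east", "amount": 50}],
    "node3": [{"region": "west", "amount": 200}, {"region": "east", "amount": 30}],
}


def run_fragment(node, region):
    """Worker side: filter and pre-aggregate local rows, return a small partial result."""
    amounts = [row["amount"] for row in NODE_DATA[node] if row["region"] == region]
    return {"node": node, "partial_sum": sum(amounts), "row_count": len(amounts)}


def run_query(region):
    """Coordinator side: send the same fragment to every node, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(NODE_DATA)) as pool:
        partials = list(pool.map(lambda node: run_fragment(node, region), NODE_DATA))
    # Partial results travel back in memory; nothing is persisted between steps.
    return {
        "total": sum(p["partial_sum"] for p in partials),
        "rows": sum(p["row_count"] for p in partials),
    }


if __name__ == "__main__":
    print(run_query("east"))
```

In this sketch the coordinator plays the role of the Big SQL server on the master node, and each call to run_fragment plays the role of the runtime on one data node.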

It’s important to note that the data being queried is stored and managed by Hadoop. Big SQL supports standard Hadoop file formats — for example, RCFile and Parquet.

Big SQL provides the same extensive SQL support as the IBM relational database products, including ANSI SQL-2011 and compatibility with IBM’s SQL Procedural Language (SQL/PL). (At the time of writing, IBM was working on providing support for Oracle’s SQL dialect and its PL/SQL procedural language.)

Along with the standard IBM SQL engine come a number of other capabilities, most notably IBM’s row- and column-based security (also known as Fine-Grained Access Control, or FGAC), where only specific users can be authorized to see certain sets of data rows or columns.
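The effect of row- and column-based security can be pictured with a small Python sketch. This is only a conceptual model, not Big SQL syntax or IBM's implementation: the roles, the rule tables, and the apply_fgac function are hypothetical, and a real engine enforces such rules inside the database rather than in application code.

```python
# Hypothetical policy: the columns each role may see, and a row filter per role.
COLUMN_RULES = {
    "analyst": {"id", "region", "amount"},
    "auditor": {"id", "region", "amount", "customer_ssn"},
}
ROW_RULES = {
    "analyst": lambda row: row["region"] == "east",  # analysts see only east rows
    "auditor": lambda row: True,                     # auditors see every row
}


def apply_fgac(rows, role):
    """Return only the rows and columns the given role is authorized to see."""
    allowed_cols = COLUMN_RULES.get(role, set())
    row_ok = ROW_RULES.get(role, lambda row: False)  # unknown roles see nothing
    return [
        {col: val for col, val in row.items() if col in allowed_cols}
        for row in rows
        if row_ok(row)
    ]


if __name__ == "__main__":
    table = [
        {"id": 1, "region": "east", "amount": 120, "customer_ssn": "000-00-0001"},
        {"id": 2, "region": "west", "amount": 80, "customer_ssn": "000-00-0002"},
    ]
    print(apply_fgac(table, "analyst"))
    print(apply_fgac(table, "auditor"))
```

The same query against the same table returns different results depending on who is asking, which is the behavior FGAC is meant to provide.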

Big SQL comes with the standard IBM Data Server Client, which includes a driver package. Traditional database applications can connect to the BigInsights Hadoop cluster and securely exchange encrypted data over SSL.
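As a rough sketch of what such a client connection might look like from Python, the example below uses the ibm_db driver package and builds a connection string that requests SSL. The host name, port, credentials, certificate path, and the default database name BIGSQL are placeholders and assumptions, not values from this article; check your own cluster's settings before using anything like this.

```python
def build_dsn(host, port, user, password, database="BIGSQL", cert_path=None):
    """Build a connection string that asks the IBM driver to use SSL.

    All argument values are supplied by the caller; the default database
    name is an assumption for illustration.
    """
    parts = [
        f"DATABASE={database}",
        f"HOSTNAME={host}",
        f"PORT={port}",
        "PROTOCOL=TCPIP",
        f"UID={user}",
        f"PWD={password}",
        "SECURITY=SSL",
    ]
    if cert_path:
        # Optional path to the server certificate used to verify the SSL connection.
        parts.append(f"SSLServerCertificate={cert_path}")
    return ";".join(parts) + ";"


def connect(dsn):
    """Open a connection; requires the third-party ibm_db package to be installed."""
    import ibm_db  # imported lazily so build_dsn works without the driver
    # With a full connection string, user and password are passed as empty strings.
    return ibm_db.connect(dsn, "", "")


if __name__ == "__main__":
    # Placeholder values only; this does not attempt a real connection.
    print(build_dsn("bigsql.example.com", 32051, "user1", "secret"))
```

Keeping the string construction separate from the connect call makes it easy to inspect the settings before any network traffic happens.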