SQL’s Importance for Hadoop

By Dirk deRoos

SQL has proven resilient for compelling reasons. The IT industry has had 40 years of experience with SQL, which IBM first developed in the early 1970s. As relational databases were widely adopted in the 1980s, SQL became a standard skill for most IT professionals.

You can easily see why SQL has been so successful: It’s relatively easy to learn, and SQL queries are quite readable. This ease can be traced back to a core design point in SQL — the fact that it’s a declarative language, as opposed to an imperative language.

In a declarative language, your queries deal only with the nature of the data being requested. Ideally, nothing in your query should determine how the processing is executed. In other words, all you indicate in SQL is what information you want back from the system, not how to get it.

In contrast, with an imperative language (C, for example, or Java, or Python) your code consists of instructions that define, step by step, the actions you need the system to execute.
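To make the contrast concrete, here is a minimal sketch that computes the same per-region totals twice: once by handing a SQL query to an engine, and once with an explicit Python loop. The sample data, the table name sales, and both function names are hypothetical, and Python's built-in sqlite3 module stands in for a SQL engine purely for illustration; it is not Hive or any Hadoop interface.

```python
import sqlite3

# Hypothetical sample data: (region, amount) pairs.
SALES = [("east", 100), ("west", 250), ("east", 50), ("west", 25)]


def totals_declarative(rows):
    """State WHAT is wanted; the SQL engine decides how to compute it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


def totals_imperative(rows):
    """Spell out HOW: visit each row and update a running total."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals


if __name__ == "__main__":
    print(totals_declarative(SALES))
    print(totals_imperative(SALES))
```

Both functions return the same totals. The difference is that the query says nothing about scanning, grouping order, or intermediate storage, so the engine is free to choose an execution strategy, whereas the loop fixes that strategy in the code itself.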

In addition to the (easily leveraged) skills of your SQL-friendly IT professionals, decades’ worth of database applications have been built with SQL interfaces. When Hadoop is used to complement the data warehouse, organizations will clearly store structured data in Hadoop and, as a result, run some of their existing application logic against it.

No one wants to pay for applications to be rewritten, so a SQL interface is highly desirable.

With the development of SQL interfaces to Hadoop data, an interesting trend has emerged: almost all commercial business analytics and data management tools are jumping on the Hadoop bandwagon, including business intelligence reporting tools; statistical packages; Extract, Transform, and Load (ETL) frameworks; and a variety of other tools. In most cases, the interface to the Hadoop data is Hive.