What SQL Access Actually Means

By Dirk deRoos

A number of companies are investing heavily to drive open source projects and proprietary solutions for SQL access to Hadoop data. When you hear the term SQL access, you should know that you’re relying on a few basic assumptions:

  • Language standards: The most important standard, of course, entails the language itself. Many “SQL-like” solutions exist, though they usually fall short in fundamental ways, to the point that even typical SQL statements won’t run.

    The American National Standards Institute (ANSI) established SQL as an official technical standard, and the IT industry accepts the ANSI SQL-92 standard as representing the benchmark for basic SQL compliance. ANSI has released a number of progressively more advanced versions over the years as database technologies have evolved.

  • Drivers: Another key component in a SQL access solution is the driver — the interface that applications use to connect and exchange data with the data store. Without a driver, client applications and tools have no SQL interface to connect to and no way to submit SQL queries.

    As such, any SQL on Hadoop solution has to provide JDBC and ODBC drivers at the very least, because they’re the most commonly used database interface technologies. (A short JDBC sketch appears after this list.)

  • Real-time access: Until Hadoop 2, MapReduce-based execution was the only available option for analytics against data stored in Hadoop. For relatively simple queries involving a full scan of data in a table, Hadoop was quite fast as compared to a traditional relational database.

    Keep in mind that this is a batch analysis use case, where fast can mean hours, depending on how much data is involved. But when it came to more complex queries, involving subsets of data, Hadoop did not do well. MapReduce is a batch processing framework, so achieving high performance for real-time queries before Hadoop 2 was architecturally impossible.

    One early motivator for YARN, Hadoop’s new resource management and scheduling system, was the need to support processing frameworks beyond MapReduce and so enable real-time workloads, such as interactive SQL queries. Indeed, a proper SQL solution should not leave people waiting for reasonable queries.

  • Mutable data: A common question in many discussions around SQL support on Hadoop is “Can we use INSERT, UPDATE, and DELETE statements, as we would be able to do in a typical relational database?” For now, the answer is no, which reflects the nature of HDFS — it’s focused on large, immutable files. Technologies such as Hive offer read-only access to these files (the second sketch after this list shows what that means in practice). Regardless, work is ongoing in the Apache Hive project.
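
To make the driver idea concrete, here is a minimal sketch of a Java client that connects to Hive through its JDBC driver (the org.apache.hive.jdbc.HiveDriver class that ships with Apache Hive and talks to a HiveServer2 instance). The host name, port, database, credentials, and table are placeholders, not details from any real cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (packaged as hive-jdbc).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connect to a HiveServer2 instance; the URL, user, and password
        // here are placeholders for your own cluster's settings.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // An ordinary SQL aggregate query, submitted as-is.
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, COUNT(*) AS order_count "
               + "FROM sales GROUP BY region")) {

            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t"
                    + rs.getLong("order_count"));
            }
        }
    }
}
```

An ODBC driver plays the same role for desktop tools such as spreadsheets and BI products; only the connection plumbing differs, while the SQL text stays the same.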
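
To illustrate the append-only model behind the mutable data point, the following sketch contrasts the bulk operations early Hive did support with the row-level changes it did not. Again, the table names and HDFS path are made-up examples.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveLoadSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Supported: bulk operations that add or rewrite whole files in HDFS.
            stmt.execute("LOAD DATA INPATH '/staging/sales_2014' INTO TABLE sales");
            stmt.execute("INSERT OVERWRITE TABLE sales_by_region "
                       + "SELECT region, SUM(amount) FROM sales GROUP BY region");

            // Not supported here: row-level changes, because HDFS files are
            // immutable and Hive's access to them is read-only.
            // stmt.execute("UPDATE sales SET amount = 0 WHERE order_id = 42");
            // stmt.execute("DELETE FROM sales WHERE order_id = 42");
        }
    }
}
```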