Google Dremel and Hadoop

By Dirk deRoos

For most people, the term Dremel brings to mind a handy high-speed, low-torque tool that works well for a variety of jobs around the house. But did you know that Google created a Dremel? Rather than produce another handheld mechanical tool, though, Google chose a fast software tool intended for interactive analysis of big data.

As with other Google technologies that inspired parts of the Hadoop ecosystem, such as MapReduce, Google File System (see HDFS), and BigTable (see HBase), Google developed Dremel for use internally and then published a paper describing the purpose and design of the technology. (In other words, Dremel is not something you can download and use on your Hadoop cluster.)

Google uses Dremel for a variety of jobs, including analyzing web-crawled documents, detecting e-mail spam, working through application crash reports, and more. Google’s BigQuery service actually uses Dremel.

Google designed MapReduce technology for batch processing over massive sets of data. As its needs evolved, so did its technology, and Google decided to create Dremel to improve performance for interactive queries against big data sets.

The MapReduce approach provides scalability and fault tolerance for queries, but it’s fundamentally a batch-based system, so response times for smaller queries (queries involving only a small part of an entire data set, for instance) are often slower than users expect.

So Google developed a query execution technology designed for interactive queries, which runs on intermediate servers on top of the Google File System (GFS). (Remember, GFS was the inspiration for Apache HDFS, which is Hadoop’s file system.)
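To make the role of those intermediate servers a little more concrete, here is a minimal sketch of tree-style aggregation: leaf workers scan their own slice of the data, and each level above merges the partial results it receives. Everything here is an assumption for illustration (the function names, the in-memory lists standing in for storage, the fanout of 2); it runs in a single process and is not Google's implementation.

```python
def leaf_scan(shard):
    """Leaf worker: scan one local shard and return a partial (sum, count)."""
    return sum(shard), len(shard)


def merge(partials):
    """Intermediate or root worker: combine partial results from its children."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count


def run_query(shards, fanout=2):
    """Merge leaf results level by level until a single result remains.

    Assumes at least one shard and fanout >= 2 (hypothetical parameters).
    """
    level = [leaf_scan(s) for s in shards]
    while len(level) > 1:
        # Each group of `fanout` partial results is merged by one parent.
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


# Four toy shards standing in for data spread across many machines.
shards = [[1, 2], [3], [4, 5, 6], [7]]
total, count = run_query(shards)
```

Because each parent only sees small partial results rather than raw rows, no single machine has to touch the whole data set.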

Similar to Hive, Dremel uses an SQL-like language (familiar to most programmers). It also stores data in a columnar layout. Dremel provides fast, interactive query response while preserving the scalability and fault tolerance found in Apache Hive. In the Dremel whitepaper, Google explains how it can perform aggregation queries within seconds over tables with a trillion rows. Not bad at all.
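Why does a columnar layout help with aggregation queries? The sketch below is a toy illustration, not Dremel's actual storage format: it pivots a few made-up records into one list per field and then answers a query that reads only the columns it needs. The table, field names, and helper functions are all hypothetical.

```python
# Hypothetical toy table: each record has three fields.
rows = [
    {"country": "US", "url": "a.example", "clicks": 3},
    {"country": "DE", "url": "b.example", "clicks": 5},
    {"country": "US", "url": "c.example", "clicks": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per field (a columnar layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def sum_clicks_by_country(cols):
    """Roughly: SELECT country, SUM(clicks) FROM t GROUP BY country.

    Only the two referenced columns are read; the 'url' column is never touched.
    """
    totals = {}
    for country, clicks in zip(cols["country"], cols["clicks"]):
        totals[country] = totals.get(country, 0) + clicks
    return totals


columns = to_columns(rows)
result = sum_clicks_by_country(columns)
```

With rows stored together, the same query would have to read every field of every record. With columns stored separately, a query that mentions two fields out of hundreds can skip the rest, which is a large part of what makes interactive response times feasible on very wide tables.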

So Google has its Dremel technology, which it uses internally, but then there are all the technologies “inspired by” Dremel (kind of like all those perfumes “inspired by” Drakkar Noir).