10 Emerging Hadoop Technologies to Keep Your Eye On
With Hadoop hitting mainstream IT with a vengeance, open source projects related to Hadoop are popping up everywhere. Here are the top ten most interesting emerging Hadoop projects for you to keep your eye on. Some of them could well stagnate and die quietly if a superior replacement were to come along, but most of these evolutionary specimens will probably become standard components in most Hadoop distributions.
This list focuses on the Apache community projects because this ecosystem has been the one where the majority of the existing mainstream Hadoop projects are developed and maintained. Also, Apache projects have solid governance criteria that foster an open development process where contributions from its members are judged on their technical merit rather than on a corporate agenda.
Apache Accumulo is a data storage project for Hadoop, originally developed by the National Security Agency (NSA) of the United States government. Accumulo is a BigTable implementation for Hadoop. Specifically, Accumulo is a multidimensional sorted map, where each row has a unique key, the rows are stored in sorted order based on this key, and each row can have multiple versions (in other words, dimensions).
There was much interest in the NSA in using HBase as a large-scale data store, but it didn't meet the NSA's internal security requirements. NSA engineers then built Accumulo as their own BigTable implementation and later contributed it to the Apache community. The Accumulo project has since grown an active development community, with contributors from a number of different organizations — not just NSA types, in other words. Accumulo, now supported by a number of major Hadoop vendors, is seeing an increasing adoption rate.
The major feature distinguishing Accumulo from other BigTable implementations is cell-based security, which ensures that only authorized users can see the data stored in any queried rows. This is implemented by way of security labels, which are stored with each row.
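The sorted-map-with-labels idea can be sketched in a few lines of Python. This is a toy model, not Accumulo's API: real Accumulo visibility labels are boolean expressions (for example, `admin&audit`) evaluated against a user's authorizations, and real scans run through iterators on tablet servers.

```python
# Toy model of a sorted key/value store with cell-level visibility,
# in the spirit of Accumulo. All names are invented for illustration.
from bisect import insort

class ToyCellStore:
    """Cells kept sorted by (row, column); each cell carries a
    visibility label, and scans only return cells whose label is
    among the caller's authorizations."""

    def __init__(self):
        self._cells = []  # sorted list of (row, column, visibility, value)

    def put(self, row, column, visibility, value):
        insort(self._cells, (row, column, visibility, value))

    def scan(self, auths):
        return [(r, c, v) for (r, c, vis, v) in self._cells if vis in auths]

store = ToyCellStore()
store.put("patient-001", "diagnosis", "medical", "flu")
store.put("patient-001", "name", "public", "Alice")

# A caller holding only the "public" authorization sees just one cell:
print(store.scan({"public"}))  # [('patient-001', 'name', 'Alice')]
```

The key point is that the visibility check happens inside the scan itself, so unauthorized cells are simply invisible rather than returned and then filtered.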
A number of emerging and competing technologies are out there trying to solve the SQL-on-Hadoop problem. Though most of these technologies are single-company solutions, some of them are community driven, with Hive the most prominent example. Apache Drill is inspired by the Google Dremel paper, which presents a design for an interactive system that can query data stored in a distributed file system like HDFS and not have to rely on MapReduce. The design goal for Drill is to be able to scale out to thousands of servers and provide subminute response times for queries operating against petabyte-scale data.
As of Spring 2014, Drill is still an Apache incubator project, which means that it hasn't yet been accepted as a full-fledged Apache project and is still establishing a stable code base and project governance. But it has great potential, so don't be surprised if it makes its way out of the incubator soon.
With the increased integration of Hadoop in data warehousing environments, the industry is seeing a significant need for data integration and governance capabilities in Hadoop. Current approaches for integrating data and meeting governance criteria involve these two choices:
Buy such tooling from established vendors like IBM and Informatica.
Write extensive libraries of custom code.
This is what the Apache Falcon project is aiming to address with a set of data management services built specifically for Hadoop. Like Drill, Falcon is an Apache incubator project.
The data management services in Falcon are focused primarily on managing data movement and data transformation. In the data warehousing world, this process of moving and transforming data between transactional databases and warehouse databases is commonly known as Extract, Transform, and Load (ETL). As part of the framework for handling ETL processes, Falcon includes the ability to store metadata for the data as it passes through the various ETL stages. Falcon can then provide services for data lifecycle management (for example, executing retention policies), data replication, and tracking data lineage.
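As a rough illustration of the idea (not Falcon's actual API), here is a Python sketch of a tiny ETL pipeline that records lineage metadata as data passes through each stage; the stage names and metadata fields are invented:

```python
# Toy ETL pipeline that captures per-stage lineage metadata,
# illustrating the kind of bookkeeping a data governance service
# performs. Not Falcon's API; names are invented.
class Pipeline:
    def __init__(self):
        self.lineage = []  # one metadata record per ETL stage

    def stage(self, name, func, data):
        result = func(data)
        self.lineage.append({"stage": name,
                             "records_in": len(data),
                             "records_out": len(result)})
        return result

raw = ["1,alice", "2,bob", "bad-row"]
pipe = Pipeline()
extracted = pipe.stage("extract", lambda d: [r.split(",") for r in d], raw)
cleaned = pipe.stage("transform", lambda d: [r for r in d if len(r) == 2], extracted)
loaded = pipe.stage("load", lambda d: dict(d), cleaned)

print([s["stage"] for s in pipe.lineage])  # ['extract', 'transform', 'load']
```

With lineage records like these, a governance service can answer questions such as "which stage dropped records?" after the fact, which is the essence of data lineage tracking.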
Hadoop is quite good at storing and processing data in traditional tables (Hive) and in the newer, BigTable style (HBase and Accumulo), but in many cases, alternative data storage structures are more suited to the task at hand. Graph data looks quite different from table data: It has no rows or columns. There is simply a graph, where individual nodes (also known as vertices) are connected to each other by edges.
Think about it: One huge technical challenge that Google faces is figuring out how best to calculate the ranking of search results. One factor in this is determining how popular individual web pages are, based on how many inbound links originate from other web pages. By far the most practical way to calculate this for all pages is to represent the entire World Wide Web as a graph, where the pages are the nodes (vertices) and the links between them are the edges. To capture this work, Google published a paper on its graph processing system, which is named Pregel.
Apache Giraph, a graph processing engine that is based on the Pregel paper and was built specifically for Hadoop, can read and write data from a number of standard Hadoop sources, including Hive, HBase, and Accumulo.
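To make the vertex-centric model concrete, here is a minimal Python sketch of Pregel-style PageRank: in each superstep, every vertex distributes its current rank along its outgoing edges, then absorbs the rank sent to it. The toy graph and parameters are illustrative, and this is not Giraph's actual API.

```python
# Minimal Pregel-style PageRank sketch. Each superstep: vertices
# "send" rank/outdegree along their edges, then update from the
# messages they received. Illustrative only, not Giraph's API.
def pagerank(graph, supersteps=20, damping=0.85):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        incoming = {v: 0.0 for v in graph}
        for v, edges in graph.items():          # message-sending phase
            for w in edges:
                incoming[w] += rank[v] / len(edges)
        rank = {v: (1 - damping) / n + damping * incoming[v]
                for v in graph}                 # update phase
    return rank

# Tiny "web": page a links to b; b links to a and c; c links to a.
web = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(web)
```

Real Giraph distributes the vertices across the cluster and exchanges the messages over the network between supersteps, but the per-vertex logic has this same shape.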
The Giraph community is quite large and diverse, featuring code committers from a number of organizations, including Facebook, Twitter, and LinkedIn. Giraph is firmly established as the leading graph processing engine for Hadoop, in terms of code maturity, performance, and adoption. Major Hadoop vendors are now supporting Giraph and will likely include it in their distributions. (The Apache BigTop project already does.)
As a distributed system with hundreds or thousands of individual computers, Hadoop clusters are a security administrator's nightmare. To make matters worse, the compute nodes in a Hadoop cluster all have multiple services that talk to each other and, in some cases, require direct connectivity with client applications. Add up all these factors and you have a massive surface area of computers with open ports that you need to protect. To solve this problem, Hortonworks has started the Apache Knox Gateway project, which is still in its early days as an Apache incubator project.
The main objective of Knox Gateway is to provide perimeter security for Hadoop clusters. It accomplishes this by providing a central point for cluster authentication on the edge of a Hadoop cluster. Knox Gateway handles all inbound client requests to a cluster it's guarding and then routes valid requests to the appropriate service in the Hadoop cluster. In this sense, Knox Gateway is literally a secure gateway for all communications between the Hadoop cluster and the outside world. It allows network administrators to isolate the Hadoop cluster from the outside world: as long as the Knox Gateway servers are reachable, clients can still connect securely to their Hadoop services.
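Conceptually, the gateway behaves like the following toy Python sketch: a single choke point authenticates callers and routes valid requests to internal services. The token scheme and service names here are invented for illustration and are not Knox's actual configuration.

```python
# Toy perimeter gateway: one authentication point in front of many
# internal services, in the spirit of Knox Gateway. The token check
# and service paths are invented for illustration.
VALID_TOKENS = {"secret-token": "analyst1"}
SERVICES = {"/webhdfs": lambda user: f"hdfs listing for {user}",
            "/hive":    lambda user: f"hive session for {user}"}

def gateway(path, token):
    user = VALID_TOKENS.get(token)
    if user is None:
        return (401, "authentication failed at the perimeter")
    handler = SERVICES.get(path)
    if handler is None:
        return (404, "no such service behind the gateway")
    return (200, handler(user))  # request forwarded to the cluster service

print(gateway("/webhdfs", "secret-token"))  # (200, 'hdfs listing for analyst1')
```

The design payoff is that only the gateway's ports face the outside world; the hundreds of service ports inside the cluster never have to be exposed directly.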
One exciting aspect of YARN is the possibility of running different kinds of workloads on a Hadoop cluster. With MapReduce, you're restricted to batch processing, but with new technologies such as Spark and Tez (which we talk about below) and the aforementioned Drill, Hadoop will be able to support interactive queries as well. Another class of workload is streaming data, which is what the Apache Samza project is aiming to tackle. (Streaming data processing handles data in real time instead of relying on the stop-and-go nature of batch processing.)
The Samza project was started by engineers from LinkedIn, which needed a streaming data engine. Rather than keep their code in-house, LinkedIn engineers are developing Samza in the open source Apache community. At the time of this writing, Samza is still in its early days as an Apache incubator project. Though other streaming data engines exist (such as Spark Streaming and Storm, discussed below), the LinkedIn team decided to build its own engine that would best suit its needs.
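The batch-versus-streaming distinction can be made concrete with a toy Python word count: the streaming version keeps running state and updates it one record at a time as messages arrive, which is the general shape of a Samza-style job. This is a sketch, not Samza's actual API.

```python
# Batch vs. streaming word count. The batch version needs the whole
# input up front; the streaming version updates state per record and
# can be queried at any moment. Illustrative sketch, not Samza's API.
from collections import Counter

def batch_count(records):
    # Stop-and-go: runs only once all records are available.
    return Counter(word for r in records for word in r.split())

class StreamingCounter:
    def __init__(self):
        self.counts = Counter()   # running state, queryable mid-stream

    def process(self, record):    # called once per arriving message
        self.counts.update(record.split())

stream = StreamingCounter()
for msg in ["hadoop streams", "streams of data"]:
    stream.process(msg)
print(stream.counts["streams"])   # 2
```

In a real streaming engine, `process` would be invoked by the framework as messages arrive from a log or queue (Samza pairs naturally with Apache Kafka for this), and the state would be kept fault-tolerant.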
The section about the Knox Gateway project above describes some of the security challenges with Hadoop. Though Knox Gateway addresses system authorization (ensuring that users are allowed to connect to the Hadoop cluster's services), it doesn't address the pressing need for data authorization, where there are business needs for restricting access to subsets of data. A common example is the need to hide tables that contain sensitive data such as credit card numbers from analysts looking for behavior patterns. The Apache Sentry project was started by Cloudera as a way to provide this kind of access control to data stored in its Impala project and in Hive. As of Spring 2014, Sentry is an Apache incubator project.
Sentry introduces the concept of different user role classes to Hadoop while enabling the classification of data assets in Impala or Hive. Depending on the classification that's applied at the database, table, or view level, only users with the appropriate roles are able to access the data.
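The role-based model can be sketched as follows in Python. The roles, grants, and the wildcard convention here are invented for illustration and don't reflect Sentry's actual policy syntax.

```python
# Toy role-based access model in the spirit of Sentry: privileges are
# granted to roles at the database/table/view level, and users act
# through their roles. Names and the "*" wildcard are invented.
GRANTS = {"analyst_role": {("sales", "orders_no_cc")},   # a single view
          "admin_role":   {("sales", "*")}}              # whole database
USER_ROLES = {"pat": {"analyst_role"}, "dba": {"admin_role"}}

def can_read(user, database, table):
    for role in USER_ROLES.get(user, set()):
        for (db, obj) in GRANTS.get(role, set()):
            if db == database and obj in ("*", table):
                return True
    return False

# The analyst sees the sanitized view but not the raw card numbers:
print(can_read("pat", "sales", "orders_no_cc"))  # True
print(can_read("pat", "sales", "credit_cards"))  # False
```

Granting the analyst a view that excludes the sensitive columns, while reserving the underlying table for an admin role, is exactly the credit-card-hiding pattern described above.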
The Apache Spark project quickly became a household name (at least in Hadoop households) in 2014 when it became a top-level Apache project (meaning that it graduated from incubator status) and a number of Hadoop distribution companies lined up to announce support. Spark, as a cluster computing framework, is yet another project that's realizing the enormous potential YARN brings to Hadoop in supporting different data processing frameworks.
Spark was originally developed by researchers from UC Berkeley, who created the company Databricks back in 2013 to commercialize it, quickly gaining $14 million in venture capital funding.
The excitement around Spark comes from its relative simplicity compared to MapReduce and its much greater flexibility for streaming and interactive workloads. In further contrast to MapReduce, Spark does its data processing in memory, which yields considerable performance benefits. It can also spill to disk for data sets that don't fit in memory, and it still provides performance benefits there because Spark doesn't need to adhere to MapReduce's rigid map and reduce cycles, which aren't optimal for many algorithms.
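The value of in-memory materialization for iterative jobs can be illustrated with a toy Python class that mimics the idea behind caching a data set; this is a sketch of the concept, not Spark's RDD API.

```python
# Toy illustration of why caching matters for iterative workloads:
# an uncached dataset re-derives its contents from the source on
# every pass, while a cached one materializes once and is reused.
# Mimics the idea behind Spark caching; not Spark's actual API.
class ToyDataset:
    def __init__(self, compute):
        self._compute = compute   # simulates re-reading/deriving from disk
        self._cached = None
        self.loads = 0            # how many times the source was scanned

    def collect(self):
        if self._cached is not None:
            return self._cached   # served from memory
        self.loads += 1
        return self._compute()

    def cache(self):
        self.loads += 1
        self._cached = self._compute()
        return self

source = ToyDataset(lambda: [x * x for x in range(5)])
for _ in range(3):                # e.g. three iterations of an algorithm
    source.collect()
print(source.loads)               # 3  (re-scanned on every pass)

cached = ToyDataset(lambda: [x * x for x in range(5)]).cache()
for _ in range(3):
    cached.collect()
print(cached.loads)               # 1  (materialized once, reused)
```

Iterative machine-learning algorithms, which pass over the same data set many times, are exactly the workloads where this one-versus-many-scans difference dominates.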
As a general framework, Spark has a number of child projects for more specialized data processing: Spark Streaming for streaming real-time data feeds; Shark, for interactive SQL queries; Machine Learning Library (MLlib) for machine learning; and GraphX for graph processing.
Apache Storm is the third streaming data analysis engine covered in this article (with Samza and Spark Streaming as the other two), which is a testament to how much attention real-time analytics is getting in the Hadoop community. But these divergent approaches are also indications that it's still early in the evolution of streaming data analysis on Hadoop, because none of these three has emerged as a leader. Storm has been an active project the longest, having been donated to the open source community by Twitter, which acquired Storm's creator, BackType, in 2011. Storm is now an Apache incubator project.
Thanks to work by Hortonworks developers who brought it into the Apache community, Storm was retrofitted to work with the YARN framework. This brought Storm into the Hadoop ecosystem as a real-time processing alternative.
Similar to what is happening with streaming data analysis engines, a number of alternatives to MapReduce have emerged for interactive distributed processing. Spark is a prominent example of these frameworks. The other leading example of such a framework is Apache Tez, which is largely driven by Hortonworks.
The Hortonworks solution to the SQL-on-Hadoop challenge is to improve Hive. To meet this challenge, Hortonworks announced its Stinger initiative, which involves a number of changes to Hive, including better support for the ANSI SQL standard and much improved performance. One key limitation in Hive is its dependence on MapReduce for processing queries. MapReduce is limited in its ability to handle common SQL operations such as joins and group-bys, which results in extremely poor performance compared to massively parallel relational databases running at a comparably large scale. Hortonworks announced the Tez project to provide an alternative framework to MapReduce, one that is optimized for faster and more flexible data processing. Tez will also be used as the underlying framework for Pig.
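The difference can be sketched with a toy Python pipeline: instead of forcing each step into a full map-shuffle-reduce cycle with intermediate files on disk, a DAG-style engine composes operators that feed each other directly. The operators and data here are invented for illustration; this is the shape of the idea, not Tez's API.

```python
# Toy DAG-style query pipeline, in the spirit of Tez: each function is
# one vertex in a processing graph, and downstream vertices consume
# upstream output directly, with no intermediate materialization.
# Operator and column names are invented for illustration.
from collections import defaultdict

orders = [("alice", 30), ("bob", 20), ("alice", 25)]

def op_filter(rows):            # vertex 1: WHERE amount >= 25
    return (r for r in rows if r[1] >= 25)

def op_group_sum(rows):         # vertex 2: GROUP BY user, SUM(amount)
    totals = defaultdict(int)
    for user, amount in rows:
        totals[user] += amount
    return dict(totals)

# Operators stream into each other; nothing is written out between them.
# A rigid MapReduce plan would run these as separate jobs, writing the
# filtered rows to HDFS before the group-by job could even start.
result = op_group_sum(op_filter(orders))
print(result)   # {'alice': 55}
```

Eliminating those per-stage disk round-trips (and the job-launch overhead between them) is the main reason DAG engines can answer the same SQL-style query so much faster than chained MapReduce jobs.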