The Apache Hadoop Ecosystem
Hadoop is more than MapReduce and HDFS (Hadoop Distributed File System): It’s also a family of related projects (an ecosystem, really) for distributed computing and large-scale data processing. Most (but not all) of these projects are hosted by the Apache Software Foundation. The following table lists some of these projects.
|Project|Description|
|---|---|
|Ambari|An integrated set of Hadoop administration tools for installing, monitoring, and maintaining a Hadoop cluster. Also includes tools to add or remove slave nodes.|
|Avro|A framework for the efficient serialization (a kind of transformation) of data into a compact binary format|
|Flume|A data flow service for the movement of large volumes of log data into Hadoop|
|HBase|A distributed columnar database that uses HDFS for its underlying storage. With HBase, you can store data in extremely large tables with variable column structures.|
|HCatalog|A service for providing a relational view of data stored in Hadoop, including a standard approach for tabular data|
|Hive|A distributed data warehouse for data that is stored in HDFS; also provides a query language that’s based on SQL (HiveQL)|
|Hue|A Hadoop administration interface with handy GUI tools for browsing files, issuing Hive and Pig queries, and developing Oozie workflows|
|Mahout|A library of statistical machine learning algorithms that were implemented in MapReduce and can run natively on Hadoop|
|Oozie|A workflow management tool that can handle the scheduling and chaining together of Hadoop applications|
|Pig|A platform for the analysis of very large data sets that runs on HDFS. It has an infrastructure layer, consisting of a compiler that produces sequences of MapReduce programs, and a language layer, consisting of the query language named Pig Latin.|
|Sqoop|A tool for efficiently moving large amounts of data between relational databases and HDFS|
|ZooKeeper|A simple interface to the centralized coordination of services (such as naming, configuration, and synchronization) used by distributed applications|
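Several projects in the table (Pig and Mahout, for example) are described in terms of MapReduce, so it can help to see that model in miniature. The following is only a rough sketch in plain Python that runs in a single process: it does not use Hadoop or any Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. It counts words by mapping each line to (word, 1) pairs, grouping the pairs by word, and then reducing each group to a total.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word in one line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key (a stand-in for the framework's shuffle step)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values collected for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; on a real cluster the lines would come from HDFS.
    sample = ["pig hive pig", "hive hbase"]
    print(word_count(sample))
```

On a real cluster, the map and reduce steps would run in parallel across many nodes, and the framework would handle the grouping in between. A tool such as Pig spares you from writing these steps by hand by generating sequences of MapReduce programs for you.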
The Hadoop ecosystem and its commercial distributions continue to evolve, with new or improved technologies and tools emerging all the time.
The figure shows the various Hadoop ecosystem projects and how they relate to one another: