The Apache Hadoop Ecosystem
Hadoop is more than MapReduce and HDFS (Hadoop Distributed File System): It’s also a family of related projects (an ecosystem, really) for distributed computing and large-scale data processing. Most (but not all) of these projects are hosted by the Apache Software Foundation. The following table lists some of these projects.
|Ambari||An integrated set of Hadoop administration tools for installing, monitoring, and maintaining a Hadoop cluster. Also included are tools to add or remove slave nodes.
|Avro||A framework for the efficient serialization (a kind of transformation) of data into a compact binary format
|Flume||A data flow service for the movement of large volumes of log data into Hadoop
|HBase||A distributed columnar database that uses HDFS for its underlying storage. With HBase, you can store data in extremely large tables with variable column structures.
|HCatalog||A service for providing a relational view of data stored in Hadoop, including a standard approach for tabular data
|Hive||A distributed data warehouse for data that is stored in HDFS; also provides a query language that’s based on SQL
|Hue||A Hadoop administration interface with handy GUI tools for browsing files, issuing Hive and Pig queries, and developing Oozie
|Mahout||A library of machine learning statistical algorithms that were implemented in MapReduce and can run natively on Hadoop
|Oozie||A workflow management tool that can handle the scheduling and chaining together of Hadoop applications
|Pig||A platform for the analysis of very large data sets that runs on HDFS. It has an infrastructure layer consisting of a compiler that produces sequences of MapReduce programs, and a language layer consisting of the query language named Pig Latin.
|Sqoop||A tool for efficiently moving large amounts of data between relational databases and HDFS
|ZooKeeper||A simple interface to the centralized coordination of services (such as naming, configuration, and synchronization) used by
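To give a feel for what the "compact binary format" in the Avro row can mean in practice, here is a minimal Python sketch of the variable-length zig-zag encoding that Avro's binary format uses for integers, plus length-prefixed strings. It does not use the Avro library and leaves out schemas and file framing entirely. The two-field record (a name and an age) and every function name here are hypothetical, chosen only for illustration.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one so that values
    close to zero (positive or negative) become small numbers."""
    return (n << 1) ^ (n >> 63)


def encode_long(n: int) -> bytes:
    """Encode an integer as a zig-zag varint: 7 payload bits per byte,
    with the high bit set on every byte except the last."""
    z = zigzag(n)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation flag set
        z >>= 7
    out.append(z)  # final byte, continuation flag clear
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(name: str, age: int) -> bytes:
    """Hypothetical record with fields (name: string, age: long).
    Fields are written back to back in schema order, with no field names."""
    return encode_string(name) + encode_long(age)


if __name__ == "__main__":
    payload = encode_record("Ada", 36)
    print(payload.hex(), len(payload), "bytes")
```

Because field names live in the schema rather than in each record, the encoded record carries only the values, which is where most of the space saving over a text format such as JSON comes from.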
The Hadoop ecosystem and its commercial distributions continue to evolve, with new or improved technologies and tools emerging all the time.
The figure shows the various Hadoop ecosystem projects and how they relate to one another: