The Apache Hadoop Ecosystem

Hadoop is more than MapReduce and HDFS (Hadoop Distributed File System): it’s also a family of related projects (an ecosystem, really) for distributed computing and large-scale data processing. Most (but not all) of these projects are hosted by the Apache Software Foundation. The following table lists some of them.

Related Hadoop Projects
Project Name | Description
Ambari | An integrated set of Hadoop administration tools for installing, monitoring, and maintaining a Hadoop cluster. Also included are tools to add or remove slave nodes.
Avro | A framework for the efficient serialization (a kind of transformation) of data into a compact binary format
Flume | A data flow service for moving large volumes of log data into Hadoop
HBase | A distributed columnar database that uses HDFS for its underlying storage. With HBase, you can store data in extremely large tables with variable column structures.
HCatalog | A service for providing a relational view of data stored in Hadoop, including a standard approach for tabular data
Hive | A distributed data warehouse for data that is stored in HDFS; also provides a query language that’s based on SQL (HiveQL)
Hue | A Hadoop administration interface with handy GUI tools for browsing files, issuing Hive and Pig queries, and developing Oozie workflows
Mahout | A library of machine learning and statistical algorithms that were implemented in MapReduce and can run natively on Hadoop
Oozie | A workflow management tool that can handle the scheduling and chaining together of Hadoop applications
Pig | A platform for analyzing very large data sets that runs on HDFS. It has two layers: an infrastructure layer, consisting of a compiler that produces sequences of MapReduce programs, and a language layer, consisting of the query language named Pig Latin.
Sqoop | A tool for efficiently moving large amounts of data between relational databases and HDFS
ZooKeeper | A simple interface to the centralized coordination of services (such as naming, configuration, and synchronization) used by distributed applications
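
Several projects in the table (Pig and Mahout, for example) are described as producing or being implemented in MapReduce programs. To make that concrete, here is a minimal sketch of the map, shuffle, and reduce steps as a word count in plain Python. It does not use Hadoop's API or run on a cluster; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, not part of any Hadoop project.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    On a real cluster the framework does this between map and reduce;
    here it is simulated in memory with a dictionary.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Pig Hive", "hive HBase hive"]))
```

Higher-level tools such as Pig and Hive exist largely so that you can express this kind of job as a short query instead of writing the map and reduce steps by hand.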

The Hadoop ecosystem and its commercial distributions continue to evolve, with new or improved technologies and tools emerging all the time.

The figure shows the various Hadoop ecosystem projects and how they relate to one another:

[Figure: the Hadoop ecosystem projects and their relationships]