How to Choose a Hadoop Distribution

By Dirk deRoos

Commercial Hadoop distributions offer various combinations of open source components from the Apache Software Foundation and elsewhere — the idea is that the various components have been integrated into a single product, saving you the effort of having to assemble your own set of integrated components. In addition to open source software, vendors typically offer proprietary software, support, consulting services, and training.

How do you go about choosing a Hadoop distribution from the numerous options that are available? When it comes to setting up your own environment, you’re the one who has to choose, and that choice should be based on a set of criteria designed to help you make the best decision possible.

Not all Hadoop distributions have the same components (although they all have Hadoop’s core capabilities), and not all components in one particular distribution are compatible with other distributions.

The criteria for selecting the most appropriate distribution can be articulated as this set of important questions:

  • What do you want to achieve with Hadoop?

  • How can you use Hadoop to gain business insight?

  • What business problems do you want to solve?

  • What data will be analyzed?

  • Are you willing to use proprietary components, or do you prefer open source offerings?

  • Is the Hadoop infrastructure that you’re considering flexible enough for all your use cases?

  • What existing tools will you want to integrate with Hadoop?

  • Do your administrators need management tools? (Hadoop’s core distribution doesn’t include administrative tools.)

  • Will the offering you choose allow you to move to a different product without obstacles such as vendor lock-in? (Application code that’s not transferable to other distributions and data stored in proprietary formats are good examples of lock-in.)

  • Will the distribution you’re considering meet your future needs, insofar as you’re able to anticipate those needs?

One approach to comparing distributions is to create a feature matrix — a table that details the specifications and features of each distribution you’re considering. Your choice can then depend on the set of features and specs that best addresses the requirements around your specific business problems.
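A feature matrix can also be scored so that the comparison reflects your priorities. The following is only a minimal sketch: the distribution names, the feature names, the 0/1 entries, and the weights are all hypothetical placeholders, not data about any real product. It multiplies each feature by a weight you assign and ranks the candidates by total score.

```python
# Hypothetical requirement weights (higher = more important to your project).
weights = {
    "management_tools": 3,
    "open_source_fidelity": 5,
    "namenode_failover": 4,
    "alternative_file_system": 1,
}

# Hypothetical feature matrix: 1 = feature present, 0 = feature absent.
feature_matrix = {
    "Distribution A": {
        "management_tools": 0,
        "open_source_fidelity": 1,
        "namenode_failover": 1,
        "alternative_file_system": 0,
    },
    "Distribution B": {
        "management_tools": 1,
        "open_source_fidelity": 0,
        "namenode_failover": 1,
        "alternative_file_system": 1,
    },
}


def score(features, weights):
    """Sum the weights of the features a distribution provides."""
    return sum(weight * features.get(name, 0) for name, weight in weights.items())


def rank(feature_matrix, weights):
    """Return distribution names ordered from highest to lowest score."""
    return sorted(
        feature_matrix,
        key=lambda name: score(feature_matrix[name], weights),
        reverse=True,
    )


for name in rank(feature_matrix, weights):
    print(name, score(feature_matrix[name], weights))
```

With these made-up numbers, Distribution A scores 9 and Distribution B scores 8. Changing the weights to match your own business problems can easily reverse the order, which is the point of writing the requirements down first.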

If your requirements center on prototyping and experimentation rather than on a fixed list of features, choosing the latest official Apache Hadoop release might prove to be the best approach. The most recent releases certainly have the newest, most exciting features, but if you want stability, you don’t want excitement. For stability, look for an older release branch that’s been available long enough to have some incremental releases (these typically include bug fixes and minor features).
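The stability rule of thumb can be expressed as a small selection routine. This is a sketch under stated assumptions: the version strings below are invented for illustration (they are not real Hadoop releases), versions are assumed to follow a major.minor.patch pattern, and the threshold of three releases per branch is an arbitrary choice. The function picks the newest branch that has accumulated at least that many releases and returns the latest release on it.

```python
from collections import defaultdict


def stable_release(versions, min_releases=3):
    """Return the latest release on the newest branch with enough releases.

    A branch is identified by its (major, minor) pair. Returns None when
    no branch has at least min_releases releases.
    """
    branches = defaultdict(list)
    for version in versions:
        major, minor, patch = (int(part) for part in version.split("."))
        branches[(major, minor)].append(patch)

    mature = [
        branch for branch, patches in branches.items()
        if len(patches) >= min_releases
    ]
    if not mature:
        return None

    major, minor = max(mature)
    return f"{major}.{minor}.{max(branches[(major, minor)])}"


# Hypothetical release list, for illustration only.
releases = ["1.4.0", "1.4.1", "1.4.2", "1.4.3",
            "1.5.0", "1.5.1", "1.5.2", "1.6.0"]
print(stable_release(releases))
```

Here the newest branch (1.6) has only its initial release, so the routine passes over it and settles on the 1.5 branch, which has had time to collect incremental fixes.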

Whenever you think about open source Hadoop distributions, give a moment’s thought (or perhaps many moments’ thought) to the concept of open source fidelity — the degree to which a particular distribution is compatible with the open source components on which it depends. High fidelity facilitates integration with other products that are designed to be compatible with those open source components. Low fidelity? Not so much.

The open source approach to software development itself is an important part of your Hadoop plans because it promotes compatibility with a host of third-party tools that you can leverage in your own Hadoop deployment. The open source approach also enables engagement with the Apache Hadoop community, which gives you, in turn, the opportunity to tap into a deeper pool of skills and innovation to enrich your Hadoop experience.

Because Hadoop is a fast-growing ecosystem, some parts continue to mature as the community develops tooling to meet industry demands. One aspect of this evolution is known as backporting: applying a new software modification or patch to a version of the software that’s older than the version for which the patch was originally written.

An example is NameNode failover: This capability is a part of Hadoop 2 but was backported (in its beta form) by a number of distributions into their Hadoop-1-based offerings for as much as a year before Hadoop 2 became generally available.

Not every distribution backports new content to the same degree, although most do it for items such as bug fixes. If you want a production license for bleeding-edge technology, a distribution that backports new features aggressively is certainly an option; for stability, however, it’s not a good idea.

The majority of Hadoop distributions include proprietary code of some kind, which frequently comes in the form of installers and a set of management tools. These distributions usually emerge from different business models.

For example, one business model can be summarized this way: “Establish yourself as an open source leader and pioneer, market your company as having the best expertise, and sell that expertise as a service.” Red Hat, Inc. is an example of a vendor that uses this model.

In contrast to this approach, the embrace-and-extend business model has vendors building capabilities that extend the capabilities of open source software. MapR and IBM, which both offer alternative file systems to the Hadoop Distributed File System (HDFS), are good examples.

People sometimes mistakenly throw the “fork” label at these innovations, making use of jargon used by software programmers to describe situations where someone takes a copy of an open source program as the starting point for their own (independent) development.

The alternative file systems offered by MapR and IBM are completely different file systems, not a fork of the open source HDFS. Both companies enable their customers to choose either their proprietary distributed file system or HDFS. Nevertheless, in this approach, compatibility is critical, and the vendor must stay up to date with evolving interfaces. Customers need to know that vendors can be relied on to support their extensions.