10 Reasons to Adopt Hadoop

By Dirk deRoos

Hadoop is a powerful and flexible platform for large-scale data analysis. That alone is a good reason to consider it for your analytics projects. To tip the scales further, here are ten compelling reasons to deploy Hadoop as part of your big data solution.

Hadoop is relatively inexpensive

The cost per terabyte of implementing a Hadoop cluster is lower than the per-terabyte cost of setting up a tape backup system. Granted, a Hadoop system costs more to operate, because the disk drives holding the data are all online and powered, unlike tape. But this comparison still shows the tremendous potential value of an investment in Hadoop.

The primary reason Hadoop is inexpensive is its reliance on commodity hardware. Traditional solutions in enterprise data management depend on expensive resources to ensure high availability and fast performance.

Hadoop has an active open source community

Whenever an organization invests in a software package, a key consideration is the long-term relevance of the software it bought. No business wants to purchase software licenses and build specific skills around technologies that will be either obsolete or irrelevant in the coming months and years.

In that regard, you don’t need to worry about Hadoop. The Apache Hadoop project is on the path of long-term adoption and relevance. Its key projects have dozens of committers and hundreds of developers contributing code. Though a few of these people are academics or hobbyists, the majority of them are paid by enterprise software companies to help grow the Hadoop platform.

Hadoop is being widely adopted in every industry

As with the adoption of relational database technology from the 1980s onward, Hadoop solutions are springing up in every industry. Most businesses with large-scale information management challenges are seriously exploring Hadoop. Broad consensus from media stories and analyst reports now indicates that almost every Fortune 500 company has embarked on a Hadoop project.

Hadoop can easily scale out as your data grows

Rising data volumes are a big data challenge now faced by organizations of every size. In competitive environments where analytics increasingly separates winners from losers, being able to analyze those growing volumes of data is a high priority.

Even now, most traditional data processing tools, such as databases and statistical packages, require larger-scale hardware (more memory, disk, and CPU cores) to handle increasing data volumes. This scale-up approach is limiting and cost-ineffective, given its reliance on expensive high-end components.

In contrast to the scale-up model, where faster and higher capacity hardware is added to a single server, Hadoop is designed to scale out with ease by adding data nodes. These data nodes, representing increased cluster storage capacity and processing power, can easily be added on the fly to an active cluster.
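The scale-out idea can be made concrete with a back-of-envelope model. This is my own toy illustration, not Hadoop code: the per-node capacity figure is a hypothetical number, and the replication factor of 3 reflects the common HDFS default.

```python
# Toy model of Hadoop-style scale-out (illustration only, not Hadoop code).
# Assumption: each commodity data node contributes a fixed amount of storage,
# so cluster capacity grows linearly as nodes are added.

NODE_CAPACITY_TB = 12  # hypothetical raw storage per commodity data node

def cluster_capacity_tb(num_nodes: int, replication: int = 3) -> float:
    """Usable capacity after accounting for HDFS-style block replication."""
    return num_nodes * NODE_CAPACITY_TB / replication

# Scaling out: adding a node grows capacity without touching existing servers.
print(cluster_capacity_tb(10))  # 40.0 TB usable on a 10-node cluster
print(cluster_capacity_tb(11))  # 44.0 TB after one node is added on the fly
```

Contrast this with scale-up, where growing capacity means replacing a single server with a bigger (and disproportionately more expensive) one.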

Traditional tools are integrating with Hadoop

With increased adoption, businesses are coming to depend on Hadoop and are using it to store and analyze critical data. With this trend comes an appetite for the same kinds of data management tools that people are accustomed to having for their traditional data sources, such as a relational database. Here are some of the more important application categories where you can see integration with Hadoop:

  • Business analysis tools

  • Statistical analysis packages

  • Data integration tools

Hadoop can store data in any format

One feature of Hadoop reflects a key NoSQL principle: store the data first, and apply a schema only when the data is read. One major benefit that accrues to Hadoop from acting in accordance with this principle is that you can store literally any kind of data in Hadoop: completely unstructured data, binary formats, semistructured log files, or relational data.

But along with this flexibility comes a curse: After you store data, you eventually want to analyze it — and analyzing messy data can be difficult and time consuming. The good news here is that increasing numbers of tools can mitigate the analysis challenges commonly seen in large, messy data sets.
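The schema-on-read principle can be sketched in a few lines. This is my own minimal illustration, not a Hadoop or Hive API: raw records are stored untouched, and a schema is imposed only at query time, which is also the moment when messy records surface.

```python
# A minimal sketch of "schema on read" (illustration only, not Hadoop API).
# The raw records are stored as-is; a schema is applied only when we query.

raw_records = [
    "2024-01-15|alice|login",        # pipe-delimited log line
    "2024-01-15|bob|purchase",
    "not really structured at all",  # messy data gets stored anyway
]

def apply_schema(line):
    """Parse one raw line into (date, user, action), or None if it doesn't fit."""
    parts = line.split("|")
    if len(parts) != 3:
        return None  # schema-on-read: malformed rows surface at query time
    return tuple(parts)

# The "query" applies the schema and simply skips records that don't conform.
rows = [r for r in map(apply_schema, raw_records) if r is not None]
print(rows)  # [('2024-01-15', 'alice', 'login'), ('2024-01-15', 'bob', 'purchase')]
```

A traditional database would have rejected the third record at load time; here it is stored cheaply and dealt with only when an analysis actually needs it.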

Hadoop is designed to run complex analytics

You can not only store just about anything in Hadoop but also run just about any kind of algorithm against that data. The machine learning models and libraries included in Apache Mahout are prime examples, and they can be used for a variety of sophisticated problems, including classifying elements based on a large set of training data.
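To give a feel for the kind of classification task described above, here is a hedged sketch in plain Python rather than Mahout's actual API: train a nearest-centroid classifier from labeled examples, then label new data points. The labels and feature vectors are invented for illustration.

```python
# A toy nearest-centroid classifier (illustration only, not the Mahout API):
# learn one centroid per class from labeled training data, then assign new
# points to the class with the closest centroid.

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def train(training_data):
    """training_data: {label: [feature vectors]} -> {label: centroid}."""
    return {label: centroid(pts) for label, pts in training_data.items()}

def classify(model, point):
    """Return the label whose centroid is nearest (squared distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist2(model[label], point))

model = train({
    "spam": [(5.0, 1.0), (6.0, 0.5)],  # hypothetical feature vectors
    "ham":  [(1.0, 4.0), (0.5, 5.0)],
})
print(classify(model, (5.5, 1.2)))  # -> spam
```

Mahout provides industrial-strength versions of such algorithms that run across a cluster, so the training set can be far larger than any single machine's memory.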

Hadoop can process a full data set

Consider fraud-analysis use cases: industry data from multiple sources indicates that less than 3 percent of all returns and claims are audited. Granted, in many circumstances, such as election polling, analyzing a small sample of the data is useful and sufficient.

But when 97 percent of returns and claims go unaudited, many fraudulent returns slip through even with good sampling rules. When you can run fraud analysis against the entire corpus of data, sampling becomes a choice rather than a necessity.
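A quick simulation makes the point. The fraud rate below is an assumed number for illustration, not real industry data; only the 3 percent audit rate comes from the discussion above.

```python
# Back-of-envelope simulation of sampling vs. full-corpus analysis
# (assumed numbers, illustration only): if just 3% of claims are audited,
# most fraudulent claims in the unaudited 97% are never examined.

import random

random.seed(42)
TOTAL_CLAIMS = 100_000
FRAUD_RATE = 0.02  # assumption: 2% of claims are fraudulent
AUDIT_RATE = 0.03  # only 3% of claims get audited

claims = [random.random() < FRAUD_RATE for _ in range(TOTAL_CLAIMS)]
audited = random.sample(range(TOTAL_CLAIMS), int(TOTAL_CLAIMS * AUDIT_RATE))

fraud_total = sum(claims)
fraud_caught_by_sample = sum(claims[i] for i in audited)
fraud_caught_by_full_scan = fraud_total  # analyzing the entire corpus

print(fraud_total, fraud_caught_by_sample, fraud_caught_by_full_scan)
```

The sampled audit catches only a few percent of the fraudulent claims, while a full scan, which is exactly what Hadoop's scale-out storage and processing make affordable, examines every one.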

Hardware is being optimized for Hadoop

Intel is now a player in the Hadoop distribution market. The move was a shrewd one: maintaining its own distribution demonstrates the seriousness and commitment behind Intel's open source integration efforts.

With Hadoop, Intel sees a tremendous opportunity to sell more hardware. After all, Hadoop clusters can feature hundreds of nodes, all requiring processors, motherboards, RAM, and hard disk drives. Intel has been investing heavily in understanding Hadoop so that it can build Intel-specific hardware optimizations that its Hadoop contributors can integrate into open source Hadoop projects.

Other major hardware vendors (such as IBM, Dell, and HP) are also actively bringing Hadoop-friendly offerings to market.

Hadoop can increasingly handle flexible workloads

During the four-year lead-up to the release of Hadoop 2, a great deal of attention went to eliminating the single point of failure (SPOF) that the HDFS NameNode represented. That fix was an important improvement, because it did much to enable enterprise-grade stability, but YARN is a far more significant development.

Before Hadoop 2, all processing on a Hadoop cluster had to run within the MapReduce framework. That was acceptable for the log analytics use cases Hadoop was originally built for, but with increased adoption came a real need for greater flexibility.
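To see what that single processing model looks like, here is the canonical word-count example as a minimal, single-process sketch of MapReduce's map and reduce phases. This is my own illustration of the programming model, not Hadoop's Java API, and it omits the distributed shuffle that a real cluster performs between the two phases.

```python
# A single-process sketch of the MapReduce programming model
# (illustration only, not the Hadoop API), using word count.

from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by word and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["hadoop scales out", "hadoop runs mapreduce"]
print(reduce_phase(map_phase(docs)))
# {'hadoop': 2, 'scales': 1, 'out': 1, 'runs': 1, 'mapreduce': 1}
```

Every job had to be forced into this map-then-reduce shape, which is why YARN's ability to schedule other processing frameworks on the same cluster matters so much.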