How to Use Apache Hadoop for Predictive Analytics

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

Apache Hadoop is a free, open-source software platform for writing and running applications that process large amounts of data for predictive analytics. It enables distributed parallel processing of large datasets generated from different sources. Essentially, it’s a powerful tool for storing and processing big data.

Hadoop stores any type of data, structured or unstructured, from different sources — and then aggregates that data in nearly any way you want. Hadoop handles heterogeneous data using distributed parallel processing — which makes it a very efficient framework to use in analytic software dealing with big data. No wonder some large companies are adopting Hadoop, including Facebook, Yahoo!, IBM, Twitter, and LinkedIn.

Before Hadoop, companies couldn’t take advantage of big data, which went unanalyzed and was nearly unusable. The cost of storing that data in a proprietary relational database and creating a structured format around it did not justify the benefits of analyzing that data and making use of it.

Hadoop, on the other hand, is making that task seamless — at a fraction of the cost — allowing companies to find valuable insights in the abundant data they have acquired and are accumulating.

The power of Hadoop lies in handling different types — in fact, any type — of data: text, speech, e-mails, photos, posts, tweets, you name it. Hadoop takes care of aggregating this data, in all its variety, and provides you with the ability to query all of the data at your convenience.

You don’t have to build a schema before you can make sense of your data; Hadoop allows you to query that data in its original format.

In addition to handling large amounts of varied data, Hadoop is fault-tolerant, using simple programs that handle the scheduling of the processing distributed over multiple machines. These programs can detect hardware failure and divert a task to another running machine. This arrangement enables Hadoop to deliver high availability, regardless of hardware failure.

Hadoop uses two main components (subprojects) to do its job: MapReduce and Hadoop Distributed File System. The two components work co-operatively:

  • MapReduce: Hadoop’s implementation of MapReduce is based on Google’s research on programming models to process large datasets by dividing them into small blocks of tasks. MapReduce uses distributed algorithms, on a group of computers in a cluster, to process large datasets. It consists of two functions:

    • The Map() function resides on the master node (a networked computer). It divides the input query or task into smaller subtasks, which it then distributes to worker nodes; the worker nodes process the subtasks and pass their answers back to the master node. The subtasks run in parallel on multiple computers.

    • The Reduce() function collects the results of all the subtasks and combines them into an aggregated final result, which it returns as the answer to the original big query.

  • Hadoop Distributed File System (HDFS): HDFS replicates data blocks across multiple computers in your data center (to ensure reliability) and manages the transfer of data among the various parts of your distributed system (see the sketch that follows this list).
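To give a concrete, if simplified, feel for the HDFS side, here is a minimal Java sketch of copying a local file into HDFS with Hadoop’s FileSystem API. The class name and the file paths are hypothetical placeholders, not part of Hadoop itself.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadIntoHdfs {
        public static void main(String[] args) throws Exception {
            // Picks up cluster settings (core-site.xml, hdfs-site.xml) from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical paths: a local export of contact data and its destination in HDFS.
            Path localData = new Path("/tmp/contacts.csv");
            Path hdfsData = new Path("/data/contacts/contacts.csv");

            // HDFS splits the file into blocks and replicates each block across the cluster.
            fs.copyFromLocalFile(localData, hdfsData);
            System.out.println("Copied to HDFS: " + fs.exists(hdfsData));
        }
    }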

Consider a database of two billion people, and assume you want to compute the number of social friends of Mr. X and arrange them according to their geographical locations. That’s a tall order.

The data for two billion people could originate in widely different sources such as social networks, e-mail contact address lists, posts, tweets, browsing histories — and that’s just for openers. Hadoop can aggregate this huge, diverse mass of data so you can investigate it with a simple query.

You would use MapReduce programming capabilities to solve this query. Defining Map and Reduce procedures makes even this large dataset manageable. Using the tools that the Hadoop framework offers, you would create a MapReduce implementation that would do the computation as two subtasks:

  • Compute the number of social friends of Mr. X.

  • Arrange Mr. X’s friends by geographical location.

Your MapReduce implementation program would run these subtasks in parallel, manage communication between the subtasks, and assemble the results. Out of two billion people, you would know who Mr. X’s online friends are.
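Here is a rough sketch of what the driver for such a job could look like in Hadoop’s Java API. The class names and the input/output paths are hypothetical, and the FriendsMapper and FriendsReducer classes it wires together are sketched later in this article.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FriendsByCountry {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "friends of Mr. X by country");
            job.setJarByClass(FriendsByCountry.class);

            // Hypothetical Mapper and Reducer classes, sketched later in this article.
            job.setMapperClass(FriendsMapper.class);
            job.setReducerClass(FriendsReducer.class);

            // The job emits <country, count> pairs.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input: contact records in HDFS; output: one total per country.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would package these classes into a jar and submit it to the cluster with the hadoop jar command, passing the input and output HDFS paths as arguments.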

Hadoop provides a range of Map processors; which one(s) you select will depend on your infrastructure.

Each of your processors will handle a certain number of records. Suppose that each processor handles one million data records. Each processor executes a Map procedure that produces multiple key-value pairs <G, N>, where G (the key) is the geographical location of a person (country) and N (the value) is the number of contacts the person has.
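A minimal sketch of such a Map procedure, written against Hadoop’s Java Mapper API, might look like the following. It assumes a hypothetical input layout in which each record is a comma-separated line of the form id,country,numberOfContacts; a real job would parse whatever format your aggregated data actually uses.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FriendsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text country = new Text();
        private final IntWritable contacts = new IntWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Hypothetical record layout: "person42,France,45"
            String[] fields = line.toString().split(",");
            if (fields.length < 3) {
                return; // skip malformed records
            }
            country.set(fields[1].trim());
            contacts.set(Integer.parseInt(fields[2].trim()));
            // Emit a <G, N> pair, for example <France, 45>.
            context.write(country, contacts);
        }
    }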

Suppose each Map processor produces many pairs of the form <key, value>, such as the following:

        Processor Map#1: <France, 45>

        Processor Map#2: <Morocco, 23>

        Processor Map#3: <USA, 334>

        Processor Map#4: <Morocco, 443>

        Processor Map#5: <France, 8>

        Processor Map#6: <Morocco, 44>

In the Reduce phase, Hadoop assigns a task to a certain number of processors: execute the Reduce procedure, which aggregates the values of pairs that share the same key to produce a final result. For this example, the Reduce implementation sums the values for each key, which is a geographical location. So, after the Map phase, the Reduce phase produces the following:

    <Morocco, 23+44+443=510> ----> <Morocco, 510>
    <France, 45+8=53> ----> <France, 53>
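A matching Reduce procedure, sketched against Hadoop’s Java Reducer API, could look like this; it simply sums the values that Hadoop has grouped under each country key.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FriendsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text country, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // Add up every count the Map phase emitted for this country.
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            total.set(sum);
            // Produces the aggregated pairs, for example <Morocco, 510> and <France, 53>.
            context.write(country, total);
        }
    }

Because this summation is associative, the same class could also be registered as a combiner (via job.setCombinerClass) so that counts are pre-aggregated on each Map node before they cross the network.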

Clearly, Mr. X is a popular guy, but this was a very simple example of how MapReduce can be used. Imagine you’re dealing with a large dataset on which you want to perform complex operations, such as clustering billions of documents, where the operation and the data are just too big for a single machine to handle. Hadoop is the tool to consider.