How to Utilize Apache Mahout for Predictive Analytics

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

An open-source tool that is particularly useful in predictive analytics is Apache Mahout. This machine-learning library provides scalable implementations of clustering, classification, collaborative filtering, and other data-mining algorithms that can support a large-scale predictive analytics model.

A highly recommended way to process the data needed for such a model is to run Mahout in a system that’s already running Hadoop. Hadoop designates a master machine that orchestrates the other machines (such as Map machines and Reduce machines) employed in its distributed processing. Mahout should be installed on that master machine.

Imagine you have a large amount of streamed data (Google News articles, for example) and you would like to cluster the articles by topic, using one of the clustering algorithms. After you install Hadoop and Mahout, you can run one of the algorithms, such as K-means, on your data.
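To make the clustering step concrete, here is a minimal sketch of plain single-machine K-means in Python. It is not Mahout code: the function name kmeans, the toy 2-D points (standing in for article feature vectors), and the starting centroids are all assumptions made for illustration.

```python
import math

def kmeans(points, centroids, max_iter=20):
    """Plain single-machine K-means on tuples of floats (illustrative only)."""
    for _ in range(max_iter):
        # Assignment step: group each point under the index of its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:  # no centroid moved, so assignments are stable
            break
        centroids = updated
    return centroids

# Hypothetical toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Every pass here scans the whole dataset on one machine, which is the part that a distributed version has to split across many machines.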

The implementation of K-means under Mahout uses a MapReduce approach, which makes it different from the standard single-machine implementation of K-means. Mahout subdivides the K-means algorithm into these sub-procedures:

  • KmeansMapper reads the input dataset and assigns each input point to the nearest of the initially selected means (cluster representatives).

  • KmeansCombiner takes all the records (<key, value> pairs) produced by KmeansMapper and produces partial sums to ease the calculation of the subsequent cluster representatives.

  • KmeansReducer receives the values produced by all the subtasks (combiners) and calculates the actual centroids of the clusters, which are the final output of K-means.

  • KmeansDriver handles the iterations of the process until all clusters have converged. The output of a given iteration, a partial clustering output, is used as the input for the next iteration. The mapping and reducing of the dataset repeats until the assignment of records to clusters shows no further changes.
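The four sub-procedures above can be sketched in plain Python to show how the work is divided. This is a single-process illustration under stated assumptions, not Mahout's Java implementation: the function names only mirror the roles described above, the two "splits" simulate data blocks handled by separate mappers, and convergence is approximated by checking that no centroid moves more than a small tolerance (tol), which stands in for "assignments no longer change".

```python
import math

def kmeans_mapper(points, centroids):
    """Emit (cluster_id, point) pairs: each point keyed by its nearest centroid."""
    for p in points:
        cid = min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i]))
        yield cid, p

def kmeans_combiner(pairs):
    """Collapse one mapper's output into a partial (vector sum, count) per cluster."""
    partial = {}
    for cid, p in pairs:
        total, count = partial.get(cid, ((0.0,) * len(p), 0))
        partial[cid] = (tuple(t + x for t, x in zip(total, p)), count + 1)
    return partial

def kmeans_reducer(partials, centroids):
    """Merge partial sums from all combiners and divide to get new centroids."""
    sums = {}
    for partial in partials:
        for cid, (total, count) in partial.items():
            acc, n = sums.get(cid, ((0.0,) * len(total), 0))
            sums[cid] = (tuple(a + t for a, t in zip(acc, total)), n + count)
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / sums[cid][1] for s in sums[cid][0]) if cid in sums else c
        for cid, c in enumerate(centroids)
    ]

def kmeans_driver(splits, centroids, max_iter=20, tol=1e-9):
    """Repeat map, combine, reduce until no centroid moves more than tol."""
    for _ in range(max_iter):
        partials = [kmeans_combiner(kmeans_mapper(s, centroids)) for s in splits]
        updated = kmeans_reducer(partials, centroids)
        converged = all(math.dist(a, b) <= tol for a, b in zip(updated, centroids))
        centroids = updated  # this iteration's output feeds the next iteration
        if converged:
            break
    return centroids

# Hypothetical toy data, pre-divided into two input splits.
splits = [
    [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0)],
    [(2.0, 1.0), (9.0, 8.0), (8.0, 9.0)],
]
final_centroids = kmeans_driver(splits, [(0.0, 0.0), (10.0, 10.0)])
```

The combiner is what keeps network traffic low in this design: each split sends one (sum, count) pair per cluster to the reducer instead of every individual point.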

Apache Mahout is a recently developed project, and its functionality still leaves plenty of room for extensions. In the meantime, Mahout already uses MapReduce to implement classification, clustering, and other machine-learning techniques, and it can do so on a large scale.