Machine Learning with Mahout in Hadoop
Machine learning refers to a branch of artificial intelligence techniques that provides tools enabling computers to improve their analysis based on previous events. These computer systems leverage historical data from previous attempts at solving a task in order to improve the performance of future attempts at similar tasks.
In terms of expected outcomes, machine learning may sound a lot like that other buzzword, “data mining”; however, the former focuses on prediction through analysis of prepared training data, while the latter is concerned with knowledge discovery from unprocessed raw data. For this reason, machine learning depends heavily upon statistical modelling techniques and draws from areas of probability theory and pattern recognition.
Mahout is an open source project from Apache, offering Java libraries for distributed or otherwise scalable machine-learning algorithms.
These algorithms cover classic machine learning tasks such as classification, clustering, association rule analysis, and recommendations. Although the Mahout libraries are designed to work within an Apache Hadoop context, they are also compatible with any system supporting the MapReduce framework. For example, Mahout provides Java libraries for primitive collections and common math operations (linear algebra and statistics) that can be used without Hadoop at all.
The Mahout libraries are implemented in Java MapReduce and run on your cluster as collections of MapReduce jobs on either YARN (with MapReduce v2) or MapReduce v1.
Mahout is an evolving project with multiple contributors. At the time of this writing, the collection of algorithms available in the Mahout libraries is by no means complete; however, the collection of algorithms implemented for use continues to expand with time.
There are three main categories of Mahout algorithms for supporting statistical analysis: collaborative filtering, clustering, and classification.
Mahout was designed from the outset to serve as a recommendation engine, employing what is known as a collaborative filtering algorithm. Mahout combines the wealth of clustering and classification algorithms at its disposal to produce more precise recommendations based on input data.
These recommendations are often applied against user preferences, taking into consideration the behavior of the user. By comparing a user’s previous selections, it is possible to identify the nearest neighbors (persons with a similar decision history) to that user and predict future selections based on the behavior of the neighbors.
Consider a “taste profile” engine such as Netflix — an engine which recommends titles based on that user’s previous ratings and viewing habits. In this example, the behavioral patterns for a user are compared against the user’s history — and the trends of users with similar tastes belonging to the same Netflix community — to generate a recommendation for content not yet viewed by the user in question.
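The nearest-neighbor idea behind collaborative filtering can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Mahout's Java API; the users, items, and ratings are invented:

```python
from math import sqrt

# Hypothetical user -> {item: rating} preference data (names are illustrative only).
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 3.0, "C": 5.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}

def cosine_similarity(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def recommend(user, k=2):
    """Score items the user has not yet rated, weighted by neighbor similarity."""
    neighbors = sorted(
        ((cosine_similarity(ratings[user], prefs), name)
         for name, prefs in ratings.items() if name != user),
        reverse=True)[:k]
    scores = {}
    for sim, name in neighbors:
        for item, rating in ratings[name].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # → ['D']
```

Mahout's recommenders follow the same pattern at scale: compute user-to-user (or item-to-item) similarities across the cluster, then predict unseen preferences from the nearest neighbors.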
Unlike Mahout’s recommendation engine feature, clustering is a form of unsupervised learning — where the labels for data points are unknown ahead of time and must be inferred from the data without human-provided examples (the “supervision” in supervised learning).
Generally, objects within a cluster should be similar; objects from different clusters should be dissimilar. Decisions made ahead of time about the number of clusters to generate, the criteria for measuring “similarity,” and the representation of objects will impact the labelling produced by clustering algorithms.
For example, a clustering engine that is provided a list of news articles should be able to define clusters of articles within that collection which discuss similar topics.
Suppose a set of articles about Canada, France, China, forestry, oil, and wine were to be clustered. If the maximum number of clusters were set to 2, your algorithm might produce categories such as “regions” and “industries.” Adjustments to the number of clusters will produce different categorizations; for example, selecting 3 clusters may result in pairwise groupings of nation-industry categories.
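A clustering pass like the one described above is often implemented with k-means, one of the algorithms Mahout ships. The following is a minimal plain-Python sketch of the idea — not Mahout code — where invented 2-D points stand in for document feature vectors:

```python
import random

def kmeans(points, k, iterations=10, seed=42):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # recompute each centroid as the mean of its members
                centroids[i] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, clusters

# Two well-separated blobs; with k=2 the algorithm recovers them.
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, k=2)
```

Note how the decisions called out above — the number of clusters `k` and the similarity measure (here, squared Euclidean distance) — are parameters of the algorithm, and changing either changes the groupings produced.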
Classification algorithms make use of human-labelled training data sets, where the categorization and classification of all future input is governed by these known labels. These classifiers implement what is known as supervised learning in the machine learning world.
Classification rules — set by the training data, which has been labelled ahead of time by domain experts — are then applied against raw, unprocessed data to determine the most appropriate label for each item.
These techniques are often used by e-mail services that attempt to classify spam before it ever reaches your inbox. Specifically, given an e-mail containing a set of phrases known to commonly occur together in a certain class of spam mail — delivered from an address belonging to a known botnet — your classification algorithm can reliably identify the e-mail as malicious.
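The spam example maps naturally onto naive Bayes, one of the classifiers Mahout provides. Here is a toy plain-Python sketch of the idea — not Mahout code — with invented training phrases and labels:

```python
from collections import Counter
from math import log

# Tiny hand-labelled training set (phrases and labels are invented for illustration).
training = [
    ("win cash prize now", "spam"),
    ("cheap meds win big", "spam"),
    ("claim your cash prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch plans this week", "ham"),
    ("project status meeting notes", "ham"),
]

# Count word occurrences per label — this is the "supervised" training step.
word_counts = {"spam": Counter(), "ham": Counter()}
label_counts = Counter()
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(w for c in word_counts.values() for w in c)

def classify(text):
    """Pick the label maximising P(label) * prod P(word|label), add-one smoothed."""
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = log(label_counts[label] / sum(label_counts.values()))
        for word in text.split():
            score += log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("win a cash prize"))  # → spam
```

Mahout's distributed naive Bayes works the same way in principle, but computes the word-per-label counts as MapReduce jobs over training data far too large for a single machine.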
In addition to the wealth of statistical algorithms that Mahout provides natively, a supporting User Defined Algorithms (UDA) module is also available. Users can override existing algorithms or implement their own through the UDA module. This robust customization allows for performance tuning of native Mahout algorithms and flexibility in tackling unique statistical analysis challenges.
If Mahout can be viewed as a statistical analytics extension to Hadoop, UDA should be seen as an extension to Mahout’s statistical capabilities.
Traditional statistical analysis applications (such as SAS, SPSS, and R) come with powerful tools for generating workflows. These applications utilize intuitive graphical user interfaces that allow for better data visualization. Mahout scripts follow a similar pattern to these other tools for generating statistical analysis workflows.
During the final data exploration and visualization step, users can export to human-readable formats (JSON, CSV) or take advantage of visualization tools such as Tableau Desktop.
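An export step of that kind might look like the following plain-Python sketch, where hypothetical clustering results are written out as both JSON and CSV (the article titles and cluster labels are invented):

```python
import csv
import io
import json

# Hypothetical clustering output: (article title, assigned cluster label).
results = [
    ("Canada eyes new oil pipeline", "regions"),
    ("French wine exports climb", "industries"),
]

# JSON for downstream programmatic consumers...
json_text = json.dumps(
    [{"title": title, "cluster": cluster} for title, cluster in results],
    indent=2)

# ...and CSV for spreadsheets or visualization tools such as Tableau Desktop.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(["title", "cluster"])
writer.writerows(results)
```

In practice the buffer would be a file on HDFS or local disk; an in-memory buffer is used here only to keep the sketch self-contained.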
Mahout’s architecture sits atop the Hadoop platform. Hadoop unburdens the programmer by separating the task of programming MapReduce jobs from the complex bookkeeping needed to manage parallelism across distributed file systems. In the same spirit, Mahout provides programmer-friendly abstractions of complex statistical algorithms, ready for implementation with the Hadoop framework.