R on Hadoop and the R Language

By Dirk deRoos

The machine learning discipline has a rich and extensive catalogue of techniques. Mahout brings a range of statistical tools and algorithms to the table, but it captures only a fraction of them, because converting these models to the MapReduce framework is challenging.

Over time, Mahout is sure to continue expanding its statistical toolbox, but until then data scientists and statisticians need to be aware of alternative statistical modelling software, which is where R comes in.

The R language is a powerful and popular open-source statistical language and development environment. It offers a rich analytics ecosystem that can assist data scientists with data exploration, visualization, statistical analysis and computing, modelling, machine learning, and simulation. The R language is commonly used by statisticians, data miners, data analysts, and (nowadays) data scientists.

R language programmers have access to the Comprehensive R Archive Network (CRAN), which, at the time of this writing, contains over 3,000 statistical analysis packages. These add-ons can be pulled into any R project, providing rich analytical tools for running classification, regression, clustering, linear modelling, and more specialized machine learning algorithms.

The language is accessible to those familiar with the simple data structure types commonly used by statisticians as well as programmers, such as scalars, vectors, matrices, and data frames (table-like structures whose columns can hold different types).

Out of the box, one of the major drawbacks of the R language is its lack of support for running concurrent tasks. Statistical language tools like R excel at rigorous analysis, but they lack scalability and native support for parallel computation.

These systems were not designed to be distributed across machines, nor to scale to the modern petabyte-scale world of big data. Proposals for overcoming these limitations need to extend R’s scope beyond in-memory loading and single-computer execution environments, while maintaining R’s flair for easily deployable statistical algorithms.