Hadoop Integration with R

By Dirk deRoos

In the beginning, big data and R were not natural friends. R programming requires that all objects be loaded into the main memory of a single machine. The limitations of this architecture are quickly realized when big data becomes a part of the equation.

In contrast, distributed processing frameworks such as Hadoop lack strong statistical capabilities but are ideal for scaling complex operations and tasks. Vertical scaling solutions — requiring investment in costly supercomputing hardware — often cannot compete with the cost-value return offered by distributed, commodity hardware clusters.

To conform to the in-memory, single-machine limitations of the R language, data scientists often had to restrict analysis to only a subset of the available sample data. Prior to deeper integration with Hadoop, R language programmers offered a scale-out strategy for overcoming the in-memory challenges posed by large data sets on single machines.

This was achieved using message-passing systems and paging. This technique is able to facilitate work over data sets too large to store in main memory simultaneously; however, its low-level programming approach presents a steep learning curve for those unfamiliar with parallel programming paradigms.

Alternative approaches seek to integrate R’s statistical capabilities with Hadoop’s distributed clusters in two ways: interfacing with high-level query languages, and integrating with Hadoop Streaming. With the former, the goal is to leverage existing data warehousing and query platforms for Hadoop, such as Hive, or high-level data flow languages such as Pig. These platforms simplify Hadoop job programming with SQL-style (or script-style) statements, providing a high-level way to run statistical jobs over Hadoop data.

For programmers wishing to write MapReduce jobs in languages other than Java (including R), a second option is to use Hadoop’s Streaming API. With Streaming, the user-submitted map and reduce logic runs as external scripts that exchange data over UNIX standard streams, with serialization guaranteeing Java-compliant input to Hadoop — regardless of the language the programmer originally used.
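
As a rough sketch of how this looks in practice, the mapper and reducer for a word-count job can be written as ordinary R scripts that read lines from standard input and write tab-separated key/value pairs to standard output. The file names and HDFS paths below are placeholders, and R must be installed on every node in the cluster.

    #! /usr/bin/env Rscript
    # --- mapper.R: emit (word, 1) for every word read from standard input ---
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      for (w in unlist(strsplit(line, "[[:space:]]+"))) {
        if (nzchar(w)) cat(w, "\t1\n", sep = "")
      }
    }
    close(con)

    #! /usr/bin/env Rscript
    # --- reducer.R: sum the counts per word (Streaming sorts input by key) ---
    con <- file("stdin", open = "r")
    current <- NULL; total <- 0
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      parts <- strsplit(line, "\t")[[1]]
      if (!is.null(current) && parts[1] != current) {
        cat(current, "\t", total, "\n", sep = "")
        total <- 0
      }
      current <- parts[1]
      total <- total + as.numeric(parts[2])
    }
    if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
    close(con)

The job is then submitted through the streaming JAR (its exact location varies by Hadoop distribution):

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
      -input /data/books -output /data/wordcount \
      -mapper mapper.R -reducer reducer.R \
      -file mapper.R -file reducer.R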

Developers continue to explore various strategies to leverage the distributed computation capability of MapReduce and the almost limitless storage capacity of HDFS in ways that can be exploited by R.

Integration of Hadoop with R is ongoing, with offerings available from IBM (Big R as part of BigInsights) and Revolution Analytics (Revolution R Enterprise). Bridging solutions that integrate high-level programming and querying languages with Hadoop, such as RHive and RHadoop, are also available.

Fundamentally, each system aims to deliver the deep analytical capabilities of the R language to much larger sets of data.

RHive

The RHive framework serves as a bridge between the R language and Hive. RHive delivers the rich statistical libraries and algorithms of R to data stored in Hadoop by extending Hive’s SQL-like query language (HiveQL) with R-specific functions. Through the RHive functions, you can use HiveQL to apply R statistical models to data in your Hadoop cluster that you have cataloged using Hive.
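
A minimal sketch of the workflow looks something like the following; the connection host and table name are placeholders, and the exact connection arguments depend on how your Hive server is set up.

    library(RHive)

    rhive.init()                               # pick up Hadoop and Hive environment settings
    rhive.connect(host = "hive-server.local")  # placeholder hostname

    # Run a HiveQL query against a table cataloged in Hive and pull the
    # result set back into the R session as a data frame
    sales <- rhive.query("SELECT region, SUM(revenue) AS total
                            FROM sales
                           GROUP BY region")
    summary(sales$total)    # ordinary R functions applied to the result

    rhive.close()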

RHadoop

Another open source framework available to R programmers is RHadoop, a collection of packages intended to help manage the distribution and analysis of data with Hadoop. Three packages of note — rmr2, rhdfs, and rhbase — provide most of RHadoop’s functionality:

  • rmr2: The rmr2 package supports translation of the R language into Hadoop-compliant MapReduce jobs (producing efficient, low-level MapReduce code from higher-level R code; see the sketch after this list).

  • rhdfs: The rhdfs package provides an R language API for file management over HDFS stores. Using rhdfs, users can read from HDFS stores to an R data frame (matrix), and similarly write data from these R matrices back into HDFS storage.

  • rhbase: The rhbase package likewise provides an R language API, but its purpose is database management for HBase stores, rather than HDFS files.
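
To make the division of labor concrete, here is a minimal sketch that uses rhdfs to initialize HDFS access and rmr2 to run a small MapReduce job written entirely in R. The input data is generated inline, so the only assumption is a working RHadoop installation on the cluster.

    library(rhdfs)
    library(rmr2)

    hdfs.init()                      # rhdfs: initialize the connection to HDFS

    small.ints <- to.dfs(1:1000)     # rmr2: write sample data into HDFS

    # MapReduce expressed in R: key each value by whether it is even or odd,
    # square it in the map phase, then sum the squares per key in the reduce phase
    out <- mapreduce(
      input  = small.ints,
      map    = function(k, v) keyval(v %% 2, v^2),
      reduce = function(k, vv) keyval(k, sum(vv))
    )

    from.dfs(out)                    # pull the resulting key/value pairs back into R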

Revolution R

Revolution R (by Revolution Analytics) is a commercial R offering with support for R integration on Hadoop distributed systems. Revolution R promises to deliver improved performance, functionality, and usability for R on Hadoop. To provide R-style deep analytics at scale, Revolution R makes use of the company’s ScaleR library — a collection of statistical analysis algorithms developed specifically for enterprise-scale big data collections.

ScaleR aims to deliver fast execution of R program code on Hadoop clusters, allowing the R developer to focus exclusively on their statistical algorithms and not on MapReduce. Furthermore, it handles numerous analytics tasks, such as data preparation, visualization, and statistical tests.
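
In practice, the ScaleR functions mirror familiar R modeling calls. The sketch below assumes a configured Hadoop compute context and a delimited file already sitting in HDFS; the file path and column names are placeholders, and the arguments to RxHadoopMR() depend on how the cluster is reached.

    library(RevoScaleR)

    # Direct ScaleR to run its jobs on the Hadoop cluster
    rxSetComputeContext(RxHadoopMR())

    # Describe a delimited file that already lives in HDFS
    flights <- RxTextData("/data/airline.csv", fileSystem = RxHdfsFileSystem())

    # The same high-level calls an R user makes locally now execute as
    # distributed jobs over the Hadoop data
    rxSummary(~ ArrDelay, data = flights)
    delayModel <- rxLinMod(ArrDelay ~ DayOfWeek, data = flights)
    summary(delayModel)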

IBM BigInsights Big R

Big R offers end-to-end integration between R and IBM’s Hadoop offering, BigInsights, enabling R developers to analyze Hadoop data. The aim is to exploit R’s programming syntax and coding paradigms, while ensuring that the data operated upon stays in HDFS. R datatypes serve as proxies to these data stores, which means R developers don’t need to think about low-level MapReduce constructs or any Hadoop-specific scripting languages (like Pig).

BigInsights Big R technology supports multiple data sources — including flat files, HBase, and Hive storage formats — while providing parallel and partitioned execution of R code across the Hadoop cluster. It hides many of the complexities in the underlying HDFS and MapReduce frameworks, allowing Big R functions to perform comprehensive data analytics — on both structured and unstructured data.
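
As a rough sketch (the hostname, credentials, and file path below are placeholders), a Big R session typically connects to the cluster and then wraps HDFS data in a bigr.frame proxy, after which familiar R idioms are pushed down to the cluster.

    library(bigr)

    # Connect to the BigInsights cluster (placeholder host and credentials)
    bigr.connect(host = "bivm.example.com", user = "biadmin", password = "secret")

    # A bigr.frame is a proxy for data that never leaves HDFS; here it
    # describes a delimited file without loading it into R's memory
    air <- bigr.frame(dataSource = "DEL",
                      dataPath   = "/user/biadmin/airline.csv",
                      header     = TRUE)

    # Familiar R operations are translated into jobs that run on the cluster
    dim(air)
    head(air)
    summary(air)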

Finally, the scalability of Big R’s statistical engine allows R developers to make use of pre-defined statistical techniques as well as to author new algorithms themselves.