Running Statistical Models in Hadoop’s MapReduce

By Dirk deRoos

Converting statistical models to run in parallel is a challenging task. In the traditional paradigm for parallel programming, work is divided among threads: lightweight units of execution, scheduled by the operating system, that share a single memory space while running on multiple processors.

Factors such as race conditions between competing threads (when two or more threads try to change shared data at the same time) can hurt your algorithm's performance, and can even corrupt the precision of the statistical results your program outputs, particularly for long-running analyses of large sample sets.
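To make the hazard concrete, here is a minimal sketch of a race condition in Java (Hadoop's own language). The class and field names are illustrative, not from any real codebase: two threads each increment a shared counter 100,000 times. The plain `int` field often ends up below 200,000 because `unsafeCount++` is a read-modify-write sequence that the threads can interleave, while the `AtomicInteger` always reaches exactly 200,000.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RaceDemo {
    static int unsafeCount = 0;                                  // plain field: increments can be lost
    static final AtomicInteger safeCount = new AtomicInteger();  // atomic: increments are never lost

    public static void run() throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                unsafeCount++;                 // read, add, write: three steps that can interleave
                safeCount.incrementAndGet();   // one indivisible atomic operation
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }

    public static void main(String[] args) throws InterruptedException {
        run();
        System.out.println("unsafe count: " + unsafeCount + " (often less than 200000)");
        System.out.println("safe count:   " + safeCount.get());
    }
}
```

Notice that the buggy version may still produce the right answer on any given run; that intermittency is exactly what makes races so dangerous for long-running statistical jobs.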

A pragmatic approach to this problem is to assume that not many statisticians will know the ins and outs of MapReduce (and vice versa), nor can you expect them to be aware of all the pitfalls that parallel programming entails. Contributors to the Hadoop project have developed (and continue to develop) statistical tools with these realities in mind.

The upshot: Hadoop offers many solutions for implementing the algorithms required to perform statistical modeling and analysis, without overburdening the statistician with nuanced parallel programming considerations.
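The key reason MapReduce can hide those parallel pitfalls is that each statistic is expressed as a map step plus an associative reduce step, so partial results from many nodes can be merged in any order with no shared mutable state. As a rough illustration (in plain Java, not the actual Hadoop API), here is a mean computed MapReduce-style: each sample maps to a (sum, count) pair, and reduce combines pairs.

```java
import java.util.Arrays;
import java.util.List;

public class MapReduceMean {
    // Map phase: each sample becomes a (sum, count) pair.
    // Reduce phase: pairs combine associatively, so a framework like Hadoop
    // could merge partial results from different nodes in any order.
    public static double mean(List<Double> samples) {
        double[] totals = samples.stream()
                .map(x -> new double[]{x, 1.0})                              // "map"
                .reduce(new double[]{0.0, 0.0},
                        (a, b) -> new double[]{a[0] + b[0], a[1] + b[1]});   // "reduce"
        return totals[0] / totals[1];
    }

    public static void main(String[] args) {
        System.out.println(mean(Arrays.asList(2.0, 4.0, 6.0))); // prints 4.0
    }
}
```

Because the combining step never touches shared memory, there is nothing for competing threads to race on; that design choice, not programmer vigilance, is what keeps the results precise.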