
Running Statistical Models in Hadoop’s MapReduce

Converting statistical models to run in parallel is a challenging task. In the traditional paradigm for parallel programming, memory access is coordinated through the use of threads: lightweight sub-processes created by the operating system that share a single memory space while executing across multiple processors.

Factors such as race conditions between competing threads (when two or more threads try to change shared data at the same time) can hurt the performance of your algorithm and, worse, corrupt the statistical results your program outputs, particularly for long-running analyses of large sample sets.
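To make the race-condition hazard concrete, here is a minimal Python sketch (illustrative only, not Hadoop code). The statement `counter += 1` is really a read, an add, and a write, so when several threads interleave on it without coordination, updates can be lost; guarding the read-modify-write with a lock keeps the count correct.

```python
import threading

counter = 0
lock = threading.Lock()

def safe_increment(n):
    """Increment the shared counter n times, one locked step at a time."""
    global counter
    for _ in range(n):
        with lock:       # serialize the read-modify-write sequence
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # with the lock, always 400000
```

If you delete the `with lock:` line, the program may still print 400000 on some runs, which is exactly what makes race conditions so treacherous for long-running statistical jobs: the error is intermittent and silent.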

A pragmatic approach to this problem is to assume that not many statisticians will know the ins and outs of MapReduce (and vice versa), nor can you expect they’ll be aware of all the pitfalls that parallel programming entails. Contributors to the Hadoop project have developed (and continue to develop) statistical tools with these realities in mind.

The upshot: Hadoop offers many solutions for implementing the algorithms required to perform statistical modeling and analysis, without overburdening the statistician with nuanced parallel programming considerations.
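To illustrate why statistical work maps onto this model, here is a minimal pure-Python sketch of the MapReduce pattern for one simple statistic, the per-group mean. The function and key names are illustrative, not part of any Hadoop API; the point is that partial sums and counts combine associatively, so the reduce step parallelizes cleanly without any shared-memory threading.

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # Emit (group_key, (partial_sum, partial_count)) for each record.
    group, value = record
    yield group, (value, 1)

def reducer(key, values):
    # Combine partial (sum, count) pairs; both are associative,
    # so reducers can run independently per key.
    total = sum(v for v, _ in values)
    count = sum(c for _, c in values)
    return key, total / count

records = [("a", 2.0), ("a", 4.0), ("b", 10.0)]

# Shuffle phase: sort and group intermediate pairs by key.
intermediate = sorted(kv for rec in records for kv in mapper(rec))
results = dict(
    reducer(k, [v for _, v in group])
    for k, group in groupby(intermediate, key=itemgetter(0))
)
print(results)  # {'a': 3.0, 'b': 10.0}
```

In a real Hadoop job the mapper and reducer would run on separate cluster nodes and the framework would handle the shuffle, but the statistician only has to express the computation in this map-then-reduce shape.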
