Factors That Increase the Scale of Statistical Analysis in Hadoop

By Dirk deRoos

The reason people sample their data before running statistical analysis in Hadoop is that this kind of analysis often demands significant computing resources. This isn’t just about data volumes, though; five main factors influence the scale of the analysis:

  • This one’s easy, but it has to be said: the sheer volume of data on which you’ll perform the analysis is the most obvious driver of scale.

  • The number of transformations needed on the data set before applying statistical models is definitely a factor.

  • The number of pairwise correlations you’ll need to calculate plays a role; this count grows quadratically with the number of variables, as the example after this list shows.

  • The degree of complexity of the statistical computations to be applied is a factor.

  • The number of statistical models to be applied to your data set plays a significant role.
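To put the pairwise-correlation factor in perspective: p variables produce p (p − 1) / 2 unique pairs, so a data set with, say, 1,000 variables already calls for 1,000 × 999 / 2 = 499,500 correlations, and doubling the number of variables roughly quadruples that count.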

Hadoop offers a way out of this dilemma by providing a platform for massively parallel computation over the data right where it’s stored.

In doing so, it’s able to flip the analytic data flow; rather than move the data from its repository to the analytics server, Hadoop delivers analytics directly to the data. More specifically, HDFS allows you to store your mountains of data and then bring the computation (in the form of MapReduce tasks) to the slave nodes.
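To make this concrete, here’s a minimal sketch of a MapReduce job that computes a per-group mean. The map tasks run on the slave nodes that hold each HDFS block and emit only small intermediate records, so the bulk of the data never crosses the network. The input path, the two-column CSV layout, and the class names are illustrative assumptions, not anything prescribed by Hadoop.

// A sketch, not production code: map tasks summarize the HDFS blocks stored
// locally on their nodes; only small (group, value) records reach the reducer.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupMean {

    // Parses a hypothetical "group,value" CSV line and emits (group, value).
    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length != 2) {
                return; // skip malformed lines in this sketch
            }
            try {
                context.write(new Text(fields[0]),
                        new DoubleWritable(Double.parseDouble(fields[1])));
            } catch (NumberFormatException e) {
                // skip header rows or non-numeric values
            }
        }
    }

    // Combines the values for each group into a mean.
    public static class MeanReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text group, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            long count = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
                count++;
            }
            context.write(group, new DoubleWritable(sum / count));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "group mean");
        job.setJarByClass(GroupMean.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(MeanReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would package this into a JAR and submit it with the hadoop jar command, pointing the two arguments at an HDFS input directory and an output directory that doesn’t yet exist.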

The common challenge in moving from traditional symmetric multiprocessing (SMP) statistical systems to the Hadoop architecture is the locality of the data. On traditional SMP platforms, multiple processors share access to a single main memory resource.

In Hadoop, HDFS replicates blocks of data across multiple nodes and racks. Also, statistical algorithms that were designed to process data in memory must now adapt to data sets that span multiple nodes and racks and have no hope of fitting into a single machine’s memory.
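A common way to make that adaptation is to recast a calculation as partial aggregates that each node computes over its local blocks and that can then be merged. The sketch below, a hypothetical helper class rather than any Hadoop API, does this for a mean and variance.

// Hypothetical sketch: mergeable partial summaries, so each node reduces its
// local blocks to three numbers and only these small summaries travel.
public class PartialStats {
    private long count = 0;
    private double sum = 0.0;
    private double sumOfSquares = 0.0;

    // Each node folds its local values into a partial summary.
    public void add(double value) {
        count++;
        sum += value;
        sumOfSquares += value * value;
    }

    // Summaries from different nodes merge exactly, in any order.
    public PartialStats merge(PartialStats other) {
        PartialStats merged = new PartialStats();
        merged.count = this.count + other.count;
        merged.sum = this.sum + other.sum;
        merged.sumOfSquares = this.sumOfSquares + other.sumOfSquares;
        return merged;
    }

    public double mean() {
        return sum / count;
    }

    // Population variance via E[X^2] - (E[X])^2; fine for a sketch, though a
    // production version would use a more numerically stable update.
    public double variance() {
        double m = mean();
        return sumOfSquares / count - m * m;
    }
}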