The Limitations of Sampling in Hadoop

By Dirk deRoos

Statistical analytics is far from being a new kid on the block, and it is certainly old news that it depends on processing large amounts of data to gain new insight. However, the amount of data traditionally processed by these systems was in the range of tens to hundreds of gigabytes, not the terabyte or petabyte ranges seen today.

And it often required an expensive symmetric multiprocessing (SMP) machine with as much memory as possible to hold the data being analyzed. That's because many of the algorithms used by these analytic approaches were quite compute-intensive and were designed to run in memory, since they required multiple, and often frequent, passes through the data.

Faced with expensive hardware and a pretty high commitment in terms of time and RAM, folks tried to make the analytics workload a bit more reasonable by analyzing only a sampling of the data. The idea was to keep the mountains upon mountains of data safely stashed in data warehouses and to move only a statistically significant sample of the data from those repositories to a statistical engine.
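
As a purely hypothetical sketch of that workflow, the Python snippet below draws a simple random sample from a toy extract before handing it off for analysis; the function name, the 1 percent fraction, and the row count are all invented for illustration:

```python
import random

def simple_random_sample(records, fraction, seed=42):
    """Keep roughly `fraction` of `records`, chosen uniformly at random."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

# Stand-in for rows pulled from a data warehouse (hypothetical).
full_extract = range(1_000_000)

# Ship only about 1 percent of the rows to the statistical engine.
sample = simple_random_sample(full_extract, 0.01)
print(len(sample))  # roughly 10,000 rows
```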

While sampling is a good idea in theory, in practice it is often an unreliable tactic. Finding a statistically significant sample can be challenging for sparse or skewed data sets, which are quite common. A poorly judged sample can introduce outliers and anomalous data points, which can, in turn, bias the results of your analysis.
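
To see how skew can undermine a sample, consider the hypothetical Python sketch below (the population shape, the 0.1 percent rate of extreme values, and the 1 percent sample size are all made up for illustration). A handful of rare, very large values dominates the true mean, and a modest random sample may catch none of them or a few too many:

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical skewed population: 99.9 percent small values,
# 0.1 percent very large values that dominate the overall mean.
population = (
    [rng.uniform(1, 100) for _ in range(99_900)]
    + [rng.uniform(50_000, 100_000) for _ in range(100)]
)

true_mean = statistics.mean(population)
print(f"population mean: {true_mean:,.0f}")

# Five independent 1 percent samples of the same population.
for draw in range(1, 6):
    sample = rng.sample(population, 1_000)
    print(f"sample {draw} mean:  {statistics.mean(sample):,.0f}")
```

Each draw's mean depends heavily on how many of the 100 rare values it happens to include, which is exactly the kind of swing that makes a carelessly chosen sample bias your analysis.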