Robust Statistics and Big Data

Statistics for Big Data For Dummies

A statistic is said to be robust if it isn’t strongly influenced by the presence of outliers. The mean is not robust, because a single extreme value can shift it substantially. The median, on the other hand, is robust: a few extreme values change it little or not at all.

For example, suppose the following data represents a sample of household incomes in a small town (measured in thousands of dollars per year):

32, 47, 20, 25, 56

You compute the sample mean as the sum of the five observations divided by five:

(32 + 47 + 20 + 25 + 56) / 5 = 180 / 5 = 36

The sample mean is $36,000 per year. Every household in the sample has an income within $20,000 of this value.

Suppose instead that the sample consists of the following values:

32, 47, 20, 25, 376

Because the household income of $376,000 is substantially greater than the next highest household income of $47,000, the household income of $376,000 can be considered to be an outlier.

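With the outlier, the sample mean is now as follows:

(32 + 47 + 20 + 25 + 376) / 5 = 500 / 5 = 100

At $100,000 per year, this measure isn’t representative of most of the households in the town: four of the five households earn less than half that amount. Thus, the usefulness of the mean is compromised in the presence of outliers.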
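The two sample means can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are illustrative and not part of the original text.

```python
from statistics import mean

# Household incomes in thousands of dollars per year (from the text).
incomes = [32, 47, 20, 25, 56]
incomes_with_outlier = [32, 47, 20, 25, 376]

# The mean is the sum of the observations divided by their count.
mean_without = mean(incomes)
mean_with = mean(incomes_with_outlier)

print("Mean without outlier:", mean_without)
print("Mean with outlier:   ", mean_with)

# How far a single extreme value moved the mean.
print("Shift caused by the outlier:", mean_with - mean_without)
```

Replacing one observation out of five is enough to move the mean from 36 to 100, which is the sensitivity the text describes.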

You compute the median of the sample by sorting the data from lowest to highest and then finding the value which divides the sample in half. In other words, half of the observations are below the median, and half are above.

The first sample:

32, 47, 20, 25, 56

The sorted sample:

20, 25, 32, 47, 56

In this case, the median is 32 because half of the remaining observations are below 32 and half are above it.

The second sample:

32, 47, 20, 25, 376

The sorted sample:

20, 25, 32, 47, 376

Despite the presence of the outlier of 376, the median is still 32. It hasn’t been affected by the outlier. This shows that unlike the mean, the median is robust with respect to outliers.
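The sort-and-take-the-middle procedure can be sketched in Python as follows. The helper name median_by_sorting is hypothetical. For an even number of observations it averages the two middle values, which is a common convention but an assumption here, since the text only covers samples with an odd number of observations.

```python
from statistics import median

def median_by_sorting(values):
    """Return the median by sorting and picking the middle value."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single middle observation.
        return ordered[middle]
    # Even count (assumed convention): average the two middle observations.
    return (ordered[middle - 1] + ordered[middle]) / 2

first_sample = [32, 47, 20, 25, 56]
second_sample = [32, 47, 20, 25, 376]

print("Median of first sample: ", median_by_sorting(first_sample))
print("Median of second sample:", median_by_sorting(second_sample))

# The standard library's median agrees with the hand-written helper.
print(median(first_sample) == median_by_sorting(first_sample))
```

Both samples share the same middle value after sorting, so the outlier leaves the median unchanged.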

Other examples of robust statistics include the median absolute deviation and the interquartile range.
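As a rough illustration, both measures of spread can be computed for the two income samples with Python's standard library. The helper names are hypothetical. Quartile conventions differ between textbooks and software, and with only five observations the interquartile range depends on which one is used; the "inclusive" method of statistics.quantiles is chosen here as an assumption.

```python
from statistics import median, quantiles

def median_absolute_deviation(values):
    """Median of the absolute distances from the sample median."""
    center = median(values)
    return median(abs(x - center) for x in values)

def interquartile_range(values):
    """Third quartile minus first quartile (inclusive method, an assumption)."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

first_sample = [32, 47, 20, 25, 56]
second_sample = [32, 47, 20, 25, 376]

for name, sample in [("first", first_sample), ("second", second_sample)]:
    print(name, "sample:",
          "MAD =", median_absolute_deviation(sample),
          "IQR =", interquartile_range(sample))
```

Under these assumptions, both statistics come out the same for the two samples, so the outlier of 376 does not affect either of them.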

About This Article

About the book author:

Alan Anderson, PhD is a teacher of finance, economics, statistics, and math at Fordham and Fairfield universities as well as at Manhattanville and Purchase colleges. Outside of the academic environment he has many years of experience working as an economist, risk manager, and fixed income analyst. Alan received his PhD in economics from Fordham University, and an M.S. in financial engineering from Polytechnic University.

David Semmelroth has two decades of experience translating customer data into actionable insights across the financial services, travel, and entertainment industries. David has consulted for Cedar Fair, Wachovia, National City, and TD Bank.