 How to Spot Statistical Variability in a Histogram - dummies

# How to Spot Statistical Variability in a Histogram

You can get a sense of variability in a statistical data set by looking at its histogram. For example, if the data are all the same, they are all placed into a single bar, and there is no variability. If an equal amount of data is in each of several groups, the histogram looks flat with the bars close to the same height; this signals a fair amount of variability.

The idea that a flat histogram indicates a fair amount of variability may go against your intuition, and if it does, you’re not alone. If you think a flat histogram means no variability, you’re probably thinking of a time chart, where single numbers are plotted over time. Remember, though, that a histogram doesn’t show data over time; it shows all the data at once, grouped by value. A flat histogram means the data are spread evenly across the whole range of values, hence high variability.
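To see the two extremes side by side, here is a minimal Python sketch. The data sets `all_same` and `flat` and the helper `bar_heights` are made up for illustration; they are not from any real study. The sketch uses the population standard deviation (`pstdev`) as the measure of spread.

```python
from statistics import pstdev

# Hypothetical data sets, chosen only to illustrate the two extremes.
all_same = [5] * 12             # every value is identical: one single bar
flat = [1, 2, 3, 4, 5, 6] * 2   # two values in each of six groups: flat bars

def bar_heights(data):
    """Return {value: count}, i.e. the height of the bar for each distinct value."""
    return {value: data.count(value) for value in sorted(set(data))}

for name, data in [("all_same", all_same), ("flat", flat)]:
    print(name, bar_heights(data), "spread:", round(pstdev(data), 2))
```

The single-bar data set has a spread of exactly zero, while the flat one, with the same number of values, does not.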

Equally interesting is the idea that a histogram with a big lump in the middle and tails sloping sharply down on each side actually has less variability than a flat histogram covering the same range. The hill-shaped parts of a histogram represent clumps of data that are close together, hence low variability.
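A quick way to convince yourself is to compare a flat histogram and a hill-shaped one over the same groups. The bar heights below (`flat_counts`, `hill_counts`) are hypothetical, and the helper `expand` is just an illustrative way to turn bar heights back into a list of values.

```python
from statistics import pstdev

# Hypothetical bar heights over the same five groups (values 1 to 5),
# each histogram holding 20 observations in total.
flat_counts = {1: 4, 2: 4, 3: 4, 4: 4, 5: 4}
hill_counts = {1: 1, 2: 4, 3: 10, 4: 4, 5: 1}

def expand(counts):
    """Turn {value: count} bar heights back into a flat list of values."""
    return [value for value, n in counts.items() for _ in range(n)]

def draw(counts):
    """Print a sideways text histogram, one row of '#' per group."""
    for value, n in counts.items():
        print(f"{value}: {'#' * n}")

flat_sd = pstdev(expand(flat_counts))
hill_sd = pstdev(expand(hill_counts))

draw(flat_counts)
print("flat spread:", round(flat_sd, 2))
draw(hill_counts)
print("hill spread:", round(hill_sd, 2))
```

Both histograms cover the same range and hold the same number of values, yet the hill has the smaller standard deviation because most of its data sit right next to the mean.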

Variability in a histogram is higher when the taller bars are spread out away from the mean and lower when the taller bars are close to the mean. For the Best Actress Academy Award winners’ ages shown in the above figure, many actresses are in the 30–35 age range, and most are between 20 and 50 years old, which is quite a diverse range. Then you have the outliers: a few older actresses (7 of them) who spread the data out farther, increasing the data’s overall variability.
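If all you have is the histogram itself, you can still get a rough estimate of the spread by treating every value in a bar as if it sat at the middle of that bar. The sketch below does that for hypothetical age groups and bar heights. These numbers are invented for illustration and are not the Best Actress data; `binned_sd` is an illustrative helper, not a library function.

```python
from math import sqrt

# Hypothetical age groups and bar heights (NOT the actual Best Actress data).
bins = [(20, 25), (25, 30), (30, 35), (35, 40), (40, 45), (45, 50),
        (60, 65), (75, 80)]
counts = [6, 18, 25, 14, 8, 4, 3, 2]

def binned_sd(bins, counts):
    """Approximate the standard deviation from bar heights alone,
    treating every value in a bar as the midpoint of that bar."""
    mids = [(lo + hi) / 2 for lo, hi in bins]
    n = sum(counts)
    mean = sum(m * c for m, c in zip(mids, counts)) / n
    variance = sum(c * (m - mean) ** 2 for m, c in zip(mids, counts)) / n
    return sqrt(variance)

sd_all = binned_sd(bins, counts)            # every bar, including the far-right ones
sd_bulk = binned_sd(bins[:6], counts[:6])   # only the bars from 20 to 50

print("spread with the far-right bars:   ", round(sd_all, 1))
print("spread without the far-right bars:", round(sd_bulk, 1))
```

The handful of values in the two far-right bars is small compared with the rest, but because those bars sit so far from the mean, they raise the estimated spread noticeably.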

The most common statistic used to measure variability in a data set is the standard deviation, which in a rough sense measures the “average” or “typical” distance of the data from the mean. The standard deviation for the Best Actress age data is 11.35 years. That is fairly large in the context of this problem, but keep in mind how it is calculated: the standard deviation is based on distances from the mean, and the mean is pulled toward outliers, so the few older winners influence the standard deviation as well.
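Here is a step-by-step sketch of that calculation in Python. The source doesn’t say whether the 11.35 figure uses the sample or the population formula, so this sketch assumes the population version (divide by the number of values); the sample version divides by one less. The ages are hypothetical, and `typical_distance` is an illustrative name.

```python
from math import isclose, sqrt
from statistics import mean, pstdev

def typical_distance(data):
    """Population standard deviation, computed step by step."""
    center = mean(data)                                # 1. find the mean
    squared = [(x - center) ** 2 for x in data]        # 2. squared distance of each value from the mean
    return sqrt(sum(squared) / len(data))              # 3. average them, then take the square root

# Hypothetical ages (NOT the actual Best Actress data).
ages = [26, 29, 31, 33, 34, 38, 41]
with_outlier = ages + [80]   # add one much older winner

# The hand-rolled version agrees with the standard library.
print(isclose(typical_distance(ages), pstdev(ages)))

for label, data in [("without outlier", ages), ("with outlier", with_outlier)]:
    print(label, "mean:", round(mean(data), 1),
          "standard deviation:", round(typical_distance(data), 1))
```

Adding a single large value drags the mean upward and increases the distances used in step 2, so the standard deviation grows along with it.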