Summarize Grouped Data with Bars, Boxes, and Whiskers

By John Pezzullo

Sometimes you want to show how a variable varies from one group of subjects to another. For example, blood levels of some enzymes vary among the different races. Two types of graphs are commonly used for this purpose: bar charts and box-and-whiskers plots.

Bar charts

One simple way to display and compare the means of several groups of data is with a bar chart, like the one shown, where the bar height for each race equals the mean (or median, or geometric mean) value of the enzyme level for that race.

image0.jpg

The bar chart becomes even more informative if you indicate the spread of values for each race by placing lines representing one standard deviation above and below the tops of the bars. These lines are usually referred to as error bars, an unfortunate choice of words because here they show how much the data varies, not an error of any kind. Error bars can also represent standard errors or confidence intervals, so always state which quantity yours show.
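To make the arithmetic behind such a chart concrete, here is a minimal Python sketch. The group names and enzyme values are made up for illustration, and the helper names bar_summary and plot_bars are hypothetical. The summary uses only the standard library; the optional plotting helper assumes matplotlib is installed.

```python
from statistics import mean, stdev

# Hypothetical enzyme levels by group (made-up numbers, for illustration only).
levels = {
    "Group A": [42.0, 55.0, 48.0, 61.0, 50.0],
    "Group B": [70.0, 66.0, 81.0, 75.0],
    "Group C": [30.0, 38.0, 35.0, 33.0, 41.0, 36.0],
}


def bar_summary(groups):
    """Return {group: (mean, sd)}: the bar height and the error-bar half-length."""
    return {name: (mean(vals), stdev(vals)) for name, vals in groups.items()}


def plot_bars(groups):
    """Draw bars with +/- 1 SD error bars. Assumes matplotlib is installed;
    it is imported here so that bar_summary works without it."""
    import matplotlib.pyplot as plt

    summary = bar_summary(groups)
    names = list(summary)
    heights = [summary[n][0] for n in names]
    sds = [summary[n][1] for n in names]

    fig, ax = plt.subplots()
    ax.bar(names, heights, yerr=sds, capsize=4)  # yerr draws the error bars
    ax.set_ylabel("Enzyme level (mean +/- 1 SD)")
    return fig


for name, (m, sd) in bar_summary(levels).items():
    print(f"{name}: mean = {m:.1f}, SD = {sd:.1f}")
```

Note that stdev computes the sample standard deviation (dividing by n - 1), which is the usual choice when the groups are samples from larger populations.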

But even with error bars, a bar chart still doesn’t give a very good picture of the distribution of enzyme levels within each group. Are the values skewed? Are there outliers? The mean and SD may not be very informative if the values are distributed log-normally or in another unusual way.
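A quick way to see the problem is to compare a few summary statistics on skewed data. The sketch below uses made-up values with one very large observation; nothing here comes from the enzyme data in the figures.

```python
from statistics import geometric_mean, mean, median

# Hypothetical right-skewed values (made up for illustration).
values = [10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 400.0]

print(f"mean           = {mean(values):.1f}")
print(f"median         = {median(values):.1f}")
print(f"geometric mean = {geometric_mean(values):.1f}")

# Count how many observations fall below the arithmetic mean.
below = sum(v < mean(values) for v in values)
print(f"{below} of {len(values)} values lie below the mean")
```

With these numbers, the single large value pulls the arithmetic mean above six of the seven observations, while the median and geometric mean stay close to the bulk of the data. A bar drawn at the mean would hide that.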

Ideally, you want to show a histogram for each group of subjects, but that may take up way too much space. What should you do? Keep reading to find out.

Box-and-whiskers charts

Fortunately, another kind of graph called a box-and-whiskers plot (or B&W plot, or just box plot) shows, in very little space, a lot of information about the distribution of numbers in one or more groups of subjects. A simple B&W plot of the same enzyme data shown earlier as a bar chart appears below, on the left.

image1.jpg

The B&W figure for each group usually has the following parts:

  • A box spanning the interquartile range (IQR), extending from the first quartile (25th centile) to the third quartile (75th centile) of the data, and therefore encompassing the middle 50 percent of the data

  • A thick horizontal line, drawn at the median (50th centile), which often puts it at or near the middle of the box

  • Dashed lines (whiskers) extending out to the farthest data point that’s not more than 1.5 times the IQR away from the box

  • Individual points lying outside the whiskers, considered outliers
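The parts listed above can be computed by hand, as in the following Python sketch. The function name box_stats is hypothetical. Software packages differ in how they interpolate quartiles; this sketch assumes the "inclusive" method of the standard library's statistics.quantiles, so your plotting package may give slightly different box edges.

```python
from statistics import quantiles


def box_stats(data, whisker=1.5):
    """Compute the parts of a box-and-whiskers figure for one group.

    Assumption: quartiles use the 'inclusive' interpolation method;
    other software may use a different quartile rule.
    """
    values = sorted(data)
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    # Fences: the farthest a whisker is allowed to reach from the box.
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr

    # Whiskers stop at the most extreme data points inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]

    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical data with one unusually large value.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
for part, value in stats.items():
    print(f"{part}: {value}")
```

In this made-up example the value 30 lies more than 1.5 times the IQR above the box, so it is reported as an outlier and the upper whisker stops at 9.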

B&W plots provide a useful summary of the distribution. A median that’s not located near the middle of the box indicates a skewed distribution.

Some software draws the different parts of a B&W plot according to different rules (the horizontal line may be at the mean instead of the median; the box may represent the mean ± 1 standard deviation; the whiskers may extend out to the farthest outliers; and so on). Always check the software’s documentation and provide the description of the parts whenever you present a B&W plot.

Some software provides various enhancements to the basic B&W plot. The figure to the right of the simple box plot illustrates two such embellishments you may consider using:

  • Variable width: The widths of the boxes can be scaled to indicate the relative size of each group. You can see that there are considerably fewer Asians and “others” than whites or blacks.

  • Notches: The box can have notches that indicate the uncertainty in the estimation of the median. If two groups have non-overlapping notches, they probably have significantly different medians. Whites and “others” have similar median enzyme levels, whereas Asians have significantly higher levels and blacks have significantly lower levels.
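The article doesn't say how the notches are calculated. One common convention, used by default in matplotlib and (with a slightly different constant) in R, places the notch at the median plus or minus 1.57 times the IQR divided by the square root of the group size. The sketch below assumes that convention; the function names and data are hypothetical, and your software may use another rule.

```python
from math import sqrt
from statistics import quantiles


def median_notch(data):
    """Return (low, high) notch limits around the median.

    Assumption: notch = median +/- 1.57 * IQR / sqrt(n), with quartiles
    from the 'inclusive' interpolation method.
    """
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / sqrt(len(data))
    return med - half_width, med + half_width


def notches_overlap(a, b):
    """True if the notch intervals of two groups overlap."""
    low_a, high_a = median_notch(a)
    low_b, high_b = median_notch(b)
    return low_a <= high_b and low_b <= high_a


# Hypothetical groups: two similar, one clearly shifted upward.
group_1 = list(range(1, 21))
group_2 = list(range(3, 23))
group_3 = list(range(101, 121))

print(notches_overlap(group_1, group_2))
print(notches_overlap(group_1, group_3))
```

Because the notch width shrinks as the group gets larger, small groups have wide notches, and two small groups can have overlapping notches even when their medians look quite different.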