How to Plot Quantiles for Subgroups in R - dummies

By Andrie de Vries, Joris Meys

Often you want to split up data analysis for different subgroups in R in order to compare them. You need to do this if you want to know how the average lip size compares between male and female kissing gouramis (great fish by the way!) or, in the case of our example, you want to know whether the number of cylinders in a car influences the mileage.

Of course you can use tapply() to calculate any descriptive statistic for subgroups defined by a factor variable. But R offers more tools for summarizing and comparing subgroups.
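As a minimal sketch of the tapply() approach, the following assumes that the built-in mtcars data set stands in for the car data of the example (it contains both mpg and cyl). The object names med_by_cyl and q1_by_cyl are only illustrative.

```r
# Assumption: mtcars (shipped with base R) plays the role of the car data.

# Median mileage for each number of cylinders
med_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, median)
med_by_cyl

# Extra arguments are passed on to the function, here the first quartile
q1_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, quantile, probs = 0.25)
q1_by_cyl
```

Each result has one element per group, named after the cylinder count.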

One way to quickly compare groups is to construct a box-and-whisker plot from the data. You could construct this plot by calculating the range, the quartiles, and the median for each group, but luckily you can just tell R to do all that for you. For example, if you want to know how the mileage compares between cars with a different number of cylinders, you simply use the boxplot() function:


> boxplot(mpg ~ cyl, data=mtcars)

You supply a simple formula as the first argument to boxplot(). This formula reads as “plot boxes for the variable mpg for the groups defined by the variable cyl.”

This plot uses quantiles to give you an idea of how the data is spread within each subgroup. The line in the middle of each box represents the median, and the edges of the box represent the first and the third quartiles. By default, each whisker extends to the most extreme data point that lies no more than 1.5 times the interquartile range (the distance between the first and the third quartiles) from the edge of the box. Points beyond the whiskers are drawn individually.
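If you want to see the numbers behind the boxes, one option is to ask boxplot() for its statistics without drawing anything. This sketch again assumes the built-in mtcars data set as the car data; the name b is only illustrative.

```r
# Assumption: mtcars stands in for the car data of the example.

# plot = FALSE returns the computed statistics instead of drawing the plot
b <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

b$names  # group labels, in the same order as the columns of b$stats
b$stats  # rows: lower whisker, lower box edge, median, upper box edge, upper whisker
b$out    # values that fall beyond the whiskers
```

Comparing b$stats with b$out shows which observations end up as separate points rather than as part of a whisker.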

To be completely correct, the edges of the box represent the lower and upper hinges from the five-number summary, calculated using the fivenum() function. They’re equal to the quartiles whenever a group has an odd number of observations. With an even number of observations, the results of fivenum() and quantile() may differ a bit due to differences in the details of the calculation.
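To see the difference between hinges and quartiles for yourself, you can compare the two functions on a group with an even count and a group with an odd count. The sketch below assumes the built-in mtcars data set, where the eight-cylinder group has 14 cars and the four-cylinder group has 11; mpg8 and mpg4 are illustrative names.

```r
# Assumption: mtcars stands in for the car data of the example.
mpg8 <- mtcars$mpg[mtcars$cyl == 8]  # 14 observations (even)
mpg4 <- mtcars$mpg[mtcars$cyl == 4]  # 11 observations (odd)

# Even count: the second and fourth values (hinges vs. quartiles) differ
fivenum(mpg8)
quantile(mpg8)

# Odd count: both functions give the same five numbers
fivenum(mpg4)
quantile(mpg4)
```

The minimum, median, and maximum agree in both cases; only the two middle values can differ.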

You can let the whiskers always extend to the minimum and the maximum by setting the range argument of the boxplot() function to 0.
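A short sketch of that setting, again assuming the built-in mtcars data set as the car data. The second call uses plot = FALSE only to check the result; b0 is an illustrative name.

```r
# Assumption: mtcars stands in for the car data of the example.

# Draw the plot with whiskers that always reach the minimum and maximum
boxplot(mpg ~ cyl, data = mtcars, range = 0)

# Same settings without drawing, to inspect the whisker ends
b0 <- boxplot(mpg ~ cyl, data = mtcars, range = 0, plot = FALSE)
b0$stats[c(1, 5), ]  # whisker ends for each group
b0$out               # empty: no points are drawn separately
```

With range = 0, no observation is shown as a separate point, so unusually low or high values are no longer singled out.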