How to Describe the Variation of Data in R - dummies

By Andrie de Vries, Joris Meys

A single number doesn’t tell you much about your data. Often it’s as important to know the spread of your data. You can use R to look at this spread using a number of different approaches.

First, you can calculate either the variance or the standard deviation to summarize the spread in a single number. R provides the function var() for the variance and sd() for the standard deviation. For example, you calculate the standard deviation of the variable mpg in the data frame cars like this (cars is assumed here to be a data frame built from R’s mtcars data, which is where the mpg column comes from; the cars dataset that ships with R contains only speed and dist):

> sd(cars$mpg)
[1] 6.026948
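
The link between the two functions can be checked directly: the standard deviation is the square root of the variance. The sketch below assumes the mpg values come from R's built-in mtcars data, since the article's cars data frame is not defined in this section. The variable names mpg, mpg_var, mpg_sd and same are only illustrative.

```r
# Assumption: the mpg values come from the built-in mtcars data.
mpg <- mtcars$mpg

mpg_var <- var(mpg)  # sample variance (denominator n - 1)
mpg_sd  <- sd(mpg)   # sample standard deviation

# sd() should equal the square root of var(), up to floating-point rounding
same <- isTRUE(all.equal(sqrt(mpg_var), mpg_sd))
print(same)
```

Because the standard deviation is expressed in the same units as the data (miles per gallon here), it is usually easier to interpret than the variance.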

Besides the mean and the variance, you can also look at the quantiles. A quantile, or percentile, is the value below which a given fraction of your data lies. The 50 percent quantile, for example, is simply the median. Again, R has some convenient functions to help you with looking at the quantiles.
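
As a quick illustration of the claim that the 50 percent quantile is the median, the following sketch compares median() with quantile() at probability 0.5. It again assumes the mpg values come from the built-in mtcars data; the names med and q50 are illustrative.

```r
# Assumption: the mpg values come from the built-in mtcars data.
mpg <- mtcars$mpg

med <- median(mpg)
# names = FALSE drops the "50%" label so only the number is compared
q50 <- quantile(mpg, probs = 0.5, names = FALSE)

print(isTRUE(all.equal(med, q50)))
```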

How to calculate data range in R

The most-used quantiles are actually the 0 percent and 100 percent quantiles. You could just as easily call them the minimum and maximum, because that’s what they are. The range() function returns both at once, giving the same two values as min() and max(). So, to find the two values between which all the mileages lie, you simply do the following:

> range(cars$mpg)
[1] 10.4 33.9
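
The sketch below shows that range() returns the same two numbers as min() and max(), and that diff() turns them into the width of the range. It assumes the mpg values come from the built-in mtcars data; the names mpg_range and mpg_width are illustrative.

```r
# Assumption: the mpg values come from the built-in mtcars data.
mpg <- mtcars$mpg

mpg_range <- range(mpg)  # vector of length 2: minimum, then maximum

# range() agrees with calling min() and max() separately
print(identical(mpg_range, c(min(mpg), max(mpg))))

# Width of the range: maximum minus minimum
mpg_width <- diff(mpg_range)
print(mpg_width)
```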

How to calculate data quartiles in R

The range still gives you only limited information. Often statisticians report the first and the third quartile alongside the range and the median. These quartiles are, respectively, the 25 percent and 75 percent quantiles, that is, the values below which one-fourth and three-fourths of the data lie. You get these numbers using the quantile() function, like this:

> quantile(cars$mpg)
    0%    25%    50%    75%   100%
10.400 15.425 19.200 22.800 33.900

The quartiles are not the same as the lower and upper hinges calculated in the five-number summary. The hinges are, respectively, the medians of the lower and upper halves of your data, and they can differ slightly from the first and third quartiles. To get the five-number summary, you use the fivenum() function.
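
To see the difference between quartiles and hinges, the sketch below puts the two side by side. It assumes the mpg values come from the built-in mtcars data; the names quartiles, five and hinges are illustrative.

```r
# Assumption: the mpg values come from the built-in mtcars data.
mpg <- mtcars$mpg

# First and third quartiles from quantile() with its default settings
quartiles <- quantile(mpg, probs = c(0.25, 0.75), names = FALSE)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
five   <- fivenum(mpg)
hinges <- five[c(2, 4)]

print(quartiles)
print(hinges)
```

How far the two results differ depends on the number of observations and on how the values are spaced, so the gap may be zero for one quartile and nonzero for the other.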

How to get up to speed with the quantile function in R

The quantile() function can give you any quantile you want. For that, you use the probs argument, which takes the probabilities as fractions between 0 and 1. For the 20 percent quantile, for example, you set probs to 0.20. This argument also accepts a vector, so you can, for example, get the 5 percent and 95 percent quantiles like this:

> quantile(cars$mpg, probs=c(0.05, 0.95))
    5%    95%
11.995 31.300

The default value for the probs argument is a vector representing the minimum (0), the first quartile (0.25), the median (0.5), the third quartile (0.75), and the maximum (1).
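
Because probs accepts any vector of fractions, a sequence works too. The sketch below requests the deciles and also checks that spelling out the default probabilities gives the same result as calling quantile() with no probs at all. It assumes the mpg values come from the built-in mtcars data; the names deciles, default_q and explicit_q are illustrative.

```r
# Assumption: the mpg values come from the built-in mtcars data.
mpg <- mtcars$mpg

# Deciles: the 0%, 10%, 20%, ..., 100% quantiles
deciles <- quantile(mpg, probs = seq(0, 1, by = 0.1))
print(deciles)

# The default call and the explicitly spelled-out probabilities should agree
default_q  <- quantile(mpg)
explicit_q <- quantile(mpg, probs = c(0, 0.25, 0.5, 0.75, 1))
print(isTRUE(all.equal(default_q, explicit_q)))
```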

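All of these functions have an argument na.rm that allows you to remove all NA values before calculating the respective statistic. If you leave na.rm at FALSE, var(), sd(), and range() return NA for any vector containing NA, just as sum() does, whereas quantile() stops with an error. Note that fivenum() is the exception: its na.rm argument defaults to TRUE, so it drops missing values unless you tell it not to.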
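
A short sketch of the na.rm argument in action follows. It assumes the mpg values come from the built-in mtcars data, with one NA appended for illustration; all variable names are hypothetical. One detail worth checking on your own R version: quantile() is expected to stop with an error, rather than return NA, when missing values are present and na.rm is left at FALSE, so that call is wrapped in tryCatch().

```r
# Assumption: mpg values from the built-in mtcars data, plus one missing value.
mpg_na <- c(mtcars$mpg, NA)

# Without na.rm, the missing value propagates to the result
sd_with_na <- sd(mpg_na)
print(sd_with_na)

# With na.rm = TRUE, the NA is dropped before the calculation
sd_clean    <- sd(mpg_na, na.rm = TRUE)
range_clean <- range(mpg_na, na.rm = TRUE)
q_clean     <- quantile(mpg_na, na.rm = TRUE)

# quantile() without na.rm: catch the error instead of halting the script
q_result <- tryCatch(quantile(mpg_na), error = function(e) "error")
print(q_result)
```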