How to Use Apply to Create Tabular Summaries in R

By Andrie de Vries, Joris Meys

You use tapply() to create tabular summaries of subgroups in your data in R. Its three main arguments are:

  • X: A vector

  • INDEX: A factor or list of factors

  • FUN: A function

For example, calculate the mean sepal length in the dataset iris:

> tapply(iris$Sepal.Length, iris$Species, mean)
    setosa versicolor  virginica
     5.006      5.936      6.588

With this short line of code, you do some powerful stuff. You tell R to take the Sepal.Length column, split it according to Species, and then calculate the mean for each group.

This is an important idiom for writing code in R, and it usually goes by the name Split, Apply, and Combine (SAC). In this case, you split a vector into groups, apply a function to each group, and then combine the result into a vector.
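To see the three steps separately, you can spell them out by hand. The following is a minimal sketch, not the way tapply() is implemented internally; the object names groups, group_means, and tapply_means are made up for this illustration:

```r
# Split-apply-combine spelled out step by step on the built-in iris data.
# Split: a list with one numeric vector per species.
groups <- split(iris$Sepal.Length, iris$Species)

# Apply and combine: run mean() on each piece and collect a named vector.
group_means <- sapply(groups, mean)

# tapply() does the same work in a single call.
tapply_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Both approaches give the same numbers.
all.equal(unname(group_means), as.vector(tapply_means))
```

The only visible difference is the container: sapply() returns a named vector here, whereas tapply() returns a one-dimensional array.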

Of course, using the with() function, you can write your line of code in a slightly more readable way:

> with(iris, tapply(Sepal.Length, Species, mean))
    setosa versicolor  virginica
     5.006      5.936      6.588

Using tapply(), you also can create more complex tables to summarize your data. You do this by using a list as your INDEX argument.

How to use tapply() to create higher-dimensional tables

For example, try to summarize the data frame mtcars, a built-in data frame with data about motor-car engines and performance. As with any object, you can use str() to inspect its structure:

> str(mtcars)

The variable am is a numeric vector that indicates whether the car has an automatic (0) or manual (1) gearbox. Because this isn’t very descriptive, start by creating a new object, cars, that is a copy of mtcars, and change the column am to be a factor:

> cars <- within(mtcars,
+   am <- factor(am, levels=0:1, labels=c("Automatic", "Manual"))
+ )
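If you want to confirm that the conversion worked, a quick check along these lines can help. This is only a sketch; it repeats the within() call so that it runs on its own:

```r
# Rebuild the cars data frame from the text so this block is self-contained.
cars <- within(mtcars,
  am <- factor(am, levels = 0:1, labels = c("Automatic", "Manual"))
)

class(cars$am)     # am is now a factor
levels(cars$am)    # with descriptive labels
table(cars$am)     # number of cars for each type of gearbox
class(mtcars$am)   # the original mtcars still holds the numeric column
```

Because within() returns a modified copy, mtcars itself stays unchanged.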

Now use tapply() to find the mean miles per gallon (mpg) for each type of gearbox:

> with(cars, tapply(mpg, am, mean))
Automatic    Manual
 17.14737  24.39231

Yes, you’re correct. This is still only a one-dimensional table. Now, try to make a two-dimensional table with the type of gearbox (am) and number of gears (gear):

> with(cars, tapply(mpg, list(gear, am), mean))
  Automatic Manual
3  16.10667     NA
4  21.05000 26.275
5        NA 21.380
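The NA values in this table mark combinations of gear and am for which mtcars contains no cars. One way to check this, sketched here with the hypothetical object names means and counts, is to compare the table of means with a table of counts:

```r
# Rebuild the cars data frame from the text so this block is self-contained.
cars <- within(mtcars,
  am <- factor(am, levels = 0:1, labels = c("Automatic", "Manual"))
)

# Mean mpg for each combination of gear and gearbox type.
means <- with(cars, tapply(mpg, list(gear, am), mean))

# Number of cars in each combination.
counts <- with(cars, table(gear, am))
counts

# The NA cells in means are exactly the cells with a count of zero.
all(as.vector(is.na(means)) == as.vector(counts == 0))
```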

This is similar to what the table() function does. However, table() can create only contingency tables (that is, tables of counts), whereas tapply() accepts any function as the aggregation function. In other words, with tapply(), you can calculate counts, means, or any other summary statistic.
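To illustrate, here is a small sketch that swaps in different aggregation functions. The object names med, n, and spread, and the anonymous function that computes a range width, are made up for this example:

```r
# Rebuild the cars data frame from the text so this block is self-contained.
cars <- within(mtcars,
  am <- factor(am, levels = 0:1, labels = c("Automatic", "Manual"))
)

# Median mpg instead of the mean.
med <- with(cars, tapply(mpg, list(gear, am), median))

# Counts: length() returns the number of cars in each group.
n <- with(cars, tapply(mpg, list(gear, am), length))

# Any function of a vector works, including one you write yourself.
spread <- with(cars, tapply(mpg, list(gear, am), function(x) max(x) - min(x)))

med
n
spread
```

Note one difference from table(): for a combination with no cars, tapply() leaves the cell as NA, whereas table() reports a count of 0.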

If you want to compute summary statistics for a single vector, tapply() is very useful and quick to use.

How to use aggregate()

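Another R function that does something very similar is aggregate(). Instead of a matrix, it returns a data frame with one row for each combination of groups that actually occurs in the data: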

> with(cars, aggregate(mpg, list(gear=gear, am=am), mean))
  gear        am        x
1    3 Automatic 16.10667
2    4 Automatic 21.05000
3    4    Manual 26.27500
4    5    Manual 21.38000
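The two functions differ mainly in the shape of the result. The sketch below, which uses the hypothetical object names tab and agg, puts the two results side by side:

```r
# Rebuild the cars data frame from the text so this block is self-contained.
cars <- within(mtcars,
  am <- factor(am, levels = 0:1, labels = c("Automatic", "Manual"))
)

# tapply() returns a matrix with one cell per combination, including empty ones.
tab <- with(cars, tapply(mpg, list(gear = gear, am = am), mean))

# aggregate() returns a data frame with one row per combination found in the data.
agg <- with(cars, aggregate(mpg, list(gear = gear, am = am), mean))

is.matrix(tab)
is.data.frame(agg)

# The number of rows in agg equals the number of non-NA cells in tab.
nrow(agg) == sum(!is.na(tab))
```

A data frame is often the more convenient shape if you plan to merge or plot the summary, while the matrix is easier to read as a cross-table.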

Next, you take aggregate() to new heights using the formula interface.
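As a brief preview, the same summary might look like this with the formula interface. This is a minimal sketch, not taken from the text above; the object name agg_formula is made up for the example:

```r
# Rebuild the cars data frame from the text so this block is self-contained.
cars <- within(mtcars,
  am <- factor(am, levels = 0:1, labels = c("Automatic", "Manual"))
)

# Read the formula as "mpg, grouped by gear and am".
agg_formula <- aggregate(mpg ~ gear + am, data = cars, FUN = mean)
agg_formula
```

With a formula, the result column keeps the name mpg instead of the generic x.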