Understanding Data in Long and Wide Formats in R

By Andrie de Vries, Joris Meys

When reshaping data in R, you need to be able to recognize data in long and wide formats. These visual metaphors describe two ways of representing the same information: wide data spreads the measurements across several columns, whereas long data stacks them in a single column.

You can recognize data in wide format by the fact that columns generally represent groups. So, our example of basketball games is in wide format, because there is a column for the baskets made by each of the participants:

  Game  Venue Granny Geraldine Gertrude
1  1st Bruges     12         5       11
2  2nd  Ghent      4         4        5
3  3rd  Ghent      5         2        6
4  4th Bruges      6         4        7
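
To follow along, the table above can be rebuilt as a data frame. This is a minimal sketch: the object name games_wide is an assumption made here for illustration, not a name taken from the text.

```r
# Wide format: one row per game, one score column per player.
# The name games_wide is hypothetical; the values come from the table above.
games_wide <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7)
)

print(games_wide)
```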

In contrast, have a look at the long format of exactly the same data:

   Game  Venue  variable value
1   1st Bruges    Granny    12
2   2nd  Ghent    Granny     4
3   3rd  Ghent    Granny     5
4   4th Bruges    Granny     6
5   1st Bruges Geraldine     5
6   2nd  Ghent Geraldine     4
7   3rd  Ghent Geraldine     2
8   4th Bruges Geraldine     4
9   1st Bruges  Gertrude    11
10  2nd  Ghent  Gertrude     5
11  3rd  Ghent  Gertrude     6
12  4th Bruges  Gertrude     7

Notice how, in the long format, the three columns for Granny, Geraldine, and Gertrude have disappeared. In their place, you now have a column called value that contains the actual score, and a column called variable that links each score to one of the three ladies.
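
The relationship between the two layouts can be sketched in base R by stacking the three score columns by hand. The object names games_wide, players, and games_long are assumptions made for this illustration. Dedicated reshaping functions, such as melt() in the reshape2 package, produce the same variable and value layout, but the manual version shows exactly where each column ends up.

```r
# Hypothetical object names; the data are the basketball scores shown above.
games_wide <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7)
)

players <- c("Granny", "Geraldine", "Gertrude")

games_long <- data.frame(
  # Repeat the game and venue once for every player
  Game     = rep(games_wide$Game,  times = length(players)),
  Venue    = rep(games_wide$Venue, times = length(players)),
  # Each player's name is repeated once per game
  variable = rep(players, each = nrow(games_wide)),
  # Concatenate the score columns in the same player order
  value    = unlist(games_wide[players], use.names = FALSE)
)

print(games_long)
```

Each of the 4 games contributes one row per player, so the long data frame has 4 x 3 = 12 rows.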

When converting data between long and wide formats, it’s important to be able to distinguish identifier variables from measured variables:

  • Identifier variables: Identifier, or ID, variables identify the observations. Think of these as the key that identifies your observations. (In database design, these are called primary or secondary keys.)

  • Measured variables: These represent the measurements you observed.

In our example, the identifier variables are Game and Venue, while the measured variables are the baskets made by each player (that is, the columns Granny, Geraldine, and Gertrude).
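
As a rough sketch, the split between identifier and measured variables can be written down explicitly and checked in base R. The names games_wide, id_vars, and measure_vars are assumptions chosen for this illustration.

```r
# Hypothetical object names; the data are the basketball scores shown above.
games_wide <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7)
)

# Identifier variables: the columns that tell the observations apart
id_vars <- c("Game", "Venue")

# Measured variables: every remaining column
measure_vars <- setdiff(names(games_wide), id_vars)
print(measure_vars)

# Identifiers should act as a key: no Game/Venue combination occurs twice
ids_are_unique <- anyDuplicated(games_wide[id_vars]) == 0
print(ids_are_unique)

# The measured variables hold the numeric scores
scores_are_numeric <- all(sapply(games_wide[measure_vars], is.numeric))
print(scores_are_numeric)
```

Checking that the identifier columns contain no duplicate combinations is a quick way to confirm they really do single out each observation before you reshape anything.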