Understanding Data in Long and Wide Formats in R
When you reshape data in R, you need to be able to recognize data in long and wide formats. These visual metaphors describe two ways of representing the same information.
You can recognize data in wide format by the fact that columns generally represent groups. So, our example of basketball games is in wide format, because there is a column for the baskets made by each of the participants:
  Game  Venue Granny Geraldine Gertrude
1  1st Bruges     12         5       11
2  2nd  Ghent      4         4        5
3  3rd  Ghent      5         2        6
4  4th Bruges      6         4        7
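If you want to follow along, a data frame like this one can be built by hand. This is a minimal sketch: the object name wide is an assumption, and the values are copied from the table above.

```r
# Wide format: one row per game, one column of baskets per player.
# The name `wide` is illustrative only.
wide <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7)
)

print(wide)
```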
In contrast, have a look at the long format of exactly the same data:
   Game  Venue  variable value
1   1st Bruges    Granny    12
2   2nd  Ghent    Granny     4
3   3rd  Ghent    Granny     5
4   4th Bruges    Granny     6
5   1st Bruges Geraldine     5
6   2nd  Ghent Geraldine     4
7   3rd  Ghent Geraldine     2
8   4th Bruges Geraldine     4
9   1st Bruges  Gertrude    11
10  2nd  Ghent  Gertrude     5
11  3rd  Ghent  Gertrude     6
12  4th Bruges  Gertrude     7
Notice how, in the long format, the three columns for Granny, Geraldine, and Gertrude have disappeared. In their place, you now have a column called value that contains the actual score, and a column called variable that links the score to one of the three ladies.
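To make the relationship between the two layouts concrete, here is one way to build the long version by hand in base R. Treat it as a sketch under stated assumptions: the object names wide, players, and long are made up for illustration, and the manual construction stands in for whichever reshaping function produced the output shown earlier.

```r
# Wide data, values copied from the example table (names are illustrative).
wide <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7)
)

# The columns that will be stacked on top of each other.
players <- c("Granny", "Geraldine", "Gertrude")

# Repeat the game and venue once per player, record which player each
# score belongs to, and stack the three score columns into one.
long <- data.frame(
  Game     = rep(wide$Game, times = length(players)),
  Venue    = rep(wide$Venue, times = length(players)),
  variable = rep(players, each = nrow(wide)),
  value    = unlist(wide[players], use.names = FALSE)
)

print(long)
```

The long data frame has one row per game and player combination (4 games times 3 players, so 12 rows), and no score is lost or duplicated along the way.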
When converting data between long and wide formats, it’s important to be able to distinguish identifier variables from measured variables:
Identifier variables: Identifier, or ID, variables identify the observations. Think of these as the key that identifies your observations. (In database design, these are called primary or secondary keys.)
Measured variables: These represent the measurements you observed.
In our example, the identifier variables are Game and Venue, while the measured variables are the baskets made (that is, the columns Granny, Geraldine, and Gertrude).
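The split between identifier and measured variables can be written down explicitly in code. The following sketch assumes the object names wide, id_vars, and measure_vars, none of which come from the original example. It derives the measured variables as every column that is not an identifier, and checks that the identifiers really do act as a key.

```r
# Wide data, values copied from the example table (names are illustrative).
wide <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7)
)

# Identifier variables: the columns that say which observation a row is.
id_vars <- c("Game", "Venue")

# Measured variables: every remaining column.
measure_vars <- setdiff(names(wide), id_vars)
print(measure_vars)

# A usable key identifies each row uniquely, so no combination of
# identifier values may occur twice. anyDuplicated() returns 0 when
# there are no duplicate rows.
key_is_unique <- anyDuplicated(wide[id_vars]) == 0
print(key_is_unique)
```

Running a check like this before reshaping is a cheap way to catch a wrongly chosen identifier column.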