How to Cast Data to Wide Format in R

By Andrie de Vries, Joris Meys

If you have a molten dataset (a dataset in long format), you’re ready to reshape it with R. To illustrate that the process of reshaping keeps all your data intact, try to reconstruct the original:

> dcast(mgoals, Game + Venue ~ variable, sum)
  Game  Venue Granny Geraldine Gertrude
1  1st Bruges     12         5       11
2  2nd  Ghent      4         4        5
3  3rd  Ghent      5         2        6
4  4th Bruges      6         4        7

Can you see how dcast() takes a formula as its second argument? More about that in a minute, but first inspect your results. They should match the original data frame.
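If you want to check the round trip yourself, here is a minimal sketch. It assumes that the original goals data frame holds the values shown in the output above and that mgoals was created with melt() from the reshape2 package, keeping Game and Venue as identifier columns.

```r
library(reshape2)

# Assumed reconstruction of the original wide data frame
goals <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7),
  stringsAsFactors = FALSE
)

# Melt to long format: one row per Game, Venue, and player
mgoals <- melt(goals, id.vars = c("Game", "Venue"))

# Cast back to wide format and compare with the original values
wide <- dcast(mgoals, Game + Venue ~ variable, sum)
all.equal(wide, goals, check.attributes = FALSE)
```

Each Game and Venue combination holds exactly one value per player, so sum() has nothing to add up and the cast simply puts every value back in its own cell.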

Next, you may want to do something more interesting — for example, create a summary by venue and player.

You use the dcast() function to cast a molten data frame. To be clear, you use this to convert from a long format to a wide format, but you also can use this to aggregate into intermediate formats, similar to the way a pivot table works.

The dcast() function takes three main arguments:

  • data: A molten data frame.

  • formula: A formula that specifies how you want to cast the data. This formula takes the form x_variable ~ y_variable, but that is a simplification to make a point. You can use multiple x-variables, multiple y-variables, and even z-variables.

  • fun.aggregate: A function to use if the casting formula results in data aggregation (for example, length(), sum(), or mean()).
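To see what fun.aggregate changes, the sketch below casts the same molten data with three different aggregation functions. The goals data frame is rebuilt from the values shown earlier, which is an assumption about how mgoals was created, and the names totals, averages, and counts are only illustrative.

```r
library(reshape2)

# Assumed reconstruction of the data behind mgoals
goals <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7),
  stringsAsFactors = FALSE
)
mgoals <- melt(goals, id.vars = c("Game", "Venue"))

# Total goals per venue and player
totals <- dcast(mgoals, Venue ~ variable, sum)

# Average goals per game at each venue
averages <- dcast(mgoals, Venue ~ variable, mean)

# Number of games that feed into each cell
counts <- dcast(mgoals, Venue ~ variable, length)
```

Because each venue hosted two games, every cell in counts is 2, and each value in averages is half of the matching value in totals.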

So, to get that summary of venue versus player, you need to use dcast() with a casting formula variable ~ Venue. Note that the casting formula refers to columns in your molten data frame:

> dcast(mgoals, variable ~ Venue, sum)
   variable Bruges Ghent
1    Granny     18     9
2 Geraldine      9     6
3  Gertrude     18    11

If you want to get a table with the venue running down the rows and the player across the columns, your casting formula should be Venue ~ variable:

> dcast(mgoals, Venue ~ variable, sum)
   Venue Granny Geraldine Gertrude
1 Bruges     18         9       18
2  Ghent      9         6       11

It’s actually possible to have more complicated casting formulae. According to the Help page for dcast(), the casting formula takes this format:

x_variable + x_2 ~ y_variable + y_2 ~ z_variable ~ ...

Notice that you can combine several variables in each dimension with the plus sign (+), and you separate each dimension with a tilde (~). A data frame has only two dimensions, so dcast() works with a single tilde. If you have two or more tildes in the formula (that is, you include a z-variable), use the companion function acast() instead; your result will be a multidimensional array.
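A formula with two tildes can be tried with acast(), the reshape2 function that returns arrays rather than data frames. The following sketch assumes the same reconstructed goals data as the earlier output; the name arr is only illustrative.

```r
library(reshape2)

# Assumed reconstruction of the data behind mgoals
goals <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7),
  stringsAsFactors = FALSE
)
mgoals <- melt(goals, id.vars = c("Game", "Venue"))

# Three dimensions: Venue by player by Game
arr <- acast(mgoals, Venue ~ variable ~ Game, sum)

# Inspect the shape, then look up a single cell by name
dim(arr)
arr["Bruges", "Granny", "1st"]
```

The array has one slice per game, and each slice is a venue-by-player matrix.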

So, to get a summary of goals by Venue, player (variable), and Game, you do the following:

> dcast(mgoals, Venue + variable ~ Game, sum)
   Venue  variable 1st 2nd 3rd 4th
1 Bruges    Granny  12   0   0   6
2 Bruges Geraldine   5   0   0   4
3 Bruges  Gertrude  11   0   0   7
4  Ghent    Granny   0   4   5   0
5  Ghent Geraldine   0   4   2   0
6  Ghent  Gertrude   0   5   6   0
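The zeros in this table are cells with no underlying observations (the 2nd game was not played in Bruges, for example), so they show the result of sum() on an empty vector. As a hedged sketch, the fill argument of dcast() can mark such cells as missing instead. The goals data is again an assumed reconstruction, the name games is illustrative, and NA_real_ is used so that the fill value has the same numeric type as the sums.

```r
library(reshape2)

# Assumed reconstruction of the data behind mgoals
goals <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7),
  stringsAsFactors = FALSE
)
mgoals <- melt(goals, id.vars = c("Game", "Venue"))

# Cells without any observations become NA instead of 0
games <- dcast(mgoals, Venue + variable ~ Game, sum, fill = NA_real_)
games
```

This makes it easier to tell a game that was never played at a venue from a game in which a player scored no goals.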

One of the reasons you should understand data in long format is that both of the graphics packages lattice and ggplot2 make extensive use of long format data. The benefit is that you can easily create plots of your data that compare different subgroups.


> library(ggplot2)
> ggplot(mgoals, aes(x=variable, y=value, fill=Game)) + geom_bar(stat="identity")