Understanding Data in Long and Wide Formats in R

Andrie de Vries

Joris Meys

Updated

2016-03-26 07:30:34

From the book

R For Dummies

When reshaping data in R, it helps to recognize data in long and wide formats. These visual metaphors describe two ways of representing the same information.

You can recognize data in wide format by the fact that columns generally represent groups. So, our example of basketball games is in wide format, because there is a column for the baskets made by each of the participants:

  Game  Venue Granny Geraldine Gertrude
1  1st Bruges     12         5       11
2  2nd  Ghent      4         4        5
3  3rd  Ghent      5         2        6
4  4th Bruges      6         4        7
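
To experiment with this table yourself, you could build it as a data frame. The following is a minimal sketch using the values printed above; the object name wide is an assumption chosen for illustration and does not come from the original text.

```r
# Wide format: one row per game, one column of baskets per player.
# The name `wide` is a hypothetical choice for this sketch.
wide <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7)
)

print(wide)
dim(wide)  # 4 rows (games) by 5 columns
```

Each player gets her own column, which is the telltale sign of wide format.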

In contrast, have a look at the long format of exactly the same data:

   Game  Venue  variable value
1   1st Bruges    Granny    12
2   2nd  Ghent    Granny     4
3   3rd  Ghent    Granny     5
4   4th Bruges    Granny     6
5   1st Bruges Geraldine     5
6   2nd  Ghent Geraldine     4
7   3rd  Ghent Geraldine     2
8   4th Bruges Geraldine     4
9   1st Bruges  Gertrude    11
10  2nd  Ghent  Gertrude     5
11  3rd  Ghent  Gertrude     6
12  4th Bruges  Gertrude     7

Notice how, in the long format, the three columns for Granny, Geraldine, and Gertrude have disappeared. In their place, you now have a column called value that contains the actual score, and a column called variable that links the score to one of the three ladies.
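
One way to see how the two layouts relate is to stack the player columns by hand. The sketch below uses only base R and assumes the wide data frame is stored in an object called wide (a hypothetical name). Packages such as reshape2, whose melt() function uses variable and value as its default column names, can do the same in a single call, but they are not required here.

```r
# Rebuild the wide table so this example runs on its own.
wide <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7)
)

players <- c("Granny", "Geraldine", "Gertrude")

# Long format: repeat the Game/Venue pairs once per player,
# label each block of rows with the player's name,
# and stack the three score columns into a single value column.
long <- data.frame(
  Game     = rep(wide$Game, times = length(players)),
  Venue    = rep(wide$Venue, times = length(players)),
  variable = rep(players, each = nrow(wide)),
  value    = unlist(wide[players], use.names = FALSE)
)

print(long)
nrow(long)  # 12 rows: 4 games times 3 players
```

The 12 scores are the same as before; only their arrangement has changed.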

When converting data between long and wide formats, it’s important to be able to distinguish identifier variables from measured variables:

Identifier variables: Identifier, or ID, variables identify the observations. Think of these as the key that identifies your observations. (In database design, these are called primary or secondary keys.)
Measured variables: These represent the measurements you observed.

In our example, the identifier variables are Game and Venue, while the measured variables are the baskets scored (that is, the columns Granny, Geraldine, and Gertrude).
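
The split between the two kinds of variables can be written down explicitly in R. This is a rough sketch under the assumption that the wide table is stored as wide; the names id_vars and measured_vars are likewise hypothetical.

```r
# Rebuild the wide table so this example runs on its own.
wide <- data.frame(
  Game      = c("1st", "2nd", "3rd", "4th"),
  Venue     = c("Bruges", "Ghent", "Ghent", "Bruges"),
  Granny    = c(12, 4, 5, 6),
  Geraldine = c(5, 4, 2, 4),
  Gertrude  = c(11, 5, 6, 7)
)

# Identifier variables name the observation; everything else is measured.
id_vars       <- c("Game", "Venue")
measured_vars <- setdiff(names(wide), id_vars)
measured_vars  # "Granny" "Geraldine" "Gertrude"

# A usable key identifies each row uniquely, so no Game/Venue
# combination should appear twice. anyDuplicated() returns 0 in that case.
anyDuplicated(wide[id_vars]) == 0  # TRUE
```

Checking that the identifier columns are free of duplicates before reshaping is a cheap way to catch a key that does not really identify the observations.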

About This Article

About the book author:

Andrie de Vries is a leading R expert and Business Services Director for Revolution Analytics. With over 20 years of experience, he provides consulting and training services in the use of R.

Joris Meys is a statistician, R programmer and R lecturer with the faculty of Bio-Engineering at the University of Ghent.
