How to Summarize a Dataset in R - dummies

How to Summarize a Dataset in R

By Andrie de Vries, Joris Meys

If you need a quick overview of your dataset, you can always use the R command str() to look at its structure. But that tells you only the classes of your variables and the number of observations. Likewise, the function head() gives you, at best, an idea of how the data is stored in the dataset.
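Here is a minimal sketch of those two commands. The article doesn't show how the cars data frame was created, so the construction below is an assumption: a four-column subset of R's built-in mtcars data, with am and gear turned into factors. Because of that, the values you see may not match the output printed in this article exactly.

```r
# Assumption: `cars` is a subset of the built-in mtcars data
# (this construction is illustrative, not taken from the article)
cars <- mtcars[c("mpg", "cyl", "am", "gear")]
cars$am   <- factor(cars$am, labels = c("auto", "manual"))
cars$gear <- ordered(cars$gear)

str(cars)      # class of each variable plus the number of observations
head(cars, 3)  # first three rows, to see how the values are stored
```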

How to get the output

To get a better idea of the distribution of your variables in the dataset, you can use the summary() function like this:

> summary(cars)
      mpg             cyl              am      gear
 Min.   :10.40   Min.   :4.000   auto  :13   3:15
 1st Qu.:15.43   1st Qu.:4.000   manual:19   4:12
 Median :19.20   Median :6.000               5: 5
 Mean   :20.09   Mean   :6.188
 3rd Qu.:22.80   3rd Qu.:8.000
 Max.   :33.90   Max.   :8.000

The summary() function works best when you use R interactively at the command line to scan your dataset quickly. You shouldn’t rely on it within a custom function you wrote yourself: for a data frame, it returns a table of formatted text meant for reading, not values you can calculate with.
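If you do need descriptive statistics inside your own function, one option is to compute them directly. The sketch below is only an illustration: numeric_overview() is a hypothetical helper, not something from base R or from this article, and it runs on two columns of the built-in mtcars data.

```r
# Hypothetical helper: returns plain numbers that other code can use,
# instead of the formatted table that summary() prints
numeric_overview <- function(df) {
  # Keep only the numeric columns
  num_cols <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    variable = names(num_cols),
    mean     = unname(vapply(num_cols, mean, numeric(1), na.rm = TRUE)),
    median   = unname(vapply(num_cols, median, numeric(1), na.rm = TRUE))
  )
}

numeric_overview(mtcars[c("mpg", "cyl")])
```

The result is an ordinary data frame, so you can subset it, sort it, or pass it on to other functions.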

The output of the summary() function shows you a set of descriptive statistics for every variable, depending on the type of the variable:

  • Numerical variables: summary() gives you the range, quartiles, median, and mean.

  • Factor variables: summary() gives you a table with frequencies.

  • Numerical and factor variables: summary() gives you the number of missing values, if there are any.

  • Character variables: summary() doesn’t give you any information at all apart from the length, the class (which is ‘character’), and the mode.
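You can check each of the cases in the list above on a few small vectors. The values here are made up purely for illustration and are not part of the cars data.

```r
# Small made-up vectors, one per variable type (illustrative data only)
num <- c(10.4, 19.2, NA, 33.9)
fac <- factor(c("auto", "manual", NA, "manual"))
chr <- c("a", "b", "c")

summary(num)  # min, quartiles, median, mean, max, plus a count of NA's
summary(fac)  # frequency of each level, plus a count of NA's
summary(chr)  # only the length, class, and mode
```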

How to fix a problem

Did you see the weird values for the variable cyl? A quick look at the summary can tell you there’s something fishy going on: for example, the minimum and the first quartile have exactly the same value. In fact, the variable cyl has only three distinct values (4, 6, and 8) and would be better off as a factor. So, let’s put that variable out of its misery:

> cars$cyl <- as.factor(cars$cyl)
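To confirm that the conversion did what you wanted, you can look at the levels and summarize the variable again. This sketch assumes cars is built from R's built-in mtcars data, which the article doesn't state explicitly.

```r
# Assumption: `cars` is a subset of the built-in mtcars data
cars <- mtcars[c("mpg", "cyl", "am", "gear")]
cars$cyl <- as.factor(cars$cyl)

levels(cars$cyl)   # the distinct cylinder counts, now stored as levels
summary(cars$cyl)  # a frequency table instead of quartiles and a mean
```

As a factor, cyl is summarized with one count per level, which is far more informative than a mean of cylinder counts.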