How to Summarize a Dataset in R
If you need a quick overview of your dataset, you can always call str() and look at its structure. But that tells you only the classes of your variables and the number of observations. Likewise, the function head() gives you, at best, an idea of the way the data is stored in the dataset.
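As a quick illustration of those two functions, here is a minimal sketch. It assumes the built-in mtcars data frame as a stand-in for your own dataset; that choice is ours, not something the text above prescribes.

```r
# mtcars ships with base R and stands in here for "your dataset".
str(mtcars)       # one line per variable: its class and the first few values

# head() returns the first rows exactly as they are stored.
first_rows <- head(mtcars, 3)
print(first_rows)
```

Both calls show you how the data is laid out, but neither tells you how the values are distributed.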
How to get the output
To get a better idea of the distribution of your variables in the dataset, you can use the summary() function like this:
> summary(cars)
      mpg             cyl             am         gear
 Min.   :10.40   Min.   :4.000   auto  :13   3:15
 1st Qu.:15.43   1st Qu.:4.000   manual:19   4:12
 Median :19.20   Median :6.000               5: 5
 Mean   :20.09   Mean   :6.188
 3rd Qu.:22.80   3rd Qu.:8.000
 Max.   :33.90   Max.   :8.000
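The summary() function works best if you just use R interactively at the command line for scanning your dataset quickly. You shouldn’t rely on it within a custom function you wrote yourself: for a data frame, it returns a table of formatted text meant for reading, not values you can calculate with.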
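If you do need descriptive statistics inside your own function, one option is to compute them directly. The following is only a sketch: describe_numeric() is a hypothetical helper invented for this example, and mtcars$mpg is used as assumed sample data.

```r
# Hypothetical helper (not part of R): returns the statistics as a
# named numeric vector, so the result can be used in later calculations.
describe_numeric <- function(x) {
  c(min       = min(x, na.rm = TRUE),
    median    = median(x, na.rm = TRUE),
    mean      = mean(x, na.rm = TRUE),
    max       = max(x, na.rm = TRUE),
    n_missing = sum(is.na(x)))
}

stats <- describe_numeric(mtcars$mpg)
stats["max"] - stats["min"]   # plain numbers, so arithmetic works

# By contrast, summary() on a whole data frame yields formatted text.
is.character(summary(mtcars))
```

Returning a named vector keeps the numbers as numbers, which is usually what the rest of a function needs.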
The output of the summary() function shows you a set of descriptive statistics for every variable, depending on the type of the variable:
Numerical variables: summary() gives you the range, quartiles, median, and mean.
Factor variables: summary() gives you a table with frequencies.
Numerical and factor variables: summary() gives you the number of missing values, if there are any.
Character variables: summary() doesn’t give you any information at all apart from the length, the class, and the mode (both of which are ‘character’).
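To see these cases side by side, you can try summary() on a small data frame with one column of each type. The data frame demo and its values are made up purely for this sketch.

```r
# Made-up example data: one numeric, one factor, one character column,
# with a missing value in the numeric and the factor column.
demo <- data.frame(
  num = c(1.5, 2.5, NA, 4.0),
  fac = factor(c("a", "b", "a", NA)),
  chr = c("x", "y", "z", "x"),
  stringsAsFactors = FALSE
)

summary(demo$num)   # range, quartiles, median, mean, and the NA count
summary(demo$fac)   # frequency per level, and the NA count
summary(demo$chr)   # only Length, Class, and Mode

summary(demo)       # all three columns in one table
```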
How to fix a problem
Did you see the weird values for the variable cyl? A quick look at the summary can tell you there’s something fishy going on: for example, the minimum and the first quartile have exactly the same value. In fact, the variable cyl has only three distinct values (4, 6, and 8) and would be better off as a factor. So, let’s put that variable out of its misery:
> cars$cyl <- as.factor(cars$cyl)
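You can check that the conversion did what you wanted by summarizing the variable again. The sketch below assumes a stand-in for the cars data frame built from three columns of the built-in mtcars data; how the original cars object was created isn't shown here, so treat that first line as an assumption.

```r
# Assumption: a stand-in for the `cars` data frame, taken from mtcars.
# (This shadows R's built-in `cars` dataset for the current session.)
cars <- mtcars[, c("mpg", "cyl", "gear")]

summary(cars$cyl)                  # numeric: quartiles and a mean

cars$cyl <- as.factor(cars$cyl)    # the fix from the text

levels(cars$cyl)                   # "4" "6" "8"
summary(cars$cyl)                  # now a frequency count per level
```

After the conversion, summary() reports how many cars have each cylinder count, which is far more useful than a mean.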