How to Use Frequencies or Densities with Your Data in R
Breaking up your data into intervals in R always loses some information. The most complete way of describing your data is to estimate the probability density function (PDF), or density, of your variable.
If this concept is unfamiliar to you, don’t worry. Just remember that the density at a given value is proportional to the chance that a value in your data is approximately equal to that value. In fact, for a histogram, the density is calculated from the counts, so the only difference between a histogram with frequencies and one with densities is the scale of the y-axis. Otherwise, they look exactly the same.
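To make the link between counts and densities concrete, here is a minimal sketch. It assumes the mileage values come from the built-in mtcars data set, because the cars data frame used in this article is not defined in this section. The names widths, n, and manual_density are made up for the illustration.

```r
# Assumption: mileage values taken from the built-in mtcars data set.
mpg <- mtcars$mpg

# Compute the histogram without drawing it.
h <- hist(mpg, plot = FALSE)

# Width of each interval and total number of observations.
widths <- diff(h$breaks)
n <- sum(h$counts)

# Density of each bar = count / (number of observations * interval width).
manual_density <- h$counts / (n * widths)

# Compare with the densities that hist() computed itself.
all.equal(manual_density, h$density)
```

Because the default intervals all have the same width, dividing by n * widths only rescales the bars, which is why the two kinds of histogram have the same shape.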
How to create a density plot
You can estimate the density function of a variable using the density() function. The output of this function itself doesn’t tell you that much, but you can easily use it in a plot. For example, you can get the density of the mileage variable mpg like this:
> mpgdens <- density(cars$mpg)
The object you get this way is a list containing a lot of information you don’t really need to look at. But that list makes plotting the density as easy as saying “plot the density”:
> plot(mpgdens)
The plot looks a bit rough on the edges, but the important thing is to see how your data is distributed. The density object is plotted as a line, with the actual values of your data on the x-axis and the density on the y-axis.
The mpgdens list object contains, among other things, an element called x and one called y. These represent the x- and y-coordinates for plotting the density. When R calculates the density, the density() function evaluates the estimate at a number of equally spaced points (512 by default) that cover the range of your data. Those points are the values for x, and the calculated densities are the values for y.
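If you want to look at those x and y elements yourself, something like the following sketch should work. It again assumes the mileage values come from the built-in mtcars data set, and the axis labels are chosen only for the illustration.

```r
# Assumption: mileage values taken from the built-in mtcars data set.
mpgdens <- density(mtcars$mpg)

# The density object is a list; x and y hold the coordinates of the curve.
names(mpgdens)
length(mpgdens$x)   # number of points at which the density was evaluated
length(mpgdens$y)   # one density value per point

# Drawing the curve by hand from x and y gives the same line as plot(mpgdens).
plot(mpgdens$x, mpgdens$y, type = "l",
     xlab = "Miles per gallon", ylab = "Density")
```

Plotting the two elements directly is rarely necessary, but it shows that the density object is nothing more mysterious than a set of coordinates plus some bookkeeping.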
How to plot densities in a histogram
Remember that the hist() function returns the counts for each interval. The chance that a value lies within a certain interval is directly proportional to the count for that interval: the more values you have within an interval, the greater the chance that any value you pick lies in that interval.
So, instead of plotting the counts in the histogram, you could just as well plot the densities. R does all the calculations for you — the only thing you need to do is set the freq argument of hist() to FALSE, like this:
> hist(cars$mpg, col='grey', freq=FALSE)
Now the plot will look exactly the same as before; only the values on the y-axis are different. The scale on the y-axis is set in such a way that you can add the density plot over the histogram. For that, you use the lines() function with the density object as the argument.
So, you can, for example, fancy up the previous histogram a bit further by adding the estimated density using the following code immediately after the previous command:
> lines(mpgdens)
You see the result of these two commands on the right side. Remember that lines() uses the x and y elements from the density object mpgdens to plot the line.
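The reason the density line fits on the histogram is that both are on the same scale: the bars of a density histogram and the estimated density curve each cover an area of about 1. The sketch below checks this. It assumes the mileage values come from the built-in mtcars data set, and bar_area and curve_area are names made up for the illustration.

```r
# Assumption: mileage values taken from the built-in mtcars data set.
mpg <- mtcars$mpg

# Histogram on the density scale, then the estimated density on top.
h <- hist(mpg, col = "grey", freq = FALSE)
mpgdens <- density(mpg)
lines(mpgdens)

# The bars of a density histogram have a total area of 1.
bar_area <- sum(h$density * diff(h$breaks))

# Rough area under the density curve (rectangle rule on the evenly
# spaced x grid); it should be close to 1 as well.
curve_area <- sum(mpgdens$y) * diff(mpgdens$x[1:2])

c(bars = bar_area, curve = curve_area)
```

The curve area is only an approximation, so expect a value near 1 rather than exactly 1.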