# Step by Step: The Empirical Cumulative Distribution Function in R

The *empirical cumulative distribution function* (ecdf) is closely related to cumulative frequency. Rather than show the frequency in an interval, however, the ecdf shows the proportion of scores that are less than or equal to each score.
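To make the definition concrete, here is a minimal sketch using a small made-up vector (the values in `scores` are hypothetical and are not taken from `Cars93`). It shows that the function returned by `ecdf()` gives the same answer as computing the proportion by hand.

```r
# Hypothetical scores, for illustration only
scores <- c(3, 5, 5, 8, 10)

# ecdf() returns a function; call it on any value
Fn <- ecdf(scores)

# Proportion of scores less than or equal to 5, computed two ways
by_hand   <- mean(scores <= 5)   # 3 of the 5 scores are <= 5
from_ecdf <- Fn(5)

print(c(by_hand = by_hand, from_ecdf = from_ecdf))
```

Both values come out to 0.6, because three of the five scores are less than or equal to 5.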

In base R, it’s easy to plot the ecdf:

`plot(ecdf(Cars93$Price), xlab = "Price", ylab = "Fn(Price)")`

This produces the following figure.
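The one-liner above assumes the `Cars93` data frame is already available. As a sketch of a fully runnable version: `Cars93` ships with the `MASS` package, so loading that package first should be enough. The name `price.ecdf` is just an illustrative choice.

```r
# Cars93 lives in the MASS package
library(MASS)

# Build the ecdf once so it can be plotted and also queried
price.ecdf <- ecdf(Cars93$Price)

plot(price.ecdf, xlab = "Price", ylab = "Fn(Price)")

# The ecdf object is itself a function:
# proportion of cars priced at or below 20 (prices are in $1,000s)
print(price.ecdf(20))
```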

The uppercase *F* on the y-axis is a notational convention for a cumulative distribution. The `Fn` means, in effect, “cumulative function,” as opposed to `f` or `fn`, which just means “function.” (The y-axis label could also be `Percentile(Price)`.)

Look closely at the plot. When consecutive points are far apart (like the two on the top right), you can see a horizontal line extending rightward out of a point. (A line extends out of every point, but the lines aren’t visible when the points are bunched up.) Think of this line as a “step,” with the next dot a step higher than the previous one. How much higher? For a score that occurs only once, that would be 1/*N*, where *N* is the number of scores in the sample. (A score that occurs *k* times raises the step by *k*/*N*.) For `Cars93`, 1/*N* is 1/93, which rounds off to .011.
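One way to check the step heights is to inspect the ecdf object directly. The following is a rough sketch, assuming `Cars93` comes from the `MASS` package; the variable names `jump.at` and `step.height` are illustrative.

```r
library(MASS)  # provides Cars93

Fn <- ecdf(Cars93$Price)
N  <- length(Cars93$Price)   # number of scores in the sample

# knots() returns the distinct prices at which the ecdf jumps
jump.at <- knots(Fn)

# Height of each jump: difference between successive ecdf values
step.height <- diff(c(0, Fn(jump.at)))

# Express each jump as a multiple of 1/N
# (1 means the price occurs once, 2 means it occurs twice, and so on)
print(table(round(step.height * N)))
```

Every jump is a whole-number multiple of 1/*N*, and the jumps add up to 1.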

Why is this called an “empirical” cumulative distribution function? Something that’s *empirical* is based on observations, like sample data. Is it possible to have a non-empirical cumulative distribution function (cdf)? Yes — and that’s the cdf of the population that the sample comes from. One important use of the ecdf is as a tool for estimating the population cdf.

So the plotted ecdf is an estimate of the cdf for the population, and the estimate is based on the sample data. To create an estimate, you assign a probability to each point and then add up the probabilities, point by point, from the minimum value to the maximum value. This produces the cumulative probability for each point. The probability assigned to a sample value is the estimate of the proportion of times that value occurs in the population. What is the estimate? That’s the aforementioned 1/*N* for each point (.011, for this sample). For any given value, that might not be the exact proportion in the population. It’s just the best estimate from the sample.
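The add-up-the-probabilities procedure can be sketched in a few lines of base R. This is an illustration rather than how `ecdf()` is implemented internally; it assumes `Cars93` from the `MASS` package, and the names `counts`, `cum.prob`, and `same` are made up for the example.

```r
library(MASS)  # provides Cars93

price <- Cars93$Price
N     <- length(price)

# Count how often each distinct price occurs (table() sorts the prices)
counts <- table(price)

# Each observation contributes 1/N; accumulate from minimum to maximum
cum.prob <- as.numeric(cumsum(counts)) / N

# Compare with R's ecdf evaluated at the same distinct prices
Fn   <- ecdf(price)
same <- all.equal(cum.prob, Fn(sort(unique(price))))
print(same)
```

The cumulative probabilities built by hand should agree with the values `ecdf()` reports, and the last one is 1 because every score is less than or equal to the maximum.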

You might prefer to use `ggplot()` to visualize the ecdf. Because you base the plot on a vector (`Cars93$Price`), the data source is `NULL`:

`ggplot(NULL, aes(x=Cars93$Price))`

In keeping with the step-by-step nature of this function, the plot consists of steps, and the `geom` function is `geom_step`. The statistic that locates each step on the plot is the ecdf, so that’s

`geom_step(stat="ecdf")`

Then label the axes:

`labs(x= "Price X $1,000",y = "Fn(Price)")`

Putting those three lines of code together

`ggplot(NULL, aes(x=Cars93$Price)) +`

`geom_step(stat="ecdf") +`

`labs(x= "Price X $1,000",y = "Fn(Price)")`

gives you this figure:
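As an alternative sketch, the same plot can be built with `Cars93` as the data source and `Price` mapped by name, using `stat_ecdf()` from `ggplot2`. This assumes both `MASS` (for `Cars93`) and `ggplot2` are installed; the object name `p` is illustrative.

```r
library(MASS)      # provides Cars93
library(ggplot2)

# Cars93 is the data source, so Price can be mapped by column name
p <- ggplot(Cars93, aes(x = Price)) +
  stat_ecdf(geom = "step") +
  labs(x = "Price X $1,000", y = "Fn(Price)")

print(p)
```

Naming the data frame keeps the aesthetic mapping short, and `stat_ecdf(geom = "step")` draws the same kind of layer as `geom_step(stat = "ecdf")`.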

To put a little pizzazz in the graph, add a dashed vertical line at each quartile. Before adding the `geom` function for a vertical line, put the quartile information in a vector:

`price.q <- quantile(Cars93$Price)`

And now

`geom_vline(aes(xintercept=price.q),linetype = "dashed")`

adds the vertical lines. The aesthetic mapping sets the x-intercept of each line at a quartile value.
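It helps to look at what `quantile()` actually returns. With its default settings it produces five named values (the minimum, the three quartiles, and the maximum), so the plot gets a line at each of the five. The sketch below assumes `Cars93` from `MASS`; `inner.q` is a hypothetical name for a vector holding only the three quartiles.

```r
library(MASS)  # provides Cars93

price.q <- quantile(Cars93$Price)

# Five named values: 0% (minimum), 25%, 50% (median), 75%, 100% (maximum)
print(price.q)

# Where each of those values falls on the ecdf
Fn <- ecdf(Cars93$Price)
print(Fn(price.q))

# To draw lines at the three quartiles only, subset by name
inner.q <- price.q[c("25%", "50%", "75%")]
print(inner.q)
```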

So these lines of code

`ggplot(NULL, aes(x=Cars93$Price)) +`

`geom_step(stat="ecdf") +`

`labs(x= "Price X $1,000",y = "Fn(Price)") +`

`geom_vline(aes(xintercept=price.q),linetype = "dashed")`

result in the following figure.

A nice finishing touch is to put the quartile values on the x-axis. The function `scale_x_continuous()` gets that done. It uses one argument called `breaks` (which sets the location of values to put on the axis) and another called `labels` (which puts the values on those locations). Here’s where that `price.q` vector comes in handy:

`scale_x_continuous(breaks = price.q,labels = price.q)`

And here’s the R code that creates the following figure:

`ggplot(NULL, aes(x=Cars93$Price)) +`

`geom_step(stat="ecdf") +`

`labs(x= "Price X $1,000",y = "Fn(Price)") +`

`geom_vline(aes(xintercept=price.q),linetype = "dashed")+`

`scale_x_continuous(breaks = price.q,labels = price.q)`
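For reference, here is one way the whole example might look as a self-contained script. It is a variation on the code above, not a replacement for it: it assumes `MASS` and `ggplot2` are installed, uses `Cars93` as the data source, and passes the quartile values to `geom_vline()` as a fixed `xintercept` rather than through `aes()`. The names `q.values` and `p` are illustrative.

```r
library(MASS)      # provides Cars93
library(ggplot2)

price.q <- quantile(Cars93$Price)

# Drop the "0%", "25%", ... names so only the numbers are used on the axis
q.values <- unname(price.q)

p <- ggplot(Cars93, aes(x = Price)) +
  geom_step(stat = "ecdf") +
  labs(x = "Price X $1,000", y = "Fn(Price)") +
  # xintercept is given as a fixed value here, not as an aesthetic mapping
  geom_vline(xintercept = q.values, linetype = "dashed") +
  scale_x_continuous(breaks = q.values, labels = q.values)

print(p)
```

Passing `xintercept` directly keeps the five quartile values separate from the 93 prices that drive the ecdf layer.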