How to Track Data Correlations in R - dummies

By Andrie de Vries, Joris Meys

Statisticians love it when they can link one variable to another, and R can help you find such relationships. Sunlight, for example, is detrimental to skirts: The longer the sun shines, the shorter skirts become. Thus, the number of hours of sunshine correlates with skirt length.

Obviously, there isn’t really a direct causal relationship here — you won’t find short skirts during the summer in polar regions. But, in many cases, the search for causal relationships starts with looking at correlations.

To illustrate this, take a look at the famous iris dataset in R. One of the greatest statisticians of all time, Sir Ronald Fisher, used this dataset to illustrate how multiple measurements can be used to discriminate between different species. This dataset contains five variables, as you can see by using the names() function:

> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length"
[4] "Petal.Width" "Species"

It contains measurements of flower characteristics for three species of iris and from 50 flowers for each species. Two variables describe the sepals (Sepal.Length and Sepal.Width), two other variables describe the petals (Petal.Length and Petal.Width), and the last variable (Species) is a factor indicating from which species the flower comes.
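If you want to check that description yourself, a quick sketch along these lines should do it. The object name species_counts is just an illustrative choice, not something from the dataset or the original text.

```r
# Show the type of each of the five variables
str(iris)

# Count how many flowers were measured for each species
species_counts <- table(iris$Species)
print(species_counts)

# Number of rows (flowers) and columns (variables)
dim(iris)
```

You should see four numeric variables and one factor, with the same number of flowers for each of the three species.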

Although looks can be deceiving, you want to eyeball your data before digging deeper into it. To plot a grid of scatterplots for all combinations of two variables in your dataset, you can simply use the plot() function on your data frame, like this:

> plot(iris[-5])

Because scatterplots are useful only for continuous variables, you can drop all variables that are not continuous. Too many variables in the plot matrix make the individual plots difficult to see. In the previous code, the negative index [-5] drops the fifth variable, Species, because that’s a factor.
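Dropping a column by position works fine for iris, but it relies on you knowing where the factor sits. As an alternative sketch, you can let R find the numeric columns for you. The names is_num and iris_num are made up for this example, and the call to pairs() assumes you are happy to call the scatterplot-matrix function directly rather than going through plot().

```r
# Flag the numeric columns rather than dropping Species by position
is_num <- vapply(iris, is.numeric, logical(1))

# Keep only the flagged columns
iris_num <- iris[, is_num]

# Draw the grid of scatterplots for the remaining variables
pairs(iris_num)
```

For the iris data this keeps the same four measurement columns as iris[-5], so the picture should look the same.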

The result of this simple line of code is a four-by-four grid of scatterplots. The variable names appear in the squares on the diagonal, indicating which variables are plotted along the x-axis and the y-axis. For example, the second plot on the third row has Sepal.Width on the x-axis and Petal.Length on the y-axis.
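To convince yourself of how the grid is read, you can redraw a single panel on its own. The following is only a sketch: vars, x_var, and y_var are illustrative names, and it assumes the column of a panel gives its x-axis and the row gives its y-axis, as described above.

```r
# The four variables in the grid, in plotting order
vars <- names(iris)[-5]

# Second column gives the x-axis, third row gives the y-axis
x_var <- vars[2]   # "Sepal.Width"
y_var <- vars[3]   # "Petal.Length"

# Redraw that one panel as a standalone scatterplot
plot(iris[[x_var]], iris[[y_var]], xlab = x_var, ylab = y_var)
```

Swapping the two indices gives you the mirror-image panel from the second row, third column.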