How to Quantify the Strength of a Relationship with Analytics

By Jeff Sauro

You can numerically quantify the strength of an association by using the Pearson Product Moment Correlation. It’s often just called the correlation coefficient and is represented by the symbol r. The correlation quantifies the association between two continuous variables (such as revenue, time, or rating scales).

The correlation coefficient varies from an r of –1, which indicates a perfect negative correlation, to 1, which indicates a perfect positive correlation. The figure shows three scatterplots: a perfect negative correlation (r = –1), no relationship (r = 0), and a perfect positive correlation (r = 1).

image0.jpg

Using two perfectly correlated variables isn’t helpful. They’re redundant; if you have the value for one variable, you can perfectly predict the other.

In practice, correlations fall somewhere between these extremes and range from weak to strong. Some examples of correlations of different strengths include:

  • Height and Weight: r = .8

  • Scholastic Aptitude Test (SAT) and First-Year College Grades: r = .5

  • Usability and Customer Loyalty: r = .7

The correlation between variables means that one variable can predict the value of the other variable:

  • If you know a customer’s height, you can estimate his weight.

  • If you know a customer’s weight, you can estimate his height.

But because these aren’t perfect correlations, the predictions contain error: the further a correlation is from 1 or –1, the more error you have in predicting one variable from the other.

Computing a correlation

You can compute the correlation coefficient by hand, or use software like Excel to compute it for you.

To compute a correlation on a set of data using the Pearson Correlation formula, follow these steps. (The following figure shows the data being used in this example.)

image1.jpg

  1. Set up the data in rows and columns in Excel.

    Have one column for the customers’ IDs and one column for each variable. Each row should hold one customer’s data on both variables. The next figure shows 17 customers’ time to make the purchase and the number of taps needed for the purchase.

    image2.jpg

  2. In any cell, type

    =PEARSON(

  3. Select all the values for the first variable.

    The data for time appears in column B, from cell B2 to cell B182.

  4. Type a comma (,) and select all the values for the second variable.

    The data for taps appears in column C, from cell C2 to cell C182.

    Be sure to select the same number of values for both variables.

  5. Close the parenthesis and then press Enter to get the correlation.

    =PEARSON(B2:B182,C2:C182)

    The correlation for this data, between taps and time, is .560666. There’s a positive correlation between time and taps.
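If you'd rather see what =PEARSON is doing behind the scenes, the following Python sketch works through the standard Pearson formula by hand. The function name pearson and the six rows of time and taps values are made up for illustration; they aren't the data from the figure, so the result won't match .560666.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    # Same rule as in Excel: both variables need the same number of values.
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative values, one entry per customer (not the figure's data).
time_seconds = [30, 45, 50, 62, 75, 80]
taps = [4, 6, 5, 9, 8, 11]

r = pearson(time_seconds, taps)
print(round(r, 3))
```

Each list plays the role of one spreadsheet column, and the length check mirrors the reminder to select the same number of values for both variables.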

Interpreting the strength of a correlation

Once you compute a correlation, you need to interpret the strength of the relationship. The correlation between taps and time is r = .56. Is that a strong correlation? It depends.

The strength of a correlation is context dependent. A “strong” correlation in one context may be a weak correlation in another. It depends on how much error you can tolerate and the consequences of being wrong in your predictions.

Predicting time from taps probably won’t involve a loss of life or money, so a correlation of r = .56 is strong enough to be useful. In fact, it’s about as strong as the association between the SAT and first-year college grades, where there’s a lot at stake!

While correlations are context dependent, it can help to have some guidance on what you’ll likely see with customer analytics data. The researcher Jacob Cohen examined correlations in the behavioral sciences, a field similar to measuring customer behavior, and provided the following guidelines based on how commonly correlations of each size were reported in the peer-reviewed literature:

  • Small: r = .10

  • Medium: r = .30

  • Large: r = .50

Therefore, one simple interpretation of the correlation of r = .56 between taps and time is that it’s large. But there is another way of interpreting the correlation coefficient.
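If you have many correlations to label, Cohen's benchmarks are easy to turn into a small helper. This Python sketch rests on two assumptions that go beyond the guidelines themselves: it treats each benchmark as a lower cutoff, and it uses a made-up label, "below small", for anything under .10. The function name cohen_label is also hypothetical.

```python
def cohen_label(r):
    """Label a correlation using Cohen's small/medium/large benchmarks.

    Assumption: each benchmark is treated as a lower cutoff.
    """
    size = abs(r)  # strength ignores the sign of the correlation
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "below small"  # hypothetical label, not one of Cohen's


print(cohen_label(0.56))  # large
```

Taking the absolute value means a correlation of –.56 gets the same label as .56, because the sign describes direction rather than strength.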

Coefficient of determination r²

Multiplying the correlation coefficient by itself (squaring it) produces a metric known as the coefficient of determination. It’s represented as r² (pronounced r-squared) and provides a better way of interpreting the strength of a relationship.

For example, a correlation of r = .5 squared becomes .25. Note that r² is often expressed as a percentage, in this case 25%. For the correlation between taps and time (r = .56), r² is 31%. That means taps can explain 31% of the variation in time. And conversely, time explains 31% of the variation in taps. As you can see, even a large correlation above r = .5 still explains only a minority of the variation in the other variable.

Height, for example, explains around 64% of the variation in weight (r = .8 squared is .64). That means that knowing people’s heights will explain most, but not all, of why they are a certain weight. Other factors explain the remaining 36% of the variation. Those factors include things like exercise, eating habits, or genetic factors that make some people weigh more than others of the same height.
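The squaring is simple enough to check for yourself. This short Python sketch reproduces the r-squared percentages for the correlations mentioned in this article; the function name r_squared_percent is made up for illustration.

```python
def r_squared_percent(r):
    """Square a correlation and express it as a whole-number percentage."""
    return round(r * r * 100)


# Correlations quoted in the article.
examples = [
    ("taps and time", 0.56),
    ("height and weight", 0.8),
    ("SAT and first-year college grades", 0.5),
]

for name, r in examples:
    explained = r_squared_percent(r)
    print(f"{name}: r = {r}, explained = {explained}%, unexplained = {100 - explained}%")
```

The unexplained share is just what's left over after subtracting the explained percentage from 100%.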

Use this same approach when correlating customer analytics. Find the correlation, square it, and then interpret the r-squared value. When stakes are high, you want to have high correlations and explain most of the variation between variables. With customer analytics, there are usually multiple variables that predict another variable.

Correlation is not causation

One of the most important concepts about correlation that you will hear repeated, because it’s worth repeating, is that correlation is not causation. Just because one variable is correlated with another doesn’t mean that one variable causes the other. Time doesn’t cause taps. SAT scores don’t cause higher grades. Net Promoter Scores don’t cause higher revenue.

You can say there is an association, but that association doesn’t imply causation.

It could be that a new design causes higher website conversion rates or it could be that a coupon increases same-store sales. However, there could be other variables that are actually affecting the outcome variable.

For example, it could be that same-store sales were already increasing because of an increase in customers. Or it could be that more customers are converting on a website (making a purchase) because a competitor’s website sold out of the same product, not because of your website design change. Always consider what other variables might be affecting the relationship when making statements about causation.
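A small simulation can make the point concrete. In this Python sketch, everything is hypothetical: the variable names, the 1,000 simulated stores, and the noise levels are assumptions chosen only for illustration. Customer traffic drives both coupon redemptions and sales, and coupons never feed into sales, yet the two still end up correlated.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical simulated stores: customer traffic drives BOTH outcomes.
traffic = [rng.gauss(0, 1) for _ in range(1000)]
coupons_redeemed = [t + rng.gauss(0, 0.5) for t in traffic]
sales = [t + rng.gauss(0, 0.5) for t in traffic]

# Coupons never enter the formula for sales, yet the two still correlate.
r = pearson(coupons_redeemed, sales)
print(round(r, 2))
```

If you looked only at coupons and sales, the correlation might tempt you to credit the coupon. The simulation shows how a third variable can produce that same pattern on its own.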