Predict Customer Values with the Regression Line
While a correlation speaks to the strength of a relationship between two variables, and r² tells you how much of the variation in one variable is explained by the other, predicting one variable from another requires an extension of correlation called regression analysis. Regression analysis is known as a “workhorse” in predictive analytics. The math isn’t too complicated, and most software packages support regression analysis.
Regression analysis extends the idea of the scatterplot used in correlation and adds a line that best “fits” the data.
One of the requirements of using correlations and regression analysis is that the relationship between the variables is linear. Linear means a line can reasonably describe the relationship between variables and then be used to predict values that don’t appear in your data (future customer data points). If the scatterplot of your data forms a curve, or any shape that a line doesn’t fit well, you may get misleading results.
While there are many ways to draw a line through the data, least squares analysis is a mathematical method that minimizes the sum of the squared vertical distances between the line and each dot in the scatterplot. This analysis can be done by hand or by using software such as Minitab, SPSS, SAS, R, or Excel.
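As a rough sketch of what the software does, the least squares slope and intercept can be computed by hand in a few lines of Python. The taps and seconds values below are invented purely for illustration (they are not the data behind the figure), and the function name least_squares is a hypothetical helper, not part of any package.

```python
# Hypothetical (taps, seconds) observations, invented for illustration only.
taps = [10, 15, 20, 25, 30, 35, 40]
seconds = [135, 150, 180, 195, 225, 240, 270]


def least_squares(x, y):
    """Return (b0, b1): the intercept and slope of the least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the least squares line always passes through (mean_x, mean_y).
    b0 = mean_y - b1 * mean_x
    return b0, b1


b0, b1 = least_squares(taps, seconds)
print(f"Time = {b0:.2f} + {b1:.3f} Taps")
```

With real customer data you would normally let a statistics package do this step; the sketch only shows that the fitted line comes from two simple formulas.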
The figure shows a least squares regression line.
The software gives you the equation of the regression line above the graph:
Time = 86.57 + 4.486 Taps
The regression equation takes the general form of
ŷ = b0 + b1X + e
Here’s an explanation of each part of the equation:
ŷ (pronounced y-hat): This is the predicted value of the dependent variable: predicted time.
b0: Called the y-intercept, this is where the line would cross (or intercept) the y-axis.
b1: This is the slope of the predicted line (how steep it is).
X: This represents a particular value of the independent variable: taps.
e: This represents the inevitable error the prediction will contain.
So in this example, the regression equation indicates that the predicted amount of time it takes a customer to make a purchase is equal to 86.57 (the y-intercept) plus 4.486 (the slope) multiplied by the number of taps (X).
It’s the regression formula that allows you to predict customer values that don’t exist in your data. It allows you to perform “what-if” analyses on future customer values. This is the “predictive” part of predictive customer analytics.
For example, using the regression equation from the preceding example, you can predict how long a customer takes to make a purchase with 38 taps. You just substitute 38 for Taps in the regression equation:
Time = 86.57 + 4.486(38)
Time = 86.57 + 170.47 = 257.04
The predicted time for a purchase that requires 38 taps is about 257 seconds, or a bit longer than four minutes.
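The same substitution can be wrapped in a small function for “what-if” analyses. This is a minimal sketch that hard-codes the intercept and slope from the equation above; the function name predict_time and the extra tap counts (10 and 60) are assumptions chosen for illustration, and predictions far outside the range of the original data should be treated with caution.

```python
B0 = 86.57   # y-intercept from the regression equation above (seconds)
B1 = 4.486   # slope from the regression equation above (seconds per tap)


def predict_time(taps):
    """Predicted purchase time in seconds for a given number of taps."""
    return B0 + B1 * taps


# 38 taps is the worked example; 10 and 60 are hypothetical what-if values.
for n_taps in (10, 38, 60):
    print(f"{n_taps} taps -> {predict_time(n_taps):.2f} seconds")
```

For 38 taps this prints 257.04 seconds, matching the hand calculation.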
The dependent variable is denoted “Y” and is displayed on the y (vertical) axis. The independent variable is called X and is displayed on the horizontal (x) axis.
Instead of predicting a customer’s task time from taps, this same approach can be used to predict other customer analytics, including:
Customer revenue from advertising revenue
Likelihood to recommend from usability data
Number of conversions from website page views