Customer Analytics and Training and Validation Periods

By Jeff Sauro

A sophisticated and often essential approach to time series analysis involves partitioning your customer data into training and validation periods. In the training period, you build a regression equation on the earliest section of data (approximately two-thirds to three-fourths of your data).

You then apply the regression equation to the later part of your data, the validation period, to see how well the earlier data actually predicts the later data.

With the subscriber data, you could use the first 20 months (January 2012 through August 2013) as the training period and September 2013 through February 2014 as the validation period. This approach tests the equation using data you already have, which is as close as you can get to gauging how well a prediction will perform when new data comes in.
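If you work in code rather than a spreadsheet, the split itself is just slicing. Here is a minimal Python sketch; because the raw subscriber counts aren't reproduced in this excerpt, the values below are generated from the regression equation given next, purely as stand-ins.

import math

months = list(range(1, 27))  # January 2012 = month 1 ... February 2014 = month 26
# Stand-in values generated from the article's fitted curve; use your real data here.
subscribers = [round(2033.9 * math.exp(0.0269 * m)) for m in months]

# First 20 months for training, last 6 for validation.
train_months, valid_months = months[:20], months[20:]
train_subs, valid_subs = subscribers[:20], subscribers[20:]
print(len(train_subs), len(valid_subs))  # 20 6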

The regression equation for the first 20 months is:

Subscribers = 2033.9e^(0.0269x)

where x is the month number (January 2012 = 1).
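One common way to arrive at an equation of this form is to regress the natural log of subscribers on the month number and back-transform the intercept. Here is a hedged sketch using numpy; the noisy series stands in for the real data, and fed the actual subscriber counts, the fit should land near the published coefficients.

import math
import numpy as np

train_months = np.arange(1, 21)  # training period: months 1 through 20
rng = np.random.default_rng(0)
# Stand-in series: the published curve plus small noise, for illustration only.
y = 2033.9 * np.exp(0.0269 * train_months) * (1 + rng.normal(0, 0.002, 20))

# Fit ln(y) = ln(a) + b*x by ordinary least squares, then back-transform a.
b, ln_a = np.polyfit(train_months, np.log(y), 1)
a = math.exp(ln_a)
print(f"Subscribers = {a:.1f}e^({b:.4f}x)")  # close to 2033.9e^(0.0269x)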

The r² = 0.9979, which indicates an excellent fit for the exponential line. You can then use this regression equation to see how well it predicts the final six months of the dataset, months 21 through 26. The figure shows the predicted and actual values for September 2013 through February 2014, labeled Validation in the Period column.

[Figure: Predicted and actual subscriber values for the validation period.]

To assess how good this prediction actually is, two additional columns were created. The first is the raw error: the difference between the actual value and the predicted value. For example, in September 2013, the prediction was short by 5 subscribers; in February 2014, it was short by 28. Raw error like this can be meaningful on its own if you're familiar with the customer data you're working with.

When communicating how much error your predicted values have, it’s often easier to speak in terms of percentage error.

The Mean Absolute Percentage Error (MAPE) can be a bit more understandable to stakeholders. For each value, you take the absolute value of the difference between the actual and predicted values and divide it by the actual value; that gives the absolute percentage error (APE). The APEs are then averaged across all values.
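In code, MAPE is only a few lines. A minimal Python sketch following the definition above (each absolute error divided by the actual value, then averaged):

def mape(actual, predicted):
    # Mean of |actual - predicted| / actual across all points.
    apes = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(apes) / len(apes)

# Single-point check using the January 2013 example discussed below:
print(mape([2844], [2885]))  # about 0.014, or 1.4%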

[Figure: Monthly predictions with raw error and absolute percentage error (APE) columns.]

The APE column shows the absolute percentage error. For example, for January 2013, the regression equation predicted 2,885 subscribers; the actual number of subscribers was 2,844, meaning the equation overpredicted by 41 subscribers.

Applying the Excel formula for the absolute percentage error (APE) generates an error of 1.4%:

=ABS(2885-2844)/2844 = .014 or 1.4%

The MAPE for the training period is .589%. The MAPE for the validation period is .870%, which is a bit higher, but both are still under 1%.

Finally, extending x to months 27, 28, and 29 gives predictions of 4,205, 4,320, and 4,437 subscribers for March, April, and May 2014, respectively:

= EXP(0.0269 * 27) * 2033.9 = 4205

= EXP(0.0269 * 28) * 2033.9 = 4320

= EXP(0.0269 * 29) * 2033.9 = 4437
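The same three forecasts in Python, reusing the coefficients from the training-period fit:

import math

a, b = 2033.9, 0.0269  # coefficients from the training-period fit
for x, month in zip((27, 28, 29), ("March 2014", "April 2014", "May 2014")):
    print(month, round(a * math.exp(b * x)))  # 4205, 4320, 4437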

A number of more sophisticated techniques can produce more accurate models by taking seasonality and autocorrelation into account and by smoothing the data to better reveal patterns. Software packages such as JMP and Minitab have these features built in.

Predicting the future is always risky because you're assuming the future will follow patterns similar to the past. In most cases it does, and historical data can be an excellent predictor of customer behavior. However, unpredictable events (outraged customers on social media, a terrorist attack, or a recession) can substantially affect the accuracy of your predictions. Treat predictions as a guide, not an absolute.