How to Outline Testing and Test Data for Predictive Analytics

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

When your data is ready and you’re about to start building your predictive model for analysis, it’s useful to outline your testing methodology and draft a test plan. Testing should be driven by the business goals you’ve gathered and documented, and for which you’ve collected all the necessary data.

Right off the bat, you should devise a method to test whether a business goal has been attained successfully. Predictive analytics measures the likelihood of a future outcome, and the only way to prepare a model for that task is to train it on past data; even so, you still have to see what the model can do when it’s up against future data.

Of course, you can’t risk running an untried model on real future data, so you’ll need to use existing data to simulate future data realistically. To do so, you have to split the data you’re working with into training and test datasets.

Be sure to select these two datasets at random, and confirm that both cover the full range of the data parameters you’re measuring.
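
Here’s a minimal sketch of such a random split, assuming you’re using scikit-learn (the book doesn’t prescribe a library) and a synthetic dataset generated purely for illustration. The `stratify` option keeps the class proportions the same in both datasets, which helps both cover the parameters you’re measuring:

```python
# A sketch of a random train/test split with scikit-learn (an assumption;
# any tool that shuffles and partitions your data works the same way).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and outcome labels y, generated here
# only so the example is self-contained.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Shuffle and split at random; stratify=y keeps the outcome classes
# represented in the same proportions in both datasets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(len(X_train), len(X_test))  # roughly 700 and 300 records
```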

When you split your data into training and test datasets, you guard against overfitting: if you train the model on the entire dataset, it can pick up noise patterns and features that belong only to that particular sample and don’t apply to other data.
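
One way to see overfitting in practice (a sketch under the same scikit-learn assumption, not the authors’ method) is to train a deliberately flexible model, such as an unconstrained decision tree, and compare its score on the training data with its score on the held-out test data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, illustrative data again, so the snippet runs on its own.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained tree can memorize its training sample.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A near-perfect training score paired with a noticeably lower test
# score is the classic signature of a model that has learned the noise.
print("training accuracy:", model.score(X_train, y_train))
print("test accuracy:    ", model.score(X_test, y_test))
```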

Separating your data into training and test datasets, roughly 70 percent and 30 percent respectively, gives you a realistic measurement of the performance of the predictive model you’re building. Evaluate your model against the test data: because the model has never seen those records, this is a straightforward way to measure whether its predictions are accurate.
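
Putting the pieces together, here’s a sketch of the full workflow: a 70/30 split, a model fit only on the training portion, and an accuracy measurement on the untouched test portion. The classifier choice (logistic regression) and the synthetic data are illustrative assumptions, not part of the original text:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

# 70 percent for training, 30 percent held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Fit on training data only; the test set stays untouched.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The test set stands in for future data the model has never seen,
# so its accuracy here approximates real-world performance.
print("accuracy on test data:", accuracy_score(y_test, model.predict(X_test)))
```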