How to Address Problems in Predictive Analytics

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

Predictive modeling is gaining popularity as a tool for managing many aspects of business. Ensuring that data analysis is done right will boost confidence in the models employed — which, in turn, can generate the needed buy-in for predictive analytics to become part of your organization’s standard toolkit.

Perhaps this increased popularity comes from the ways in which a predictive analytics project can support decision-making by creating models that describe datasets, discover possible new patterns and trends (as indicated by the data), and predict outcomes with greater reliability.

To accomplish this goal, a predictive analytics project must deliver a model that best fits the data by selecting the decision variables correctly and efficiently. Some vital questions must be answered en route to that goal:

  • What are the minimum assumptions and decision variables that enable the model to best fit the data?

  • How does the model under construction compare to other applicable models? (One way to compare candidates is sketched after this list.)

  • What criteria are best for evaluating and scoring this model?
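
One practical way to work through these questions is to score a few candidate models against the same data under a single criterion and compare the results. The following is a minimal sketch using scikit-learn; the synthetic dataset, the two candidate models, and the choice of cross-validated R² as the scoring criterion are illustrative assumptions, not a prescription.

```python
# Minimal sketch: compare candidate models on the same data with one criterion.
# The dataset, the candidate models, and the R^2 metric are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

for name, model in candidates.items():
    # Five-fold cross-validated R^2: one consistent criterion for every candidate.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```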

Once again, you can call the voice of experience to the rescue: Domain knowledge experts can discuss these questions, interpret any results that show hidden patterns in the data, and help verify and validate the model’s output.

How to describe the limitations of the predictive analytics model

Any predictive analytics model has certain limitations based on the algorithms it employs and the dataset it runs on. You should be aware of those limitations and make them work to your advantage; the limitations related to the algorithms include

  • Whether the data has nonlinear patterns (does not form a line)

  • How highly correlated the variables are (statistical relationships between features; a quick correlation check is sketched after this list)

  • Whether the variables are independent (no relationships between features)

  • Whether the scope of the sample data makes the model prone to overfitting
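
A concrete way to check the correlation and independence concerns above is to inspect a correlation matrix before you build the model. Here is a minimal sketch with pandas; the synthetic columns and the 0.8 cutoff are illustrative assumptions.

```python
# Minimal sketch: flag highly correlated feature pairs before modeling.
# The DataFrame columns and the 0.8 cutoff are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ad_spend": rng.normal(size=200),
    "web_visits": rng.normal(size=200),
})
# Deliberately correlated column, so the check below has something to find.
df["store_visits"] = df["web_visits"] * 0.9 + rng.normal(scale=0.1, size=200)

corr = df.corr().abs()
# Keep only the upper triangle so each feature pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.8])  # pairs whose absolute correlation exceeds the cutoff
```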

To overcome the limitations of your model, use sound cross-validation techniques to test it. Start by dividing your data into training and test datasets: train the model on the training data, then evaluate and score its predictions on the held-out test data.
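
As a minimal sketch of that split, the example below trains on one portion of the data and scores on a held-out test set; the synthetic dataset, the linear model, and the 80/20 split are illustrative assumptions. A large gap between training and test error is one warning sign of overfitting.

```python
# Minimal sketch: hold out a test set and score the model on data it has never seen.
# The dataset, model, and 80/20 split are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = LinearRegression().fit(X_train, y_train)

# A training error far lower than the test error suggests overfitting.
print("train MAE:", mean_absolute_error(y_train, model.predict(X_train)))
print("test MAE :", mean_absolute_error(y_test, model.predict(X_test)))
```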

How to test and evaluate your predictive analytics model

No model can produce 100-percent accurate forecasts; every model will produce some inaccurate results. Be on the lookout for any significant variation between the forecasts your model produces and the observed data, especially if the model’s outputs contradict common sense. If a result looks too good, too bad, or too extreme to be true, then it probably isn’t true (to reality, anyway).

In the evaluation process, thoroughly examine the outputs of the models you’re testing and compare them to the input variables. Your model’s forecasts should address all the stated business goals that drove its creation in the first place.
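
To make that comparison concrete, you can quantify the gap between forecasts and observed values and flag the cases that deviate the most for a common-sense review. The sketch below is illustrative only; the observed and forecast values and the two-standard-deviation cutoff are assumptions.

```python
# Minimal sketch: compare forecasts against observed values and flag large deviations.
# The observed/forecast arrays and the two-standard-deviation cutoff are illustrative.
import numpy as np

observed = np.array([120.0, 135.0, 128.0, 150.0, 300.0, 142.0])
forecast = np.array([118.0, 131.0, 130.0, 147.0, 155.0, 140.0])

errors = observed - forecast
print("MAE :", np.mean(np.abs(errors)))
print("RMSE:", np.sqrt(np.mean(errors ** 2)))

# Flag forecasts that deviate by more than two standard deviations of the error;
# these are candidates for a common-sense review with business users.
threshold = 2 * errors.std()
outliers = np.where(np.abs(errors - errors.mean()) > threshold)[0]
print("suspicious cases at indices:", outliers)
```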

If errors or biases crop up in your model’s output, try tracing them back to

  • The validity, reliability, and relative seasonality of the data (a simple seasonality check on the residuals is sketched after this list)

  • Assumptions used in the model

  • Variables that were included or excluded in the analysis
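
For the seasonality point above, one simple diagnostic is to group the model’s residuals by period and look for systematic bias. The sketch below is a minimal illustration with pandas; the column names, the naive forecast, and the monthly grouping are assumptions.

```python
# Minimal sketch: group residuals by month to spot seasonal bias in the errors.
# The columns (date, observed, forecast) and the naive forecast are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2023-01-01", periods=365, freq="D")
day_of_year = dates.dayofyear.to_numpy()

observed = 100 + 10 * np.sin(2 * np.pi * day_of_year / 365) + rng.normal(scale=2, size=365)
forecast = np.full(365, 100.0)  # a naive model that ignores the seasonal swing

df = pd.DataFrame({"date": dates, "observed": observed, "forecast": forecast})
df["residual"] = df["observed"] - df["forecast"]

# A mean residual far from zero in certain months suggests the model misses seasonality.
print(df.groupby(df["date"].dt.month)["residual"].mean())
```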

Work with business users to evaluate every step of your model’s process; make sure that the model outputs can be easily interpreted and used in a real-world business situation. Balance the accuracy and reliability of the model with how easily the model’s outputs can be interpreted and put to practical use.

How to avoid non-scalable predictive analytics models

When you’re building a model, always keep scalability in mind. Always check the performance, accuracy, and reliability of the model at various scales. Your model should be able to change its scale — and scale up as big as necessary — without falling apart or outputting bad predictions.
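
A straightforward way to do that is to train and score the same model on progressively larger samples and watch how runtime and accuracy behave. The sketch below is illustrative; the model, the sample sizes, and the synthetic data are assumptions.

```python
# Minimal sketch: measure training time and accuracy as the data volume grows.
# The model, sample sizes, and synthetic dataset are illustrative assumptions.
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

for n in (1_000, 10_000, 100_000):
    X, y = make_classification(n_samples=n, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    start = time.perf_counter()
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    elapsed = time.perf_counter() - start

    # Watch for runtime growing faster than the data, or accuracy degrading at scale.
    print(f"n={n}: fit took {elapsed:.2f}s, test accuracy {model.score(X_test, y_test):.3f}")
```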

Scalability was quite a challenge in the past. Predictive models took a long time to build and to run. The datasets the models ran on were small, and the data was expensive to collect, store, and search. But that was all in the “pre-big data” era.

Today big data is cheap, plentiful, and growing. In fact, another potential problem looms: The formidable data volume currently available may negatively affect the model and degrade its performance, rendering the model outdated in a relatively short period of time. Properly implemented, scalability can help “future-proof” your model.

The future isn’t the only threat. Even in the present online era, streamed data can overwhelm a model — especially if the streams of data increase to a flood.

Data volume alone can cause the number of decision variables and predictive factors to grow so large that the model requires continuous updating. So yes, your model had better be scalable, and rapidly scalable at that.
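
One common way to cope with that kind of growth, assuming your modeling library supports it, is to update the model incrementally on each new batch of data instead of retraining it from scratch. The sketch below uses scikit-learn’s SGDRegressor as an illustration; the simulated stream and the batch size are assumptions.

```python
# Minimal sketch: update a model incrementally as new batches of data stream in.
# The simulated stream, batch size, and true coefficients are illustrative assumptions.
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(random_state=0)
rng = np.random.default_rng(0)

for batch in range(10):  # stand-in for batches arriving from a data stream
    X_batch = rng.normal(size=(500, 5))
    y_batch = X_batch @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=500)

    # partial_fit updates the existing coefficients rather than refitting from scratch.
    model.partial_fit(X_batch, y_batch)

print("coefficients after streaming:", np.round(model.coef_, 2))
```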