How to Decide Whether to Keep Outliers in Predictive Analytics

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

Deciding to include outliers in the analysis — or to exclude them — will have implications for your predictive analytics model. Keeping outliers as part of the data in your analysis may lead to a model that’s not applicable — either to the outliers or to the rest of the data.

If you decide to keep an outlier, you’ll need to choose techniques and statistical methods that handle outliers well, so that a few extreme values don’t distort the analysis. One such technique is to apply a mathematical function, such as the natural logarithm or the square root, to reduce the gap between the outliers and the rest of the data.

These functions, however, work only for a limited range of numerical data: the logarithm requires values greater than zero, and the square root requires values that are not negative. Other issues may arise as well. For example, a relationship between variables in the transformed data has to be interpreted differently from the same relationship in the original data. After a log transformation, for instance, a fitted relationship describes proportional changes rather than absolute ones.
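As a rough illustration of how the logarithm and square root narrow the gap between an extreme value and the rest of the data, here is a minimal sketch in Python. The sample values, the function names, and the "largest value divided by the median" measure are all assumptions made for this example; they are not taken from the article.

```python
import math
import statistics

def log_transform(data):
    """Natural-log transform; defined only for values greater than zero."""
    if any(v <= 0 for v in data):
        raise ValueError("log transform needs values greater than zero")
    return [math.log(v) for v in data]

def sqrt_transform(data):
    """Square-root transform; defined only for non-negative values."""
    if any(v < 0 for v in data):
        raise ValueError("square-root transform needs non-negative values")
    return [math.sqrt(v) for v in data]

def max_to_median(data):
    """Crude gap measure: how many times larger the maximum is than the median."""
    return max(data) / statistics.median(data)

# Hypothetical sample with one extreme value (400.0).
values = [12.0, 15.0, 14.0, 13.0, 16.0, 400.0]

raw_gap = max_to_median(values)
sqrt_gap = max_to_median(sqrt_transform(values))
log_gap = max_to_median(log_transform(values))

print(f"raw:  max is {raw_gap:.1f}x the median")
print(f"sqrt: max is {sqrt_gap:.1f}x the median")
print(f"log:  max is {log_gap:.1f}x the median")
```

On this sample the logarithm compresses the extreme value more strongly than the square root does. Which transformation suits a real data set depends on the data and on how the transformed values will be interpreted.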

The mere presence of outliers in your data can provide insights into your business that can be very helpful in generating a robust model. Outliers may draw attention to a valid business case that illustrates an unusual but significant event.

Looking for outliers, identifying them, and assessing their impact should be part of data analysis and preprocessing. Business domain experts can provide insight and help you decide what to do with unusual cases in your analysis. Although sometimes common sense is all you need to deal with outliers, often it’s helpful to ask someone who knows the ropes.

If you’re in a business that benefits from rare events — say, an astronomical observatory with a grant to study Earth-orbit-crossing asteroids — you’re more interested in the outliers than in the bulk of the data.

Outliers can be a great source of information. Deviating from the norm could be a signal of suspicious activity, breaking news, or an opportunistic or catastrophic event. You may need to develop models that help you identify outliers and assess the risks they signify.
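One simple way to flag candidate outliers is the interquartile-range (IQR) rule, sketched below in Python. The article does not name a specific detection method, so the IQR rule, the conventional multiplier of 1.5, the function name, and the sample data are all assumptions chosen for illustration.

```python
import statistics

def find_outliers(data, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    # method="inclusive" treats the data as the full population of interest.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < lower or v > upper]

# Hypothetical sample with one extreme value.
values = [12.0, 15.0, 14.0, 13.0, 16.0, 400.0]

flagged = find_outliers(values)
print("candidate outliers:", flagged)
```

A flagged value is only a candidate: whether it is an error, noise, or a significant event is a judgment that usually calls for domain knowledge.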

It’s prudent to conduct two analyses: one that includes outliers, and another that omits them. Then examine the differences, try to understand the implications of each method, and assess how adopting one method over the other would influence your business goals.
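The two-analysis comparison can be sketched with a straight-line fit computed once on all points and once with the outlier left out. This is a minimal Python sketch under stated assumptions: the data, the `fit_line` helper, and the index of the flagged point are hypothetical, and ordinary least squares stands in for whatever model you actually use.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data; the last point is the suspected outlier.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 40.0]
outlier_indices = {5}  # assumed to come from an earlier detection step

# Analysis 1: keep every point.
slope_with, intercept_with = fit_line(xs, ys)

# Analysis 2: omit the flagged points.
kept = [i for i in range(len(xs)) if i not in outlier_indices]
slope_without, intercept_without = fit_line(
    [xs[i] for i in kept], [ys[i] for i in kept]
)

print(f"with outliers:    slope={slope_with:.2f}, intercept={intercept_with:.2f}")
print(f"without outliers: slope={slope_without:.2f}, intercept={intercept_without:.2f}")
```

A large difference between the two fits suggests that the outlier is driving the model, which is the situation in which the choice between the two analyses matters most for your business goals.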