How to Deal with Outliers Caused by Errors in the System
When you rely on technology or instrumentation to conduct a predictive analytics task, a glitch here or there can cause these instruments to register extreme or unusual values. If sensors register observational values that fail to meet basic quality-control standards, they can produce real disruptions that are reflected in data.
Someone performing data entry, for example, can easily add an extra 0 at the end of a value by mistake, taking the entry out of range and producing an outlier.
If you’re looking at observational data collected by a water sensor installed in Baltimore Harbor, and it reports a water level 20 feet above mean sea level, you’ve got an outlier. The sensor is obviously wrong unless Baltimore is completely covered by water.
Data can end up having outliers because of external events or an error by a person or an instrument.
If a real event such as a flash crash is traced to an error in the system, its consequences are still real — but if you know the source of the problem, you may conclude that a flaw in the data, not your model, was to blame if your model didn’t predict the event.
Knowing the source of the outlier will guide your decision on how to deal with it. Outliers that were the result of data-entry errors can easily be corrected after consulting the data source. Outliers that reflect a changed reality may prompt you to change your model.
There’s no one-size-fits-all answer when you’re deciding whether to include or disregard extreme data that isn’t an error or glitch. Your response depends on the nature of the analysis you’re doing and on the type of model you’re building. In a few cases, the way to deal with those outliers is straightforward:
If you trace your outlier to a data-entry error when you consult the data source, you can easily correct the data and (probably) keep the model intact.
If that water sensor in Baltimore Harbor reports a water level 20 feet above mean sea level, and you’re in Baltimore, look out your window:
If Baltimore isn’t completely covered by water, the sensor is obviously wrong.
If you see a fish looking in at you, the reality has changed; you may have to revise your model.
The flash crash may have been a one-time event (over the short term, anyway), but its effects were real — and if you’ve studied the market over the longer term, you know that something similar may happen again. If your business is in finance and you deal with the stock market all the time, you want your model to account for such aberrations.
In general, if the outcome of an event normally considered an outlier can have a significant impact on your business, consider how to deal with such events in your analysis. Keep these general points in mind about outliers:
The smaller the dataset is, the more significant the impact outliers can have on the analysis.
As you develop your model, be sure you also develop techniques to find outliers and to systematically understand their impact on your business.
Detecting outliers can be a complex process; there is no single, simple method that identifies all of them.
A domain expert (someone who knows the field you’re modeling) is your best go-to person to verify whether a data point is valid, an outlier you can disregard, or an outlier you have to take into account. The domain expert should be able to explain what factors created the outlier, what its range of variability is, and how it affects the business.
Visualization tools can help you spot outliers in the data. Also, if you know the expected range of values, you can easily query for data that falls outside that range.
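As a rough illustration of the range query and of how dataset size changes an outlier's impact, here is a minimal Python sketch. Everything in it is an assumption made for the example: the daily order counts are invented, the valid range of 0 to 500 is arbitrary, and the helper names out_of_range and mean_shift are hypothetical. In practice the expected range should come from your domain expert.

```python
from statistics import mean


def out_of_range(values, low, high):
    """Return (index, value) pairs for values outside the closed range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


def mean_shift(values, flagged):
    """Return the mean of all values minus the mean of the unflagged values."""
    flagged_idx = {i for i, _ in flagged}
    kept = [v for i, v in enumerate(values) if i not in flagged_idx]
    return mean(values) - mean(kept)


# Hypothetical daily order counts; 1200 stands in for 120 typed with an extra 0.
small = [110, 120, 115, 1200, 125]
# The same typo inside a larger sample of otherwise ordinary values.
large = small + [118] * 95

# Assumed valid range for this made-up metric.
EXPECTED_LOW, EXPECTED_HIGH = 0, 500

flagged_small = out_of_range(small, EXPECTED_LOW, EXPECTED_HIGH)
flagged_large = out_of_range(large, EXPECTED_LOW, EXPECTED_HIGH)

print(flagged_small)  # [(3, 1200)]
print("shift in mean, 5 values:  ", mean_shift(small, flagged_small))
print("shift in mean, 100 values:", mean_shift(large, flagged_large))
```

The same single bad entry moves the average of the five-value sample far more than it moves the average of the hundred-value sample, which is the point about small datasets made above. The sketch only flags values and measures their effect; whether to correct, drop, or keep a flagged value is still a decision to make after checking the data source.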