How to Find Value in Your Predictive Analysis Data - dummies

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

Any successful journey takes serious preparation. Predictive analytics models are essentially a deep dive into large amounts of data. If the data is not well prepared, the predictive analytics model will emerge from the dive with no fish. The key to finding value in predictive analytics is to prepare the data — thoroughly and meticulously — that your model will use to make predictions.

Processing data beforehand can be a stumbling block in the predictive analytics process. Gaining experience in building predictive models, and in preparing data in particular, teaches the importance of patience. Selecting, processing, cleaning, and preparing the data is laborious; it’s the most time-consuming task in the predictive analytics lifecycle. However, proper, systematic data preparation will significantly increase the chance that your data analytics will bear fruit.

Although it takes both time and effort to build that first predictive model, once you take the first step — building the first model that finds value in your data — then future models will be less resource-intensive and time-consuming, even with completely new datasets. Even if you don’t use the same data for the next model, your data analysts will have gained valuable experience with the first model.

How to delve into your predictive analysis data

To use a fruit analogy: you have to not only remove the peel or outer cover but also dig in to reach the core, and the closer you get to the core, the better the fruit. The same rule applies to big data.


Basics of predictive analysis data validity

Data is not always valid when you first encounter it. Most data is either incomplete (missing some attributes or values) or noisy (containing outliers or errors). In the biomedical and bioinformatics fields, for example, outliers can lead the analytics to generate incorrect or misleading results.

Outliers in cancer data, for example, can be a major factor that skews the accuracy of medical treatments: Gene-expression samples may appear as false cancer positives because they were analyzed against a sample that contained errors.
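
The two problems described above, incomplete values and noisy values, can be flagged with a short check before any modeling starts. The following is a minimal sketch, not a prescribed method: it assumes a single numeric column in which missing entries are stored as None, and it uses the common 1.5 × interquartile-range rule to flag outliers. The function name and the sample readings are made up for illustration.

```python
from statistics import quantiles


def summarize_quality(values):
    """Count missing entries and list IQR-based outliers in a numeric column."""
    # Incomplete data: entries stored as None are treated as missing.
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    # Noisy data: flag values beyond 1.5 * IQR from the quartiles.
    q1, _, q3 = quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    outliers = [v for v in present if v < low or v > high]

    return {"missing": missing, "outliers": outliers}


# Made-up sample readings: two missing entries and one suspicious value.
readings = [4.1, 3.9, None, 4.3, 4.0, 42.0, 4.2, None, 3.8]
report = summarize_quality(readings)
print(report)
```

A flagged value is only a candidate for review. Whether an outlier is an error or a genuine observation is a judgment that depends on the domain.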

Inconsistent data is data that contains discrepancies among its attributes. For example, a data record may have two attributes that don’t match: say, a zip code (such as 20037, which belongs to Washington, D.C.) and a corresponding state (Delaware). Invalid data can lead to faulty predictive models, which produce misleading analytical results that in turn cause bad executive decisions.
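
A consistency check like the zip-code example can be sketched as a lookup that compares two attributes of the same record. The prefix table below is a tiny illustrative excerpt (prefixes 197 to 199 for Delaware, 200 for Washington, D.C.), and the record layout and function name are assumptions; a real check would rely on a complete reference dataset.

```python
# Illustrative excerpt only; a real check would load a full ZIP reference table.
ZIP_PREFIX_TO_STATE = {
    "197": "DE",
    "198": "DE",
    "199": "DE",
    "200": "DC",
}


def zip_matches_state(zip_code, state):
    """Return True or False for a known prefix, or None when the prefix is unknown."""
    expected = ZIP_PREFIX_TO_STATE.get(zip_code[:3])
    if expected is None:
        return None  # Cannot judge; the prefix is outside the excerpt.
    return expected == state


# Hypothetical records: the first pairs a D.C. zip code with Delaware.
records = [
    {"id": 1, "zip": "20037", "state": "DE"},
    {"id": 2, "zip": "19801", "state": "DE"},
]

inconsistent = [
    r["id"] for r in records if zip_matches_state(r["zip"], r["state"]) is False
]
print(inconsistent)
```

Returning None for unknown prefixes keeps "cannot verify" separate from "verified mismatch", so records are not rejected just because the reference table is incomplete.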

For instance, sending coupons for diapers to people who have no children is a fairly obvious mistake. But it can happen easily if the marketing department of a diaper company ends up with invalid results from their predictive analytics model.

Similarly, Gmail might not always suggest the right people when it tries to fill in the prospective customers you might have forgotten to include in a group e-mail list. Facebook, to give another example, may suggest friends who aren’t the type you’re looking for.

In such cases, it’s possible that there’s too large a margin of error in the models or algorithms. In most cases, though, the flaws and anomalies lie in the data initially selected to power the predictive model: the algorithms may be working from large chunks of invalid data.

Basics of data variety in predictive analysis

The absence of uniformity in data is another big challenge, known as data variety. From the endless stream of unstructured text data (generated through e-mails, presentations, project reports, texts, and tweets) to structured bank statements, geolocation data, and customer demographics, companies are hungry for this variety of data.

Aggregating this data and preparing it for analytics is a complex task. How can you integrate data generated from different systems such as Twitter, Google search, and a third party that tracks customer data? Well, the answer is that there is no common solution. Every situation is different, and the data scientist usually has to do a lot of maneuvering to integrate the data and prepare it for analytics.

Even so, a simple approach to standardization can support data integration from different sources: You agree with your data providers on a standard data format that your system can handle, a framework that makes all your data sources generate data that’s readable by both humans and machines. Think of it as a common language that all big-data sources speak whenever they’re in the big-data world.
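
One way to picture such a standard format is a small shared schema that every source is mapped onto before analysis. The sketch below is an assumption-laden illustration, not a format the article specifies: the field names (source, customer_id, timestamp, text), the two input layouts, and the choice of JSON with ISO 8601 timestamps in UTC are all hypothetical.

```python
import json
from datetime import datetime, timezone


def from_social_post(post):
    """Map a hypothetical social-media record onto the shared schema."""
    created = datetime.fromtimestamp(post["created"], tz=timezone.utc)
    return {
        "source": "social",
        "customer_id": post["user"],
        "timestamp": created.isoformat(),
        "text": post["body"],
    }


def from_tracker_row(row):
    """Map a hypothetical third-party tracking row onto the shared schema."""
    # Assumes the provider sends MM/DD/YYYY dates that can be read as UTC.
    seen = datetime.strptime(row["date"], "%m/%d/%Y").replace(tzinfo=timezone.utc)
    return {
        "source": "tracker",
        "customer_id": row["cust"],
        "timestamp": seen.isoformat(),
        "text": row.get("note", ""),
    }


# Two differently shaped inputs end up with identical fields.
unified = [
    from_social_post({"user": "c-17", "created": 0, "body": "Great service"}),
    from_tracker_row({"cust": "c-17", "date": "01/02/2020"}),
]
print(json.dumps(unified, indent=2))
```

JSON is used here only because it is readable by both people and programs; any agreed format with the same property would serve the purpose.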