Using Relevant Data for Predictive Analytics: Avoid “Garbage In, Garbage Out”

Predictive analytics begins with good data. More data doesn't necessarily mean better data. A successful predictive analytics project requires, first and foremost, relevant and accurate data.

Keeping it simple isn't stupid

If you're trying to address a complex business decision, you may have to develop equally complex models. Keep in mind, however, that an overly complex model may degrade the quality of those precious predictions you're after, making them more ambiguous. The simpler you keep your model, the more control you have over the quality of the model's outputs.
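
To see how extra complexity can hurt, here's a minimal sketch in Python. It isn't taken from the book: the data is made up, and the names fit_line, fit_memorizer, and mse are hypothetical helpers chosen for illustration. It compares a simple straight-line model with a very flexible model that memorizes every training point, and then scores both on data that neither model has seen.

```python
# Toy data, invented for illustration: y is roughly 2 * x plus alternating noise.
train = [(0, 1.0), (1, 1.0), (2, 5.0), (3, 5.0), (4, 9.0), (5, 9.0)]
# Holdout points follow the underlying trend y = 2 * x.
holdout = [(0.4, 0.8), (1.4, 2.8), (2.4, 4.8), (3.4, 6.8), (4.4, 8.8)]


def fit_line(points):
    """Simple model: ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(points):
    """Flexible model: return the target of the closest training point."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mse(model, points):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


simple = fit_line(train)
flexible = fit_memorizer(train)

for name, model in [("line", simple), ("memorizer", flexible)]:
    print(
        f"{name}: train MSE={mse(model, train):.3f}, "
        f"holdout MSE={mse(model, holdout):.3f}"
    )
```

On this toy data, the memorizing model reproduces the training set perfectly, noise and all, yet it does worse than the straight line on the holdout points. Real projects behave less neatly, but the pattern is the reason to compare models on data held back from training.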

Limiting the complexity of the model depends on knowing which variables to select before you even start building it, and that consideration leads right back to the people with domain knowledge. Your business experts are your best source for insights into which variables directly affect the business problem you're trying to solve. You can also decide empirically which variables to include or exclude, for example by measuring how strongly each candidate variable relates to the outcome you want to predict.

Use those insights to ensure that your training dataset includes most (if not all) of the data that you expect to use to build the model.
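
One possible way to make the empirical decision, sketched here in Python under stated assumptions: rank each candidate variable by the absolute value of its Pearson correlation with the target. The article doesn't prescribe this method, correlation captures only linear relationships, and the column names and numbers below are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def rank_variables(columns, target):
    """Sort candidate variables by |correlation| with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical candidate variables and target (made-up numbers).
columns = {
    "ad_spend": [10, 20, 30, 40, 50],
    "store_id": [3, 1, 4, 1, 5],
    "region_code": [7, 7, 7, 7, 7],
}
sales = [12, 19, 33, 38, 52]

ranking = rank_variables(columns, sales)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Treat a ranking like this as a conversation starter with your business experts, not as a replacement for them. A variable with a weak linear correlation can still matter in combination with others.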

Data preparation puts the good stuff in

Data preparation and cleaning go a long way toward ensuring the data quality that your model's success depends on. When you're examining your data, pay special attention to the following:

Data that was automatically collected (for example, from web forms)
Data that didn't undergo thorough screening
Data that wasn't collected via a controlled process
Data that may have out-of-range values, data-entry errors, and/or incorrect values
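
Checks like these can be partly automated. The following Python sketch flags missing, non-numeric, and out-of-range values so that someone can review them. The function name screen_records, the field names, and the valid ranges are all assumptions made for illustration; in practice, the acceptable ranges should come from your domain experts.

```python
def screen_records(records, valid_ranges):
    """Return (row_index, field, problem) tuples for values that look suspect."""
    issues = []
    for i, record in enumerate(records):
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value is None or value == "":
                issues.append((i, field, "missing"))
                continue
            try:
                number = float(value)
            except (TypeError, ValueError):
                issues.append((i, field, "not numeric"))
                continue
            if not low <= number <= high:
                issues.append((i, field, "out of range"))
    return issues


# Hypothetical rows, as they might arrive from a web form (all values are text).
records = [
    {"age": "34", "monthly_visits": "12"},
    {"age": "340", "monthly_visits": "3"},      # likely a data-entry error
    {"age": "", "monthly_visits": "7"},         # left blank
    {"age": "twenty", "monthly_visits": "-1"},  # wrong type, negative count
]

# Assumed acceptable ranges for each field.
valid_ranges = {"age": (0, 120), "monthly_visits": (0, 1000)}

issues = screen_records(records, valid_ranges)
for row, field, problem in issues:
    print(f"row {row}: {field} is {problem}")
```

The sketch only reports suspect values. Whether to correct, drop, or keep a flagged record is a separate decision that depends on your data and your business problem.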

Common mistakes that lead to the dreaded “garbage in, garbage out” scenario include these classic goofs:

Including more data than necessary
Building more complex models than necessary
Selecting bad predictor variables or features in your analysis
Using data that lacks sufficient quality and relevance

About This Article

About the book authors:

Anasse Bari, Ph.D. is a data science expert and a university professor who has many years of predictive modeling and data analytics experience.

Mohamed Chaouchi is a veteran software engineer who has conducted extensive research using data mining methods.

Tommy Jung is a software engineer with expertise in enterprise web applications and analytics.

This article can be found in the category:

General Data Science

This article is from the book:

Predictive Analytics For Dummies