Using Relevant Data for Predictive Analytics: Avoid “Garbage In, Garbage Out”

By Dr. Anasse Bari, Mohamed Chaouchi, Tommy Jung

Predictive analytics begins with good data. More data doesn’t necessarily mean better data. A successful predictive analytics project requires, first and foremost, relevant and accurate data.

Keeping it simple isn’t stupid

If you’re trying to address a complex business decision, you may have to develop equally complex models. Keep in mind, however, that an overly complex model can end up fitting noise in the training data rather than the underlying pattern, which degrades the quality of those precious predictions you’re after and makes them harder to interpret. The simpler you keep your model, the more control you have over the quality of the model’s outputs.

Limiting the complexity of the model depends on knowing which variables to select before you even start building it, and that consideration leads right back to the people with domain knowledge. Your business experts are your best source for insights into which variables have a direct impact on the business problem you’re trying to solve. You can also decide empirically which variables to include or exclude, by measuring how strongly each candidate relates to the outcome you want to predict.

Use those insights to ensure that your training dataset includes most (if not all) of the data that you expect to use to build the model.
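
As a rough illustration of deciding empirically which variables to keep, the sketch below ranks candidate variables by how strongly each one correlates with the outcome. Correlation screening is only one of many possible techniques and is not necessarily what the authors have in mind; the function names, the toy data, and the 0.3 cutoff are all assumptions made up for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def rank_variables(columns, target, min_abs_corr=0.3):
    """Keep variables whose |correlation| with the target meets the cutoff,
    strongest first. The 0.3 default is an arbitrary, illustrative threshold."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    kept = [(name, r) for name, r in scores.items() if abs(r) >= min_abs_corr]
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical toy data: three candidate variables for six customers,
# plus whether each customer churned (1) or stayed (0).
candidates = {
    "support_calls": [1, 4, 2, 6, 0, 5],
    "account_age_months": [30, 5, 24, 3, 40, 8],
    "shoe_size": [9, 9, 9, 9, 9, 9],
}
churned = [0, 1, 0, 1, 0, 1]

selected = rank_variables(candidates, churned)
for name, r in selected:
    print(f"{name}: {r:+.2f}")
```

In this toy data, shoe_size is dropped because it never varies, while the other two variables survive the cutoff. A screen like this only detects linear, one-variable-at-a-time relationships, so treat it as a starting point to review with your business experts rather than a final answer.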

Data preparation puts the good stuff in

Data preparation and cleaning go a long way toward ensuring the data quality that the success of your model depends on. When you’re examining your data, pay special attention to

  • Data that was automatically collected (for example, from web forms)
  • Data that didn’t undergo thorough screening
  • Data that wasn’t collected via a controlled process
  • Data that may have out-of-range values, data-entry errors, and/or incorrect values
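
The kinds of problems in the list above can often be caught with a simple screening pass before any modeling starts. The following is a minimal sketch, assuming records arrive as dictionaries of strings (as they might from a web form); the field names, the valid ranges, and the find_bad_rows helper are hypothetical and would need to come from your own domain knowledge.

```python
def find_bad_rows(rows, valid_ranges):
    """Return (row_index, field, raw_value, reason) for every value that
    is not numeric or falls outside its allowed range."""
    problems = []
    for i, row in enumerate(rows):
        for field, (low, high) in valid_ranges.items():
            raw = row.get(field)
            try:
                value = float(raw)
            except (TypeError, ValueError):
                # Missing field, blank entry, or text typed into a numeric field
                problems.append((i, field, raw, "not a number"))
                continue
            if not low <= value <= high:
                problems.append((i, field, raw, "out of range"))
    return problems


# Hypothetical records and limits, chosen only for illustration
rows = [
    {"age": "34", "income": "52000"},
    {"age": "340", "income": "48000"},  # probably a data-entry error
    {"age": "", "income": "61000"},     # field left blank
    {"age": "41", "income": "-5"},      # negative income is out of range
]
valid_ranges = {"age": (0, 120), "income": (0, 10_000_000)}

problems = find_bad_rows(rows, valid_ranges)
for index, field, raw, reason in problems:
    print(f"row {index}: {field}={raw!r} ({reason})")
```

Here the first record passes and the other three are flagged. The sketch only reports suspect values; whether to correct, drop, or keep them is a judgment call that depends on the business problem.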

The dreaded “garbage in, garbage out” scenario usually traces back to one of these classic goofs:

  • Including more data than necessary
  • Building more complex models than necessary
  • Selecting bad predictor variables or features in your analysis
  • Using data that lacks sufficient quality and relevance