Gathering and Cleaning Data for Machine Learning

By John Paul Mueller, Luca Massaron

Although machines learn from data, no algorithm works magic with whatever data it receives (the “no free lunch” theorem states that no single algorithm is best for every problem). Even sophisticated and advanced learning functions hit the wall and underperform when you don’t support them with the following:

  • Large enough quantities of data that are suitable for the algorithm you use
  • Clean, well-prepared data suitable for use in machine learning

Data quantity matters because of the trade-off between bias and variance. As a reminder, large quantities of data help when the variability of the estimates (the variance) is the problem, that is, when the specific data used for learning heavily influences predictions (the overfitting problem). More data helps because a larger number of examples lets machine learning algorithms disambiguate the role of each signal picked up from the data and used in modeling the prediction.
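The effect of data quantity on the variability of estimates can be illustrated with a small simulation. The following is a minimal sketch, not a method taken from the text: it assumes a hypothetical ground truth (y = 2x plus Gaussian noise), fits a one-feature least-squares slope on many random samples, and measures how much the fitted slope changes from sample to sample. The function names fit_slope and slope_spread are invented for this illustration.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Ordinary least-squares slope for a single feature."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def slope_spread(n_examples, n_trials=200, seed=0):
    """Standard deviation of the fitted slope across repeated samples."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_trials):
        xs = [rng.uniform(0, 10) for _ in range(n_examples)]
        # Assumed ground truth for the illustration: y = 2x + noise.
        ys = [2.0 * x + rng.gauss(0, 5) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return statistics.stdev(slopes)


small = slope_spread(20)
large = slope_spread(2000)
print(f"slope spread with 20 examples:   {small:.3f}")
print(f"slope spread with 2000 examples: {large:.3f}")
```

With the larger samples, the fitted slope should vary much less between trials, which is the sense in which more data reduces the influence of any particular sample on the model.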

Besides data quantity, the need for data cleanliness is understandable: it’s just like the quality of teaching you get at school. If your teachers teach you only nonsense, give erroneous examples, spend time joking, and in other ways don’t take teaching seriously, you won’t do well on your examinations no matter how smart you are. The same is true for both simple and complex algorithms. If you feed them garbage data, they just produce nonsense predictions.

According to the principle of garbage in, garbage out (GIGO for short), bad data can truly harm machine learning. Bad data includes missing data, outliers, skewed value distributions, redundant information, and poorly explained features.
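Some of these problems can be spotted with a quick profile of each column before any learning takes place. The sketch below is one possible approach under stated assumptions: missing values are represented as None, outliers are flagged with Tukey's 1.5 × IQR rule (a common convention, not one the text prescribes), and a gap between mean and median serves as a rough hint of skew. It does not check for redundancy. The helper profile_column and the sample ages list are hypothetical.

```python
import statistics


def profile_column(values):
    """Summarize common data-quality problems in one numeric column."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    # Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]

    # A mean far from the median hints at a skewed distribution.
    return {
        "missing": missing,
        "outliers": outliers,
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
    }


# Hypothetical column: two missing entries and one implausible age.
ages = [23, 25, None, 31, 29, 27, 240, 26, None, 30]
report = profile_column(ages)
print(report)
```

A report like this only points at suspicious values; deciding whether to drop, correct, or keep them still depends on what the data is supposed to represent.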

Bad data may not be bad in the sense that it’s wrong. Quite often, bad data is just data that doesn’t comply with the standards you set for your data: a label written in many different ways; erratic values spilled over from other data fields; dates written in invalid formats; and unstructured text that you should have structured into a categorical variable.
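Two of the problems just listed, labels written in many ways and dates in invalid formats, lend themselves to simple cleaning functions. The following is a minimal sketch: the CANONICAL_LABELS mapping, the accepted date formats, and the function names are all assumptions made for illustration, and a real project would define its own standards.

```python
from datetime import datetime

# Hypothetical mapping from messy spellings to one canonical label.
CANONICAL_LABELS = {
    "us": "US",
    "usa": "US",
    "u.s.": "US",
    "u.s.a.": "US",
    "united states": "US",
}


def normalize_label(raw):
    """Return the canonical label, or None when the spelling is unknown."""
    return CANONICAL_LABELS.get(raw.strip().lower())


def parse_date(raw, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Try each accepted format; return a date, or None if none matches."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


print(normalize_label(" U.S.A. "))  # known spelling, mapped to the canonical form
print(normalize_label("Mars"))      # unknown spelling, left for manual review
print(parse_date("31/12/2020"))     # valid date in an accepted format
print(parse_date("2021-02-30"))     # impossible date, rejected
```

Returning None for anything unrecognized, rather than guessing, keeps nonconforming values visible so that you can review them instead of silently feeding them to a learning algorithm.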

Enforcing data validity rules in your databases, designing better data tables, and tightening the process that stores data can prove invaluable for machine learning and let you concentrate on solving trickier data problems.
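As one illustration of enforcing validity rules at the database level, the sketch below uses SQLite (through Python's standard sqlite3 module) with NOT NULL and CHECK constraints so that nonconforming rows are rejected at insert time. The customers table, its columns, the allowed country codes, and the age range are hypothetical choices for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        country TEXT    NOT NULL CHECK (country IN ('US', 'IT', 'FR')),
        age     INTEGER NOT NULL CHECK (age BETWEEN 0 AND 120)
    )
    """
)


def try_insert(country, age):
    """Insert a row; return True if the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customers (country, age) VALUES (?, ?)",
                (country, age),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised when a NOT NULL or CHECK constraint is violated.
        return False


print(try_insert("US", 34))    # conforms to both rules
print(try_insert("usa", 34))   # label not in the allowed set
print(try_insert("IT", 240))   # age outside the allowed range
print(try_insert("FR", None))  # missing value
```

Rejecting bad rows when they are stored means less cleaning is needed later, at the cost of having to decide up front which values count as valid.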