Consider Data Misalignments When Working with AI

By John Paul Mueller, Luca Massaron

When collecting data for artificial intelligence algorithms, you must consider data misalignments and how to correct them. Every record in a dataset might contain data, yet that data might not align with the data in other datasets you own. For example, the numeric data in a field in one dataset might be a floating-point type (with a decimal point), but an integer type in another dataset. Before you can combine the two datasets, the fields must contain the same type of data.
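As a minimal sketch of this kind of type alignment, the following Python example converts a numeric field to floating point in two small datasets before combining them. The datasets, the `weight` field, and the `align_to_float` helper are all hypothetical and exist only for illustration.

```python
# Two hypothetical datasets that record the same measurement with different types.
dataset_a = [{"id": 1, "weight": 12.5}, {"id": 2, "weight": 7.0}]  # floating point
dataset_b = [{"id": 3, "weight": 9}, {"id": 4, "weight": 15}]      # integer


def align_to_float(records, field):
    """Return copies of the records with the given field converted to float."""
    return [{**record, field: float(record[field])} for record in records]


# Align both datasets to the same type, then combine them.
combined = align_to_float(dataset_a, "weight") + align_to_float(dataset_b, "weight")

# Collect the types that remain in the field after alignment.
field_types = {type(record["weight"]) for record in combined}
```

Converting integers to floating point is usually the safer direction because going the other way discards the fractional part of each value.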

All sorts of other kinds of misalignment can occur. For example, date fields are notorious for being formatted in various ways. To compare dates, the date formats must be the same. However, dates are also insidious in their propensity for looking the same, but not being the same. For example, dates in one dataset might use Greenwich Mean Time (GMT) as a basis, while the dates in another dataset might use some other time zone. Before you can compare the times, you must align them to the same time zone. It can become even weirder when the dates in one dataset come from a location that uses Daylight Saving Time (DST), but the dates in another dataset come from a location that doesn't.
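The following Python sketch shows one way to align dates that differ in both format and time zone. It assumes, purely for illustration, that one dataset stores GMT (UTC) times in year-month-day order and that the other stores local times at a fixed offset of five hours behind GMT in month/day/year order. The sample values and the `parse_to_utc` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumptions for illustration: dataset A records GMT (UTC) times,
# and dataset B records local times at a fixed offset of UTC-5.
UTC = timezone.utc
LOCAL_B = timezone(timedelta(hours=-5))

raw_a = "2024-03-01 17:00"   # year-month-day format
raw_b = "03/01/2024 12:00"   # month/day/year format


def parse_to_utc(text, date_format, source_zone):
    """Parse a date string, attach its source time zone, and convert to UTC."""
    naive = datetime.strptime(text, date_format)
    return naive.replace(tzinfo=source_zone).astimezone(UTC)


aligned_a = parse_to_utc(raw_a, "%Y-%m-%d %H:%M", UTC)
aligned_b = parse_to_utc(raw_b, "%m/%d/%Y %H:%M", LOCAL_B)

# The two strings look different but describe the same moment.
same_moment = aligned_a == aligned_b
```

A fixed offset doesn't account for DST. When a location changes its clocks during the year, a time zone database such as the one exposed by Python's `zoneinfo` module is likely a better fit than a hard-coded offset.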

Even when the data types and formats are the same, other data misalignments can occur. For example, the fields in one dataset may not match the fields in the other dataset. In some cases, these differences are easy to correct. One dataset may treat first and last name as a single field, while another dataset might use separate fields for first and last name. The answer is to change all datasets to use a single field or to change them all to use separate fields for first and last name. Unfortunately, many misalignments in data content are harder to figure out. In fact, you might not be able to figure them out at all. However, before you give up, consider these potential solutions to the problem:

  • Calculate the missing data from other data that you can access.
  • Locate the missing data in another dataset.
  • Combine datasets to create a whole that provides consistent fields.
  • Collect additional data from various sources to fill in the missing data.
  • Redefine your question so that you no longer need the missing data.
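
To make two of these ideas concrete, the following Python sketch aligns the first and last name fields described earlier and calculates a missing value from data that is already available. The datasets, the field names (`quantity`, `unit_price`, and `total`), and the `normalize` helper are hypothetical, and the example assumes that a missing total can be derived from the quantity and the unit price.

```python
def normalize(record):
    """Return a copy of the record with a single 'name' field and a 'total' field."""
    result = dict(record)
    # Align the name fields: join separate first and last names into one field.
    if "name" not in result:
        first = result.pop("first_name")
        last = result.pop("last_name")
        result["name"] = f"{first} {last}"
    # Calculate missing data: derive the total from quantity and unit price.
    if "total" not in result:
        result["total"] = result["quantity"] * result["unit_price"]
    return result


# Hypothetical datasets with mismatched fields.
dataset_a = [{"name": "Ada Lovelace", "quantity": 2, "unit_price": 4.5, "total": 9.0}]
dataset_b = [{"first_name": "Alan", "last_name": "Turing", "quantity": 3, "unit_price": 2.0}]

# Combine the datasets into a whole that provides consistent fields.
combined = [normalize(record) for record in dataset_a + dataset_b]
```

This sketch joins the names into a single field because that direction loses no information. Splitting a full name into first and last names is harder to do reliably, because not every name divides cleanly at a space.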