How to Validate Your Data with Python for Data Science

By John Paul Mueller, Luca Massaron

When it comes to data, no one really knows what a large database contains. Python can help data scientists with that issue. You must validate your data before you use it to ensure that the data is at least close to what you expect it to be.

Validation ensures that you can perform an analysis of the data and reasonably expect that analysis to succeed. Later, you still need to massage the data further to obtain the sort of results you need to perform your task.

Figuring out what’s in your data

Finding duplicates in your data is important because you end up

  • Spending more computational time to process duplicates, which slows your algorithms down.

  • Obtaining false results because duplicates implicitly overweight the results. Because some entries appear more than once, the algorithm considers these entries more important. (The short sketch after this list shows the effect.)
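
For instance, here’s a minimal sketch with made-up numbers showing how a single duplicated record pulls a simple statistic toward the repeated value:

import pandas as pd
# Hypothetical scores; the value 90 appears twice because of a duplicate record.
clean = pd.Series([60, 70, 80, 90])
with_duplicate = pd.Series([60, 70, 80, 90, 90])
# The duplicate pulls the average toward the repeated value.
print(clean.mean())            # 75.0
print(with_duplicate.mean())   # 78.0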

As a data scientist, you want your data to enthrall you, so it’s time to get it to talk to you through the wonders of pandas, as shown in the following example:

from lxml import objectify
import pandas as pd
# Read the XML file and obtain the root node.
xml = objectify.parse(open('XMLData2.xml'))
root = xml.getroot()
# Build one dictionary per record, then create the DataFrame in a single step.
rows = []
for record in root.iterchildren():
    obj = list(record.iterchildren())
    rows.append(dict(zip(['Number', 'String', 'Boolean'],
                         [obj[0].text, obj[1].text, obj[2].text])))
df = pd.DataFrame(rows, columns=['Number', 'String', 'Boolean'])
# duplicated() returns a Boolean Series that marks each repeated row with True.
search = df.duplicated()
print(df)
print()
print(search[search == True])

This example shows how to find duplicate rows. It relies on a modified version of the XMLData.xml file, XMLData2.xml, which contains a simple repeated row. A real data file contains thousands (or more) of records and possibly hundreds of repeats, but this simple example does the job. The example begins by reading the data file into memory. It then places the data into a DataFrame.

At this point, your data is corrupted because it contains a duplicate row. However, you can get rid of the duplicated row by searching for it. The first task is to create a search object containing a list of duplicated rows by calling duplicated() on the DataFrame. The duplicated rows contain a True next to their row number.

Of course, now you have an unordered list of rows that are and aren’t duplicated. The easiest way to determine which rows are duplicated is to create an index in which you use search == True as the expression. Following is the output you see from this example. Notice that row 1 is duplicated in the DataFrame output and that row 1 is also called out in the search results:

  Number  String Boolean
0      1   First    True
1      1   First    True
2      2  Second   False
3      3   Third    True

1    True
dtype: bool
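
You don’t need the XML file to experiment with duplicated(); here’s a minimal, self-contained sketch using the same values, which also shows the keep and subset arguments that control which occurrences get marked:

import pandas as pd
# The same values as the XML example, built inline.
df = pd.DataFrame({'Number': ['1', '1', '2', '3'],
                   'String': ['First', 'First', 'Second', 'Third'],
                   'Boolean': ['True', 'True', 'False', 'True']})
# By default, duplicated() marks the second and later occurrences of a row.
print(df.duplicated())
# keep='last' marks the earlier occurrences instead.
print(df.duplicated(keep='last'))
# subset= limits the comparison to the named columns.
print(df.duplicated(subset=['Number']))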

Removing duplicates

To get a clean dataset, you want to remove the duplicates from it. Fortunately, pandas does it for you, as shown in the following example:

from lxml import objectify
import pandas as pd
# Read the XML file and rebuild the DataFrame, duplicate row included.
xml = objectify.parse(open('XMLData2.xml'))
root = xml.getroot()
rows = []
for record in root.iterchildren():
    obj = list(record.iterchildren())
    rows.append(dict(zip(['Number', 'String', 'Boolean'],
                         [obj[0].text, obj[1].text, obj[2].text])))
df = pd.DataFrame(rows, columns=['Number', 'String', 'Boolean'])
# drop_duplicates() returns a copy of the DataFrame with the repeated rows removed.
print(df.drop_duplicates())

As with the previous example, you begin by creating a DataFrame that contains the duplicate record. To remove the errant record, all you need to do is call drop_duplicates(). Here’s the result you get.

  Number  String Boolean
0      1   First    True
2      2  Second   False
3      3   Third    True
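
As with duplicated(), drop_duplicates() accepts keep and subset arguments. Here’s a small sketch of how they change the result, again using inline data rather than the XML file:

import pandas as pd
# The same values as the XML example, built inline.
df = pd.DataFrame({'Number': ['1', '1', '2', '3'],
                   'String': ['First', 'First', 'Second', 'Third'],
                   'Boolean': ['True', 'True', 'False', 'True']})
# keep='last' retains the final occurrence of each duplicated row.
print(df.drop_duplicates(keep='last'))
# subset= drops rows that repeat in just the named columns.
print(df.drop_duplicates(subset=['Number']))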

Creating a data map and data plan

You need to know what your dataset contains. A data map is an overview of the dataset. You use it to spot potential problems in your data, such as

  • Redundant variables

  • Possible errors

  • Missing values

  • Variable transformations

Checking for these problems goes into a data plan, which is a list of tasks you have to perform to ensure the integrity of your data. The following example shows a data map, A, with two datasets, B and C:

import pandas as pd
# Column A acts as the data map: 0 marks the first series, 1 marks the second.
df = pd.DataFrame({'A': [0, 0, 0, 0, 0, 1, 1],
                   'B': [1, 2, 3, 5, 4, 2, 5],
                   'C': [5, 3, 4, 1, 1, 2, 3]})
# Group datasets B and C by the values in A, then summarize each group.
a_group_desc = df.groupby('A').describe()
print(a_group_desc)

In this case, the data map uses 0s for the first series and 1s for the second series. The groupby() function places the datasets, B and C, into groups. To determine whether the data map is viable, you obtain statistics using describe(). What you end up with is a dataset B, series 0 and 1, and dataset C, series 0 and 1, as shown in the following output.

                B         C
A
0 count  5.000000  5.000000
  mean   3.000000  2.800000
  std    1.581139  1.788854
  min    1.000000  1.000000
  25%    2.000000  1.000000
  50%    3.000000  3.000000
  75%    4.000000  4.000000
  max    5.000000  5.000000
1 count  2.000000  2.000000
  mean   3.500000  2.500000
  std    2.121320  0.707107
  min    2.000000  2.000000
  25%    2.750000  2.250000
  50%    3.500000  2.500000
  75%    4.250000  2.750000
  max    5.000000  3.000000

Splitting the two datasets into these specific cases is the essence of the data plan. As you can see, the statistics tell you that this data plan may not be viable because some of the group statistics are relatively far apart.
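
One quick check worth making (a sketch that assumes the df created above): group 1 contains only two observations, so its statistics rest on very little data.

# size() reports how many observations fall into each group of the data map.
print(df.groupby('A').size())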

The output from describe() can be hard to read, but you can break it apart, as shown here:

unstacked = a_group_desc.unstack()
print(unstacked)

Using unstack() creates a new presentation so that you can see it better:

       B
   count mean       std  min   25%  50%   75%  max
A
0      5  3.0  1.581139    1  2.00  3.0  4.00    5
1      2  3.5  2.121320    2  2.75  3.5  4.25    5

       C
   count mean       std  min   25%  50%   75%  max
A
0      5  2.8  1.788854    1  1.00  3.0  4.00    5
1      2  2.5  0.707107    2  2.25  2.5  2.75    3

Of course, you may not want all the data that describe() provides. Here’s how you reduce the size of the information output:

print(unstacked.loc[:, (slice(None), ['count', 'mean'])])

Using loc lets you obtain specific columns. Here’s the final output from the example showing just the information you absolutely need to make a decision:

      B          C
  count mean count mean
A
0     5  3.0     5  2.8
1     2  3.5     2  2.5
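
If you know in advance which statistics matter for the decision, a leaner route is to ask for them directly rather than trimming the describe() output after the fact. Here’s a sketch that uses groupby() with agg() to produce the same numbers:

import pandas as pd
# Rebuild the data map and request only the statistics you need.
df = pd.DataFrame({'A': [0, 0, 0, 0, 0, 1, 1],
                   'B': [1, 2, 3, 5, 4, 2, 5],
                   'C': [5, 3, 4, 1, 1, 2, 3]})
# agg() computes just the named statistics for each group in the data map.
print(df.groupby('A').agg(['count', 'mean']))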