How to Clean Data for Predictive Analysis
Before running a predictive analysis, you need to clean the data of extraneous material. That means finding and correcting records that contain erroneous values, attempting to fill in missing values, and deciding whether to include duplicate records (two accounts for the same customer, for example).
The overall goal is to ensure the integrity of the information you’re using to build your predictive model. Pay particular attention to the completeness, correctness, and timeliness of the data.
It’s useful to create descriptive statistics (quantitative summaries) for various fields: calculate minimum and maximum values, check frequency distributions (how often each value occurs), and verify that values fall within expected ranges. Running such checks regularly helps you flag any data outside the expected range for further investigation; records showing retirees with birth dates in the 1990s, for instance, would be caught this way.
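The checks above can be sketched in a few lines of Python. The records, field names, and the 1965 cutoff below are illustrative assumptions, not part of any standard:

```python
from collections import Counter

# Hypothetical customer records; field names and values are made up
# for illustration.
records = [
    {"id": 1, "status": "retired", "birth_year": 1948},
    {"id": 2, "status": "active",  "birth_year": 1985},
    {"id": 3, "status": "retired", "birth_year": 1994},  # suspicious
    {"id": 4, "status": "active",  "birth_year": 2001},
]

# Descriptive statistics: min/max and a frequency distribution.
years = [r["birth_year"] for r in records]
print("min:", min(years), "max:", max(years))
print("frequency:", Counter(r["status"] for r in records))

# Flag retirees whose birth year falls outside the expected range
# (the cutoff year is an assumption for this sketch).
flagged = [r for r in records
           if r["status"] == "retired" and r["birth_year"] > 1965]
print("flagged for review:", [r["id"] for r in flagged])
```

Here record 3 (a retiree born in 1994) is the one that gets flagged for human review rather than silently corrected.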
Cross-checking the information against other sources also helps confirm that the data is accurate. For a deeper look at the data’s characteristics, and to identify relationships between records, you can use data-profiling tools (which analyze data availability and gather statistics on data quality) and visualization tools.
Some data is missing simply because particular information was never recorded. In such cases, you can attempt to fill in as much as you can; suitable defaults can often be added to fill in the blanks of certain fields.
For example, if the gender field is missing for a patient in a hospital maternity ward, the application can simply fill it in as female. Likewise, if a male patient’s record is missing a value for pregnancy status, it can be filled in as not applicable.
A missing zip code for an address can be inferred from the street name and the city provided in that address.
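The fill-in rules from the hospital example can be expressed as a small function. The field names and rules below are assumptions made for this sketch; adapt them to your own schema:

```python
def fill_missing(record):
    """Apply rule-based defaults to a patient record (illustrative rules)."""
    rec = dict(record)  # don't mutate the caller's record
    # Patients in a maternity ward are female by definition of the ward.
    if rec.get("ward") == "maternity" and not rec.get("gender"):
        rec["gender"] = "female"
    # Pregnancy status is not applicable for male patients.
    if rec.get("gender") == "male" and not rec.get("pregnancy_status"):
        rec["pregnancy_status"] = "not applicable"
    return rec

patients = [
    {"id": 1, "ward": "maternity",  "gender": None,   "pregnancy_status": "pregnant"},
    {"id": 2, "ward": "cardiology", "gender": "male", "pregnancy_status": None},
]
cleaned = [fill_missing(p) for p in patients]
```

The point of encoding the rules as code rather than fixing records by hand is that the same defaults get applied consistently every time new data arrives.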
When the information is unknown and cannot be inferred, use something other than a blank to indicate that the data is missing, so the gap doesn’t affect the correctness of the analysis. A blank in the data can mean many things, most of them neither good nor useful. Whenever you can, specify the nature of that blank with a meaningful placeholder.
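One simple way to do this is to replace ambiguous blanks with explicit marker strings. The marker values below are assumptions; pick whatever your downstream tools recognize:

```python
def label_blank(value, marker="unknown"):
    """Return the value unchanged, or an explicit marker for blanks."""
    if value is None or (isinstance(value, str) and not value.strip()):
        return marker
    return value

row = {"name": "Ada", "phone": "", "email": None}
labeled = {key: label_blank(val) for key, val in row.items()}
# 'phone' and 'email' now read "unknown" instead of an ambiguous blank.
```

Later analysis can then distinguish "we asked and don't know" from "this field was never collected," provided you use different markers for each case.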
Just as it’s possible to define a rose in a cornfield as a weed, outliers can mean different things to different analyses. It’s common for some models to be built solely to track those outliers and flag them.
Fraud-detection and criminal-activity-monitoring models are specifically interested in those outliers, which in such cases indicate something unwanted taking place, so in cases like these keeping the outliers in the dataset is recommended. When outliers are merely anomalies in the data, however, they only skew the analyses and lead to erroneous results; remove them from your data.
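A minimal sketch of outlier screening, using a z-score test from the standard library. The transaction amounts and the threshold of 2 standard deviations are illustrative assumptions (common rules of thumb range from 2 to 3); whether you then keep or drop the flagged values depends on the analysis, as discussed above:

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Six routine transaction amounts and one suspicious spike.
amounts = [102, 98, 105, 97, 101, 99, 5000]
suspicious = flag_outliers(amounts)
```

Note that a single extreme value inflates the standard deviation itself, so for heavily skewed data a robust screen based on the median and interquartile range tends to work better than this simple z-score version.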
Duplication in the data can also be useful or a nuisance. Some of it is necessary, can indicate value, and can reflect an accurate state of the data. For example, a customer with multiple accounts may legitimately appear in multiple entries that are (technically, anyway) duplicates of the same record.
By the same token, when duplicate records don’t contribute value to the analysis and aren’t necessary, removing them can be of tremendous value. This is especially true for large datasets, where removing duplicates can reduce the complexity of the data and the time needed for analysis.
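A basic deduplication pass can be written with a set of seen keys. The key fields below are assumptions for illustration; note that including the account number in the key preserves the legitimate multiple-accounts case from the example above, while dropping exact repeats:

```python
def deduplicate(records, key_fields=("name", "email", "account")):
    """Keep the first occurrence of each key, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

customers = [
    {"name": "Ann Lee", "email": "ann@example.com", "account": "A-1"},
    {"name": "Ann Lee", "email": "ann@example.com", "account": "A-1"},  # exact repeat
    {"name": "Ann Lee", "email": "ann@example.com", "account": "A-2"},  # second account
]
unique = deduplicate(customers)
```

Choosing the key fields is the real design decision: key on too few fields and you delete meaningful records; key on every field and you only catch byte-for-byte repeats.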
You can prevent incorrect data from entering your systems in the first place by adopting a few specific procedures:
Institute quality checks and data validation for all data being collected.
Allow your customers to validate and self-correct their personal data.
Provide your clients with possible and expected values to choose from.
Routinely run checks on the integrity, consistency, and accuracy of the data.
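The first and last of these procedures can be sketched as a validation function run at entry time and again on a schedule. The allowed values and the accepted birth-year range below are assumptions for this sketch:

```python
# Expected values offered to clients at entry time (illustrative).
ALLOWED_STATUSES = {"active", "inactive", "retired"}

def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if record.get("status") not in ALLOWED_STATUSES:
        problems.append("status must be one of %s" % sorted(ALLOWED_STATUSES))
    year = record.get("birth_year")
    if not isinstance(year, int) or not (1900 <= year <= 2025):
        problems.append("birth_year outside expected range")
    return problems

ok = validate({"status": "active", "birth_year": 1985})
bad = validate({"status": "??", "birth_year": 1850})
```

The same function serves both purposes: reject bad records as they arrive, and run it across the stored dataset routinely to catch anything that slipped through.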