Phase 3 of the CRISP-DM Process Model: Data Preparation

By Meta S. Brown

Data miners spend most of their time on the third phase of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model: data preparation. Most data used for data mining was originally collected and preserved for other purposes and needs some refinement before it is ready to use for modeling.

The data preparation phase includes five tasks:

  • Selecting data

  • Cleaning data

  • Constructing data

  • Integrating data

  • Formatting data

The CRISP-DM step-by-step guide does not explicitly mention datasets as deliverables for each of the data preparation tasks, but those datasets had darn well better exist and be properly archived and documented. Datasets won’t correspond one-to-one with tasks, but information about the data used should be included in each deliverable report.

Task: Selecting data

In this task, you decide which portion of the data that you have will actually be used for data mining.

The deliverable for this task is the rationale for inclusion and exclusion. In it, you’ll explain what data will, and will not, be used for further data-mining work.

You’ll explain the reasons for including or excluding each part of the data that you have, based on relevance to your goals, data quality, and technical issues — such as limits to the number of fields or rows that your tools can handle, or the suitability of the data formats for your needs.
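One way to keep the rationale tied to the selection itself is to record a reason for every field, whether you include it or not. The sketch below is only an illustration in Python: the field names, the reasons, and the helpers `select_fields` and `rationale_report` are hypothetical, not part of CRISP-DM.

```python
# Hypothetical field-level decisions: name -> (include?, reason).
FIELD_DECISIONS = {
    "customer_id": (True, "needed to link records"),
    "order_date": (True, "relevant to the delivery-time goal"),
    "fax_number": (False, "not relevant to the goals"),
    "notes": (False, "free text; the assumed modeling tool cannot use it"),
}


def select_fields(rows, decisions):
    """Keep only the included fields from each row (a dict)."""
    keep = [name for name, (include, _) in decisions.items() if include]
    return [{name: row.get(name) for name in keep} for row in rows]


def rationale_report(decisions):
    """Return one plain-text line per field for the deliverable."""
    lines = []
    for name, (include, reason) in decisions.items():
        status = "INCLUDED" if include else "EXCLUDED"
        lines.append(f"{name}: {status} ({reason})")
    return lines


rows = [
    {"customer_id": 1, "order_date": "2014-03-01", "fax_number": "", "notes": "call first"},
]
selected = select_fields(rows, FIELD_DECISIONS)
report = rationale_report(FIELD_DECISIONS)
```

Because the same table drives both the selection and the report, the two are less likely to drift apart as your decisions change.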

Task: Cleaning data

The data that you’ve chosen to use is unlikely to be perfectly clean (error-free). You’ll make changes, perhaps tracking down sources to make specific data corrections, excluding some cases or individual cells (items of data), or replacing some items of data with default values or with values estimated by a more sophisticated modeling technique. You may choose to use only subsets of the data for all or some of your data-mining work.

The deliverable for this task is the data-cleaning report, which documents, in excruciating detail, every decision and action used to clean your data. This report should cover and refer to each data quality problem that was identified in the verify data quality task in the data-understanding phase of the process. Your report should also address the potential impact on results of the choices you have made during data cleaning.
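A cleaning routine can write its own log as it goes, which gives you raw material for the data-cleaning report. Here is a minimal sketch, assuming rows stored as Python dictionaries. The field names (`customer_id`, `quantity`), the default value, and the rule for excluding rows are all assumptions made for illustration.

```python
def clean_rows(rows, default_quantity=1):
    """Return (cleaned_rows, log). default_quantity is an assumed default."""
    cleaned, log = [], []
    for index, row in enumerate(rows):
        row = dict(row)  # copy so the raw data stays untouched
        if row.get("customer_id") is None:
            # Assumed rule: a case with no customer cannot be used.
            log.append(f"row {index}: excluded, missing customer_id")
            continue
        if row.get("quantity") is None:
            row["quantity"] = default_quantity
            log.append(
                f"row {index}: quantity missing, set to default {default_quantity}"
            )
        cleaned.append(row)
    return cleaned, log


raw = [
    {"customer_id": 1, "quantity": 3},
    {"customer_id": None, "quantity": 2},
    {"customer_id": 2, "quantity": None},
]
cleaned, cleaning_log = clean_rows(raw)
```

The log records what was done, not what it means for your results. You still have to write the part of the report that discusses the impact of each choice.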

Task: Constructing data

You may need to derive some new fields (for example, use the delivery date and the date when a customer placed an order to calculate how long the customer waited to receive an order), aggregate data, or otherwise create a new form of data.

Deliverables for this task include two reports:

  • Derived attributes: A report that describes what new fields (columns) you have constructed, how you did it, and why.

  • Generated records: A report that describes what new cases (rows) you have constructed, how you did it, and why.
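The waiting-time example mentioned for this task can be sketched in a few lines of Python. This is one possible approach, not a prescribed method: the field names (`order_date`, `delivery_date`, `wait_days`) and the sample orders are made up for illustration.

```python
from datetime import date


def add_wait_days(orders):
    """Return copies of the orders with a derived `wait_days` field."""
    derived = []
    for order in orders:
        new_order = dict(order)  # leave the source records unchanged
        # Derived attribute: days between placing and receiving the order.
        new_order["wait_days"] = (order["delivery_date"] - order["order_date"]).days
        derived.append(new_order)
    return derived


orders = [
    {"order_id": "A1", "order_date": date(2014, 3, 1), "delivery_date": date(2014, 3, 8)},
    {"order_id": "A2", "order_date": date(2014, 3, 30), "delivery_date": date(2014, 4, 2)},
]
with_wait = add_wait_days(orders)
# with_wait[0]["wait_days"] is 7 and with_wait[1]["wait_days"] is 3
```

The docstring and comment cover the "what" and "how" of the derived attribute. The "why" still belongs in your derived attributes report.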

Although the integrating data and formatting data tasks are listed last in this phase of the process, they don’t always come last, and they may not come up just once. You might have to do some merging or reformatting early in the data preparation phase.

Task: Integrating data

Your data may now be in several disparate datasets. You’ll need to merge some or all of those disparate datasets together to get ready for the modeling phase.

The deliverable for this task is the merged data. (And it would not hurt to document how the merge was performed.)
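As a rough illustration of a merge that documents itself, the Python sketch below joins two small datasets on a shared key and keeps track of the records that found no match. The `merge_on_key` helper, the field names, and the choice of a left join are assumptions made for this example. It also assumes that the key is unique in the second dataset.

```python
def merge_on_key(left_rows, right_rows, key):
    """Left join: every left row is kept; matching right fields are added.

    Assumes `key` is unique in right_rows; a later duplicate would
    silently replace an earlier one in the lookup table.
    """
    right_by_key = {row[key]: row for row in right_rows}
    merged, unmatched = [], []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is None:
            unmatched.append(row[key])  # note this for the merge documentation
            merged.append(dict(row))
        else:
            merged.append({**row, **match})
    return merged, unmatched


orders = [
    {"order_id": "A1", "customer_id": 1},
    {"order_id": "A2", "customer_id": 3},
]
customers = [{"customer_id": 1, "region": "West"}]
merged, unmatched = merge_on_key(orders, customers, "customer_id")
# unmatched is [3]: order A2 refers to a customer missing from `customers`
```

The list of unmatched keys is worth including when you document how the merge was performed, because it shows which cases ended up with incomplete data.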

Task: Formatting data

Data often comes to you in formats other than the ones that are most convenient for modeling. (Format changes are usually driven by the design of your tools.) So convert those formats now.

The deliverable for this task is your reformatted data. (And a little report describing the changes you have made would be a smart thing to include.)
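Here is a small sketch of a format conversion in Python, using only the standard library. The source formats (month/day/year dates, amounts with thousands separators) and the target (ISO dates, plain numbers, CSV output) are assumptions chosen for illustration. Your own tools will dictate the formats you actually need.

```python
import csv
import io
from datetime import datetime


def reformat_row(row):
    """Convert text fields to the types an assumed modeling tool expects."""
    return {
        "order_id": row["order_id"].strip(),
        # Assumed source format: month/day/year text, converted to ISO 8601.
        "order_date": datetime.strptime(row["order_date"], "%m/%d/%Y").date().isoformat(),
        # Assumed source format: text with thousands separators.
        "amount": float(row["amount"].replace(",", "")),
    }


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


raw = [{"order_id": " A1 ", "order_date": "03/01/2014", "amount": "1,250.50"}]
formatted = [reformat_row(row) for row in raw]
csv_text = to_csv(formatted)
```

Each conversion in `reformat_row` is a change that your little report should describe: trimmed identifiers, a new date format, and amounts stored as numbers.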

You should end the data preparation phase of the data-mining process with a dataset ready for modeling and a thorough report describing the dataset.