Phase 4 of the CRISP-DM Process Model: Modeling - dummies

By Meta S. Brown

Modeling is the part of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model that most data miners like best. Your data is already in good shape, and now you can search for useful patterns in your data.

The modeling phase includes four tasks:

  • Selecting modeling techniques

  • Designing test(s)

  • Building model(s)

  • Assessing model(s)

Task: Selecting modeling techniques

The wonderful world of data mining offers oodles of modeling techniques, but not all of them will suit your needs. Narrow the list based on the kinds of variables involved, the selection of techniques available in your tools, and any business considerations that are important to you.

For example, many organizations favor methods with output that’s easy to interpret, so decision trees or logistic regression might be acceptable where neural networks probably would not be.

Deliverables for this task include two reports:

  • Modeling technique: Specify the technique(s) that you will use.

  • Modeling assumptions: Many modeling techniques are based on certain assumptions. For example, a model type may be intended for use with data that has a specific type of distribution. Document these assumptions in this report.

Statisticians are well-informed, strict, and fussy about assumptions. That’s not necessarily true of data miners, and it’s not a requirement to become a data miner. If you have deep statistical knowledge and understand the assumptions behind the models you select, you can be strict and fussy about assumptions.

But many data miners, especially novice data miners, don’t fuss much over assumptions. The alternative is testing — lots and lots of testing — of your models.
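If you do want a quick look at one common kind of assumption, the shape of a variable's distribution, a rough check might look like the sketch below. It is only an illustration: the `skewness` helper and the two small lists are made up for this example, and a value near zero merely suggests a roughly symmetric variable. It does not prove that any particular model's assumptions hold.

```python
import statistics


def skewness(values):
    """Return the population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical variables, invented for illustration only.
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 1, 1, 2, 2, 3, 20]

print(round(skewness(symmetric), 3))     # close to zero: roughly symmetric
print(round(skewness(right_skewed), 3))  # clearly positive: long right tail
```

Whatever you find, the result belongs in your modeling assumptions report.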

Task: Designing tests

The test in this task is the test that you’ll use to determine how well your model works. It may be as simple as splitting your data into a group of cases for model training and another group for model testing.

Training data is used to fit mathematical forms to the data. Test data is used during the model-training process to guard against overfitting, that is, building a model that fits one dataset perfectly but no other. You may also set aside holdout data, which is not used at all during model training, for an additional test.

The deliverable for this task is your test design. It need not be elaborate, but you should at least take care that your training and test data are similar and that you avoid introducing any bias into the data.
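A simple test design of this kind can be sketched in a few lines. The example below is a minimal illustration, not a prescribed method: the function name `split_cases`, the 60/20/20 proportions, and the fixed seed are all assumptions chosen for the example. Shuffling before splitting is one basic way to keep the groups similar to each other.

```python
import random


def split_cases(cases, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle the cases, then split them into training, test, and holdout groups."""
    if train_frac <= 0 or test_frac < 0 or train_frac + test_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(cases)                     # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    holdout = shuffled[n_train + n_test:]      # whatever remains is held out
    return train, test, holdout


# Hypothetical data: 100 case IDs standing in for real records.
cases = list(range(100))
train, test, holdout = split_cases(cases)
print(len(train), len(test), len(holdout))  # 60 20 20
```

Recording the proportions and the seed in your test design lets someone else reproduce the same split later.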

Task: Building model(s)

Modeling is what many people imagine to be the whole job of the data miner, but it’s just one task of dozens! Nonetheless, modeling to address specific business goals is the heart of the data-mining profession.

Deliverables for this task include three items:

  • Parameter settings: When building models, most tools give you the option of adjusting a variety of settings, and these settings have an impact on the structure of the final model. Document these settings in a report.

  • Model descriptions: Describe your models. State the type of model (such as linear regression or neural network) and the variables used. Explain how the model is interpreted. Document any difficulties encountered in the modeling process.

  • Models: This deliverable is the models themselves. Some model types can be easily defined with a simple equation; others are far too complex and must be transmitted in a more sophisticated format.
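To make the idea of a model that can be "defined with a simple equation" concrete, here is a small sketch that fits a straight line by ordinary least squares and writes out the resulting equation along with a short model description. The variable names `ad_spend` and `sales`, the data values, and the `fit_line` helper are hypothetical, invented for this illustration.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data, invented for illustration only.
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(ad_spend, sales)

# A minimal model description: model type, variables used, and the equation.
description = {
    "type": "linear regression",
    "target": "sales",
    "inputs": ["ad_spend"],
    "equation": f"sales = {intercept:.2f} + {slope:.2f} * ad_spend",
}
print(description["equation"])  # sales = 1.00 + 2.00 * ad_spend
```

A model this small can be handed over as the equation itself. More complex model types usually need to be saved in whatever format your tool provides.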

Task: Assessing model(s)

Now you will review the models that you’ve created, from a technical standpoint and also from a business standpoint (often with input from business experts on your project team).

Deliverables for this task include two reports:

  • Model assessment: Summarizes the information developed in your model review. If you have created several models, you may rank them based on your assessment of their value for a specific application.

  • Revised parameter settings: You may choose to fine-tune the settings that were used to build the model, then conduct another round of modeling to try to improve your results.
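The technical side of a model assessment can be as simple as scoring each candidate on the same test cases and ranking the results. The sketch below assumes three hypothetical rule-based models and a tiny invented dataset. Accuracy is used here only because it is easy to read; your business goals may call for a different measure.

```python
def accuracy(model, cases):
    """Return the fraction of cases for which the model predicts the right label."""
    correct = sum(1 for features, label in cases if model(features) == label)
    return correct / len(cases)


# Hypothetical test cases: (features, did the customer buy?).
test_cases = [
    ({"visits": 1}, False),
    ({"visits": 2}, False),
    ({"visits": 4}, True),
    ({"visits": 5}, True),
    ({"visits": 6}, True),
    ({"visits": 3}, False),
    ({"visits": 7}, True),
    ({"visits": 2}, False),
]

# Hypothetical candidate models, each a simple rule.
models = {
    "always_no": lambda f: False,
    "visits_over_3": lambda f: f["visits"] > 3,
    "visits_over_5": lambda f: f["visits"] > 5,
}

scores = {name: accuracy(model, test_cases) for name, model in models.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, scores[name])
# visits_over_3 1.0
# visits_over_5 0.75
# always_no 0.5
```

A ranking like this covers only the technical standpoint. The business review still has to decide whether the top-ranked model is useful for the application at hand.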

Data mining, like an onion, a Dobos torte, or a sedimentary rock, has lots of layers. When you are just getting started in data mining, you can start by leaving parameter settings at their default values (in fact, you might not even notice options unless you make an effort to look for them).

As you get comfortable in your new data-mining career, it will make sense for you to find out about model parameters and know how you can use them. Your options will vary widely with the type of model and specific tool that you are using.
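As one illustration of what adjusting a parameter can look like, the sketch below tries several values of `k` in a tiny one-variable nearest-neighbor classifier and compares the results on test data. The nearest-neighbor technique, the `knn_predict` helper, and the data are assumptions picked for this example. Your own tool will expose different settings under different names.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest training cases."""
    neighbors = sorted(train, key=lambda case: abs(case[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Return the fraction of test cases predicted correctly with this k."""
    correct = sum(1 for x, label in test if knn_predict(train, x, k) == label)
    return correct / len(test)


# Hypothetical data: (value, label). The "yes" at 3.0 is a deliberately noisy case.
train = [(1.0, "no"), (2.0, "no"), (3.0, "yes"), (4.0, "no"),
         (6.0, "yes"), (7.0, "yes"), (8.0, "yes"), (9.0, "yes")]
test = [(2.6, "no"), (3.4, "no"), (5.5, "yes"), (7.5, "yes")]

# Try several settings and record each one with its score.
scores = {k: accuracy(train, test, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # on a tie, the first (smallest) k wins

print(scores)   # {1: 0.5, 3: 1.0, 5: 1.0}
print(best_k)   # 3
```

With `k=1` the model follows the single noisy training case and misses two test cases, while larger values smooth it out. Because the setting was chosen using the test data, holdout data is the fairer place to confirm the final choice.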