Basics of the Predictive Analytics Data-Classification Process

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

At a brass-tacks level, data classification in predictive analytics consists of two stages: the learning stage and the prediction stage. The learning stage entails training the classification model by running a designated set of past data through the classifier. The goal is to teach your model to discover the hidden relationships and rules (the classification rules) in historical (training) data. The model does so by employing a classification algorithm.

The prediction stage, which follows the learning stage, consists of having the model predict class labels or numerical values for data it has not seen before (that is, test data).
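The two stages can be sketched in a few lines of Python. This is a minimal illustration only, not a recommended classifier: it assumes a simple nearest-centroid rule as the classification algorithm, and the feature names (past purchases, interest score), the labels, and the sample values are all hypothetical.

```python
from statistics import mean


def train(rows, labels):
    """Learning stage: compute one centroid (mean feature vector) per class."""
    by_class = {}
    for features, label in zip(rows, labels):
        by_class.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*members)]
        for label, members in by_class.items()
    }


def predict(model, features):
    """Prediction stage: assign the class whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical training data: [past_purchases, interest_score]
training_rows = [[5, 0.9], [4, 0.8], [0, 0.1], [1, 0.2]]
training_labels = ["buyer", "buyer", "non-buyer", "non-buyer"]

model = train(training_rows, training_labels)  # learning stage
new_customer = [3, 0.7]                        # data the model has not seen
predicted = predict(model, new_customer)       # prediction stage
print(predicted)
```

Here the "classification rules" are just the two centroids; a real algorithm (a decision tree, for example) would learn richer rules, but the train-then-predict shape stays the same.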

To illustrate these stages, suppose you’re the owner of an online store that sells watches. You’ve owned the online store for quite a while, and have gathered a lot of transactional data and personal data about customers who purchased watches from your store. Suppose you’ve been capturing that data through your site by providing web forms, in addition to the transactional data you’ve gathered through operations.

You could also purchase data from a third party that provides you with information about your customers outside their interest in watches. That’s not as hard as it sounds; there are companies whose business model is to track customers online and collect and sell valuable information about them.

Most of those third-party companies gather data from social media sites and apply data-mining methods to discover the relationship of individual users with products. In this case, as the owner of a watch shop, you’d be interested in the relationship between customers and their interest in buying watches.

You can infer this type of information by analyzing, for example, a customer’s social network profile or a microblog comment of the sort you find on Twitter.

To measure an individual’s level of interest in watches, you could apply any of several text-analytics tools that can discover such correlations in an individual’s written text (social network statuses, tweets, blog postings, and such) or online activity (such as online social interactions, photo uploads, and searches).
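As a rough sketch of what such a measurement might look like, the snippet below scores a customer by the fraction of their posts that mention a watch-related term. The keyword list, the function name, and the sample posts are assumptions made for illustration; real text-analytics tools rely on far more sophisticated methods than keyword matching.

```python
import re

# Hypothetical keyword list; a real tool would learn relevant terms from data.
WATCH_TERMS = {"watch", "watches", "chronograph", "wristwatch", "timepiece"}


def interest_score(posts):
    """Return the fraction of posts that mention at least one watch term."""
    if not posts:
        return 0.0
    hits = 0
    for post in posts:
        words = set(re.findall(r"[a-z]+", post.lower()))
        if words & WATCH_TERMS:
            hits += 1
    return hits / len(posts)


# Hypothetical posts from one customer
posts = [
    "Just bought a new chronograph, love it!",
    "Great weather for a run today.",
    "Which watches hold their value best?",
    "Coffee first, then work.",
]
score = interest_score(posts)
print(score)
```

Note that plain keyword matching cannot tell "a new watch" from "watch a movie", which is one reason a score like this is only a crude signal.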

After you collect all that data about your customers’ past transactions and current interests — the training data that shows your model what to look for — you’ll need to organize it into a structure that makes it easy to access and use (such as a database).
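One possible way to organize that data is a small relational table. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table name, column names, and rows are hypothetical and stand in for whatever your store actually records.

```python
import sqlite3

# In-memory database for illustration; a real store would use a file or server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        customer_id    INTEGER PRIMARY KEY,
        past_purchases INTEGER NOT NULL,  -- from transactional data
        interest_score REAL    NOT NULL,  -- from text analysis or a third party
        bought_again   INTEGER NOT NULL   -- class label: 1 = yes, 0 = no
    )
    """
)

# Hypothetical customer records
rows = [
    (1, 5, 0.9, 1),
    (2, 0, 0.1, 0),
    (3, 4, 0.8, 1),
    (4, 1, 0.2, 0),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Pull the features and the label back out as training data.
training = conn.execute(
    "SELECT past_purchases, interest_score, bought_again "
    "FROM customers ORDER BY customer_id"
).fetchall()
print(training)
```

Keeping the features and the class label in one table makes it straightforward to select one subset of rows for training and a different subset for testing.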

Once the model has been trained on that data, you’ve reached the second phase of data classification: the prediction stage, which is all about testing your model and the accuracy of the classification rules it has generated. For that purpose, you’ll need additional historical customer data, referred to as test data (which is different from the training data).

You feed this test data into your model and measure the accuracy of the resulting predictions. You count the times that the model correctly predicted the behavior of the customers represented in your test data. You also count the times that the model made wrong predictions.
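The counting described above amounts to comparing each prediction with the known outcome. A minimal sketch, assuming the predictions and the actual outcomes are available as two parallel lists (the labels and values here are hypothetical):

```python
def score_predictions(predicted, actual):
    """Count correct and wrong predictions and return the accuracy."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    wrong = len(actual) - correct
    accuracy = correct / len(actual) if actual else 0.0
    return correct, wrong, accuracy


# Hypothetical model output and known outcomes for five test customers
predicted_labels = ["buyer", "buyer", "non-buyer", "buyer", "non-buyer"]
actual_labels = ["buyer", "non-buyer", "non-buyer", "buyer", "non-buyer"]

correct, wrong, accuracy = score_predictions(predicted_labels, actual_labels)
print(correct, wrong, accuracy)
```

Accuracy here is simply the number of correct predictions divided by the total number of test cases.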

At this point, there are only two possible outcomes, depending on whether you’re satisfied with the accuracy of the model:

  • If you’re satisfied, then you can start getting your model ready to make predictions as part of a production system.

  • If you’re not satisfied with the accuracy of the predictions, then you’ll need to retrain your model with a new training dataset.
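The two outcomes above can be expressed as a simple check against a target accuracy. The threshold of 0.85 and the function name are assumptions chosen for illustration; what counts as "satisfied" depends on your business and on the cost of a wrong prediction.

```python
def next_step(accuracy, threshold=0.85):
    """Decide what to do after testing; the threshold is a hypothetical target."""
    if accuracy >= threshold:
        return "deploy"   # prepare the model for the production system
    return "retrain"      # gather a new training dataset and train again


print(next_step(0.90))
print(next_step(0.60))
```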

If your original training data was not representative enough of the pool of your customers — or contained noisy data that threw off the model’s results by introducing false signals — then there’s more work to do to get your model up and running. Either outcome is useful in its way.