Understanding Data Classification and Its Role in Predictive Analytics
Data mining is a necessary part of predictive analytics. In data mining, data classification is the process of labeling a data item as belonging to a class or category. In the data-mining vocabulary, a data item is also referred to as a data object, an observation, or an instance.
Data clustering is different from data classification:
- Data clustering is used to describe data by extracting meaningful groupings or categories from a body of data that contains similar elements.
- Data classification is used to predict the category or the grouping that a new and incoming data object belongs to.
You can use data classification to predict the category of new data elements on the basis of the groupings discovered by a data clustering process. Keep reading to find out how to use data classification to solve a practical problem.
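The relationship between the two steps can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the spending figures are made up, `cluster_1d` is a hypothetical helper that runs a tiny one-dimensional k-means, and `classify` simply assigns a new value to the nearest discovered group.

```python
from statistics import mean

def cluster_1d(values, k=2, iterations=10):
    """Tiny 1-D k-means (requires k >= 2): describe data by grouping similar values."""
    ordered = sorted(values)
    # Illustrative starting point: spread the initial centroids across the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids

def classify(value, centroids):
    """Predict which discovered group a new value belongs to."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))

# Made-up monthly spending figures for existing customers.
monthly_spend = [12, 15, 14, 10, 95, 102, 99, 110]

centroids = cluster_1d(monthly_spend)        # clustering: discover the groupings
new_customer_group = classify(90, centroids) # classification: label a new data object
print(centroids, new_customer_group)
```

Clustering runs once over the existing data to find the groups; classification is then applied to each new, incoming data object.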
Loan approval can serve as an everyday example of data classification. The loan officer needs to analyze each loan application to decide whether the applicant will be granted or denied a loan.
One way to make such a critical decision is to use a classifier to assist with the decision-making process. In essence, the classifier is simply an algorithm that contains instructions that tell a computer how to analyze the information mentioned in the loan application, and how to reference other (outside) sources of information on the applicant. The classifier can then label the loan application with a category such as “safe,” “too risky,” or “safe with conditions,” assuming that these exact categories are known and labeled in the historical data.
By removing most of the decision process from the hands of the loan officer or underwriter, the model reduces the human work effort and the company’s portfolio risk. This increases return on investment (ROI) by allowing employees to originate more loans and/or get better pricing if the company decides to resell the higher-quality loans.
If the classifier comes back with a result that labels an applicant as “safe with conditions,” then the loan processor or officer can request that the applicant fulfill the conditions in order to get the loan. If the conditions are met, then the new data can be run through the classifier again for approval. Using machine learning, the loan application classifier will learn from past applications, leverage current information mentioned in the application, and predict the future behavior of the loan applicant.
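The loan workflow above can be sketched as follows. Everything specific here is an assumption made for illustration: the two features (credit score and debt-to-income ratio), the historical records, and the choice of a nearest-centroid classifier, which is only one of many techniques that could fill this role.

```python
from math import dist
from statistics import mean

# Hypothetical labeled history: ((credit_score, debt_to_income), label).
history = [
    ((760, 0.20), "safe"),
    ((720, 0.25), "safe"),
    ((800, 0.15), "safe"),
    ((680, 0.38), "safe with conditions"),
    ((660, 0.42), "safe with conditions"),
    ((690, 0.35), "safe with conditions"),
    ((560, 0.55), "too risky"),
    ((600, 0.60), "too risky"),
    ((540, 0.65), "too risky"),
]

def scale(application):
    """Put both features on a comparable 0-1 scale (850 is the assumed top score)."""
    score, dti = application
    return (score / 850, dti)

def train(records):
    """Learn from past applications: one average (centroid) point per label."""
    by_label = {}
    for features, label in records:
        by_label.setdefault(label, []).append(scale(features))
    return {label: tuple(mean(col) for col in zip(*rows))
            for label, rows in by_label.items()}

def classify(application, centroids):
    """Label a new application with the category whose centroid is closest."""
    point = scale(application)
    return min(centroids, key=lambda label: dist(point, centroids[label]))

model = train(history)
first_pass = classify((670, 0.40), model)   # initial application
second_pass = classify((740, 0.22), model)  # re-run after the conditions are met
print(first_pass, "->", second_pass)
```

The second call mirrors the step where the applicant fulfills the conditions and the updated data is run through the classifier again.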
As ever, predicting the future is about learning from the past and evaluating the present. Data classification relies on both past and current information, and it speeds up decision making by processing both quickly.
To illustrate the use of classifiers in marketing, consider a marketer who has been assigned to design a smart marketing strategy for a specific product. Understanding the customers’ demographics drives the design of an effective marketing strategy. Ultimately it helps the company select suitable products to advertise to the customers most likely to buy them.
For instance, one of the criteria you can use to select targeted customers is specific geographical location. You may have an unknown store (or a store that isn’t known for selling a particular product — say, a housewares store that could start selling a new food processor) and you want to start a marketing campaign for the new product line.
Using data you collected or bought from a marketing agency, you can build your classifier. You can design a classifier that anticipates whether customers will buy the new product. For each customer profile, the classifier predicts a category that fits each product line you run through it, labeling the customer as (say) “interested,” “not interested,” or “unknown.”
Using the analysis produced by the classifier, you can easily identify the geographical locations that have the most customers who fit the “interested” category. Your model might discover, for example, that the population of San Francisco includes a large number of customers who have purchased a product similar to what you have for sale.
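Counting “interested” customers by location can be sketched as shown here. The customer profiles are invented, and `predict_interest` is a hand-written rule standing in for a trained classifier; a real model would learn its decision from the collected or purchased data.

```python
from collections import Counter

def predict_interest(profile):
    """Stand-in for a trained classifier (a hand-written rule, for illustration only)."""
    purchases = profile.get("purchases")
    if purchases is None:
        return "unknown"            # no purchase history to judge from
    # Assumption: having bought a blender signals interest in a food processor.
    return "interested" if "blender" in purchases else "not interested"

# Made-up customer profiles.
customers = [
    {"city": "San Francisco", "purchases": ["blender", "toaster"]},
    {"city": "San Francisco", "purchases": ["blender"]},
    {"city": "San Francisco", "purchases": ["lamp"]},
    {"city": "Denver", "purchases": ["blender"]},
    {"city": "Denver", "purchases": None},
    {"city": "Austin", "purchases": ["rug"]},
]

# Tally only the customers labeled "interested", grouped by city.
interested_by_city = Counter(
    c["city"] for c in customers if predict_interest(c) == "interested"
)
top_city, top_count = interested_by_city.most_common(1)[0]
print(top_city, top_count)
```

The same tally works unchanged once the placeholder rule is swapped for a model trained on real customer data.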
You jump at this chance to take action on the insight your model just presented to you. You may send an advertisement for that cool new gadget to those customers — and only to them.
To limit operating and marketing costs, you must avoid contacting uninterested customers; the wasted effort would reduce your ROI. For that matter, too much unnecessary contact with customers can dilute the value of your marketing campaigns and increase customer fatigue until your solicitations seem more like a nuisance. You don’t want your glossy flyers to land immediately in the garbage can or your e-mails to end up in the spam folder.
As a marketer, you might want to use data about potential customers’ profiles that has been collected from different sources or provided by a third party. Such sources include social media and databases of historical online transactions by customers.
In the medical field, a classifier can help a physician settle on the most suitable treatment for a given patient. It can be designed to analyze the patient data, learn from it, and classify the patient as belonging to a category of similar patients. The classifier can then recommend the same treatment that helped similar patients in that category in the past.
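One common way to implement “find similar past cases and reuse what worked” is a k-nearest-neighbors vote, sketched below. The two numeric features, the records, and the treatment names are all made up for illustration, and the source does not say which algorithm such a system would actually use.

```python
from collections import Counter
from math import dist

# Made-up records: ((feature_1, feature_2), treatment that helped).
past_patients = [
    ((0.20, 0.10), "treatment A"),
    ((0.30, 0.20), "treatment A"),
    ((0.25, 0.15), "treatment A"),
    ((0.80, 0.90), "treatment B"),
    ((0.70, 0.80), "treatment B"),
    ((0.90, 0.85), "treatment B"),
]

def suggest_treatment(patient, records, k=3):
    """Return the treatment that helped most of the k most similar past patients."""
    nearest = sorted(records, key=lambda record: dist(patient, record[0]))[:k]
    votes = Counter(treatment for _, treatment in nearest)
    return votes.most_common(1)[0][0]

suggestion = suggest_treatment((0.28, 0.18), past_patients)
print(suggestion)
```

The output is a suggestion for the physician to weigh, in keeping with the assistive role described above.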
As in the examples previously described, the classifier predicts a label or class category for the input, using both past and current data. In the case of healthcare, the predictive model can use more data, more quickly, to help the physician arrive at an effective treatment.
To help physicians prescribe individualized medicine, the classifier would assist them in determining the specific stage of a patient’s disease. Hypothetically, the data (say, genetic analysis from a blood sample) could be fed to a trained classifier that could label the stage of a new patient’s illness. In the case of cancer (for example), the classifier could use labels such as “healthy,” “benign,” “early stage,” “metastatic,” or “terminal” to describe the classes or groupings.
Future uses of classifiers promise to be even more ambitious. Suppose you want to predict how much a customer will spend on a specific date. In such a case, you design a model that predicts numerical values rather than category names; strictly speaking, this is a regression model rather than a classifier. Such numeric predictors can be developed using not only statistical methods such as regression but also other data-mining techniques such as neural networks. Given sufficiently sophisticated designs, predictive models of this kind can also be applied in fields such as presidential election forecasting, national security, and climate change research.
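Predicting a number instead of a category label can be sketched with a least-squares line. The data is invented and deliberately tidy, the single input feature (number of previous visits) is an assumption, and `fit_line` is a hypothetical helper written for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Made-up history: previous visits versus amount spent on the next visit.
visits = [1, 2, 3, 4]
spend = [15, 25, 35, 45]

slope, intercept = fit_line(visits, spend)
predicted_spend = slope * 5 + intercept   # numeric prediction for a customer with 5 visits
print(slope, intercept, predicted_spend)
```

Unlike the earlier examples, the output is a quantity on a continuous scale rather than one of a fixed set of labels.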