Classification Algorithms Used in Data Science

By Lillian Pierson

With classification algorithms, you take an existing dataset and use what you know about it to generate a predictive model for use in classification of future data points. If your goal is to use your dataset and its known subsets to build a model for predicting the categorization of future data points, you’ll want to use classification algorithms.

When implementing supervised classification, you should already know your data’s subsets — these subsets are called categories. Classification helps you see how well your data fits into the dataset’s predefined categories so that you can then build a predictive model for use in classifying future data points.

The figure illustrates how it looks to classify the World Bank’s Income and Education datasets according to the Continent category.

image0.jpg

You can see that, in some cases, the subsets you might identify with a clustering technique do correspond to the continents category, but in other cases, they don’t. For example, look at the one Asian country in the middle of the African data points. That’s Bhutan. You could use the data in this dataset to build a model that would predict a continent category for incoming data points.

But if you introduced a data point for a new country that showed statistics similar to those of Bhutan, then the new country could be categorized as being part of either the Asian continent or the African continent, depending on how you define your model.

Now imagine a situation in which your original data doesn’t include Bhutan, and you use the model to predict Bhutan’s continent as a new data point. In this scenario, the model would wrongly predict that Bhutan is part of the African continent.

This is an example of model overfitting — situations in which a model is so tightly fit to its underlying dataset, as well as the noise or random error inherent in that dataset, that the model performs poorly as a predictor for new data points.

To avoid overfitting your models, divide your data into a training set and a test set. A typical ratio is to assign 80 percent of the data into the training set and the remaining 20 percent into the test set. Build your model with the training set, and then use the test set to evaluate the model by pretending that the test-set data points are unknown. You can evaluate the accuracy of your model by comparing the categories assigned to these test-set data points by the model to the true categories.

Model overgeneralization can also be a problem. Overgeneralization is the opposite of overfitting: It happens when a data scientist tries to avoid ­misclassification due to overfitting by making a model extremely general. Models that are too general end up assigning every category a low degree of confidence.

To illustrate model overgeneralization, consider again the World Bank Income and Education datasets. If the model used the presence of Bhutan to cast doubt on every new data point in its nearby vicinity, then you end up with a wishy-washy model that treats all nearby points as African but with a low probability. This model would be a poor predictive performer.

A good metaphor for overfitting and overgeneralization can be illustrated through the well-known phrase, “If it walks like a duck and talks like a duck, then it’s a duck.” Overfitting would turn this phrase into, “It’s a duck if, and only if, it walks and quacks exactly in the ways that I have personally observed a duck to walk and quack. Since I’ve never observed the way an Australian spotted duck walks and quacks, an Australian spotted duck must not really be a duck at all.”

In contrast, overgeneralization would say, “If it moves around on two legs and emits any high-pitched, nasal sound, it’s a duck. Therefore, Fran Fine, Fran Drescher’s character in the ’90s American sitcom The Nanny must be a duck.”

Supervised machine learning — the fancy term for classification — is appropriate in situations where the following characteristics are true:

  • You know and understand the dataset you’re analyzing.

  • The subsets (categories) of your dataset are defined ahead of time and aren’t determined by the data.

  • You want to build a model that correlates the data within its predefined categories so that the model can help predict the categorization of future data points.

When performing classification, keep the following points in mind:

  • Model predictions are only as good as the model’s underlying data. In the World Bank data example, it could be the case that, if other factors such as life expectancy or energy use per capita were added to the model, its predictive strength might increase.

  • Model predictions are only as good as the categorization of the underlying dataset. For example, what do you do with countries like Russia that span two continents? Do you distinguish North Africa from sub-Saharan Africa? Do you lump North America in with Europe because they tend to share similar attributes? Do you consider Central America to be part of North America or South America?

There is a constant danger of overfitting and overgeneralization. A happy medium must be found between the two.