The Importance of Clustering and Classification in Data Science

By Lillian Pierson

The purpose of clustering and classification algorithms is to make sense of and extract value from large sets of structured and unstructured data. If you’re working with huge volumes of unstructured data, it only makes sense to try to partition the data into some sort of logical groupings before attempting to analyze it.

Clustering and classification allow you to take a sweeping glance at your data en masse, and then form some logical structures based on what you find there before going deeper into the nuts-and-bolts analysis.

In their simplest form, clusters are sets of data points that share similar attributes, and clustering algorithms are the methods that group these data points into different clusters based on their similarities. You’ll see clustering algorithms used for disease classification in medical science, but you’ll also see them used for customer classification in marketing research and for environmental health risk assessment in environmental engineering.

There are different clustering methods, depending on how you want your dataset to be divided. The two main types of clustering algorithms are

  • Hierarchical: Algorithms create separate sets of nested clusters, each at its own hierarchical level.

  • Partitional: Algorithms create just a single set of clusters.
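To make the distinction concrete, here is a minimal sketch of both families on a toy one-dimensional dataset. The data values, the choice of two clusters, and both bare-bones implementations are invented for illustration; real work would use a library implementation.

```python
def dist(a, b):
    """Distance between two 1-D points."""
    return abs(a - b)

# --- Partitional: a bare-bones k-means that produces one flat set of clusters ---
def kmeans_1d(points, k, iters=20):
    centroids = points[:k]  # naive initialization: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# --- Hierarchical: single-linkage agglomerative merging of nested clusters ---
def agglomerative_1d(points, k):
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > k:
        # Merge the two closest clusters (single linkage: minimum pairwise distance)
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(a, b)
                               for a in clusters[ij[0]]
                               for b in clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters

data = [1.0, 1.2, 1.1, 8.0, 8.3, 8.1]
print(kmeans_1d(data, 2))        # one flat partition into 2 clusters
print(agglomerative_1d(data, 2)) # the 2-cluster level of the merge hierarchy
```

On this toy data both methods recover the same two groupings, but the agglomerative version builds them by repeatedly merging nested clusters, while k-means partitions all points in a single flat pass.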

You can use hierarchical clustering algorithms only if you can compute the separation distance between the data points in your dataset. Note that the k-nearest neighbor algorithm described in this chapter relies on the same notion of distance between points, but it is a classification method rather than a clustering algorithm.

You might have heard of classification and thought that it’s the same thing as clustering. Many people do, but this is not the case. In classification, before you start, you already know the number of classes into which your data should be grouped and you already know what class you want each data point to be assigned to. In classification, the data in the dataset being learned from is labeled.
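A minimal sketch of learning from labeled data is a 1-nearest-neighbor classifier: every training point carries a known class label, and a new point simply takes the label of its closest labeled neighbor. The feature values and the "small"/"large" labels below are made up for illustration.

```python
# Labeled training data: each point already has a known class.
labeled = [
    (1.0, "small"), (1.2, "small"), (1.1, "small"),
    (8.0, "large"), (8.3, "large"), (8.1, "large"),
]

def classify(x):
    """Assign x the label of its nearest labeled neighbor."""
    nearest = min(labeled, key=lambda point: abs(point[0] - x))
    return nearest[1]

print(classify(1.4))  # -> small
print(classify(7.5))  # -> large
```

Notice that the classes ("small" and "large") and their number were fixed before any learning happened; that up-front knowledge is exactly what separates classification from clustering.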

When you use clustering algorithms, on the other hand, you have no predefined concept for how many clusters are appropriate for your data, and you rely upon the clustering algorithms to sort and cluster the data in the most appropriate way. With clustering techniques, you’re learning from unlabeled data.

To better illustrate the nature of classification, though, take a look at Twitter and its hashtagging system. Say you just got hold of your favorite drink in the entire world: an iced caramel latte from Starbucks. You’re so happy to have your drink that you decide to tweet about it with a photo and the phrase “This is the best latte EVER! #StarbucksRocks.” Well, of course, you include “#StarbucksRocks” in your tweet so that the tweet goes into the #StarbucksRocks stream and is grouped together with all the other tweets labeled #StarbucksRocks. Your use of the hashtag label in your tweet told Twitter how to classify your data into a recognizable and accessible class.