Clustering Algorithms Used in Data Science

By Lillian Pierson

You use clustering algorithms to subdivide your datasets into clusters of data points that are most similar for a predefined attribute. If you have a dataset that describes multiple attributes about a particular feature and want to group your data points according to their attribute similarities, then use clustering algorithms.

A simple scatter plot of Country Income and Education datasets yields the chart you see here.

image0.jpg

In unsupervised clustering, you start with this data and then proceed to divide it into subsets. These subsets are called clusters and are comprised of data points that are most similar to one another. It appears that there are at least two clusters, probably three — one at the bottom with low income and education, and then the high education countries look like they might be split between low and high income.

The following figure shows the result of eyeballing — making a visual estimate of — clusters in this dataset.

image1.jpg

Although you can generate visual estimates of clustering, you can achieve much more accurate results when dealing with much larger datasets by using algorithms to generate clusters for you. Visual estimation is a rough method that’s only useful on smaller datasets of minimal complexity. Algorithms ­produce exact, repeatable results, and you can use algorithms to generate clustering for multiple dimensions of data within your dataset.

Clustering algorithms are one type of approach in unsupervised machine learning — other approaches include Markov methods and methods for dimension reduction. Clustering algorithms are appropriate in situations where the following characteristics are true:

  • You know and understand the dataset you’re analyzing.

  • Before running the clustering algorithm, you don’t have an exact idea as to the nature of the subsets (clusters). Often, you won’t even know how many subsets there are in the dataset before you run the algorithm.

  • The subsets (clusters) are determined by only the one dataset you’re analyzing.

  • Your goal is to determine a model that describes the subsets in a single dataset and only this dataset.

If you add more data, you should rerun the analysis from scratch to get complete and accurate model results.