Data Clustering with the k-Means Algorithm

By Lillian Pierson

You generally deploy the k-means algorithm to subdivide the data points of a dataset into clusters based on nearest mean values. The algorithm seeks the division of your data points into k clusters such that each point lies as close as possible to the mean of its assigned cluster.

In the term k-means, k denotes the number of clusters in the data. Because the k-means algorithm doesn't determine this number for you, you're required to specify it up front. The quality of the clusters depends heavily on the correctness of the k value you specify. If your data is two- or three-dimensional, a plausible range of k values may be visually determinable.

In the scatter plot of the World Bank Income and Education data shown below, for example, a visual estimate of the clustering suggests 3 clusters, or k = 3.

[Figure: Scatter plot of the World Bank Income and Education data, showing three visually apparent clusters.]

If your dataset has more than three dimensions, however, you can use computational methods to generate a good value for k. One such method is the silhouette coefficient, which, for each point, compares the average distance to the other points in its own cluster with the average distance to the points in the nearest neighboring cluster. Luckily, because the k-means algorithm is so computationally efficient, you can easily run it, and calculate this coefficient, for a wide range of k values and then pick the k that scores best.
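As a concrete illustration, here's a minimal sketch of scoring candidate k values with the silhouette coefficient. It uses scikit-learn, which this article doesn't cover but which provides both a k-means implementation and a ready-made silhouette_score function; the synthetic blob data simply stands in for a real dataset.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data: 300 points drawn from three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Score each candidate k; the silhouette needs at least 2 clusters.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)  # averaged over all points; closer to +1 is better
    print(f"k={k}: mean silhouette = {score:.3f}")

For data like this, the score typically peaks at k = 3, matching what you'd see by eyeballing the plot.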

The k-means algorithm works by placing candidate cluster centers on an n-dimensional plot and then evaluating whether moving any one of them in some direction would yield a center with higher density, that is, with more data points closer to it.

The centers move from regions of lower density toward regions of higher density until every center sits in a region of locally maximum density: a true cluster center, the position that puts the maximum number of the cluster's points closest to it.
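The description above is informal. In the standard formulation (often called Lloyd's algorithm), the same effect is achieved by alternating two steps: assign every point to its nearest center, then move each center to the mean of its assigned points. Here's a minimal NumPy sketch of that loop; the function name, tolerance, and empty-cluster handling are illustrative assumptions, not a reference implementation.

import numpy as np

def kmeans_sketch(X, k, n_iters=100, tol=1e-6, seed=None):
    """Minimal k-means (Lloyd's algorithm): alternate assignment and update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points,
        # leaving a center in place if no points were assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Stop once the centers have effectively stopped moving.
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels

Each update step can only lower the total within-cluster distance, which is why the centers settle into a region of locally maximum density rather than wandering indefinitely.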

Whenever possible, you should try to place the initial centers yourself. If that's not possible, place them randomly and run the algorithm several times to see whether you consistently end up with the same clusters.
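Here's a sketch of that random-restart approach using SciPy, one of the two libraries recommended at the end of this article. Note that scipy.cluster.vq.kmeans already reruns itself iter times internally and keeps the best result; the explicit loop below just makes the run-to-run variation visible. The data and the number of restarts are illustrative, and the seed keyword assumes a reasonably recent SciPy version.

import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
# Illustrative data: three tight 2-D blobs.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

best = None
for run in range(10):
    # kmeans picks random initial centers from the data on each call;
    # distortion is the mean distance from the points to their centers.
    centers, distortion = kmeans(X, 3, seed=rng)
    print(f"run {run}: distortion = {distortion:.4f}")
    if best is None or distortion < best[1]:
        best = (centers, distortion)

centers, _ = best
labels, _ = vq(X, centers)  # assign each point to the best run's centers

If every run converges to essentially the same distortion and the same centers, you can be fairly confident in the clustering; wildly varying results are a hint that k is wrong or that the initialization keeps getting stuck.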

One weakness of the k-means algorithm is that it may produce incorrect results by placing cluster centers in areas of local minimum density. This happens when a center gets lost in a low-density region (a part of the plot with relatively few points in it) and the directional movement that's meant to increase point density instead starts to bounce and oscillate between faraway clusters.

In these cases, the center gets caught in a low-density space that sits between two zones of high point density, and the result is erroneous clusters built around centers that converge in areas of local minimum density. Ironically, this happens most often when the underlying data is very well clustered, with tight, dense regions separated by wide, sparse areas.
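You can reproduce this failure mode deliberately by handing the algorithm a bad set of starting centers and comparing the result against a sensible start. This sketch uses scipy.cluster.vq.kmeans2, which accepts explicit initial centers via minit='matrix'; the data and the starting positions are illustrative assumptions.

import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
# Three tight blobs separated by wide, sparse gaps: exactly the kind
# of data where a badly placed center can get stranded.
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(100, 2))
               for c in ([0, 0], [10, 0], [5, 9])])

# Bad start: two centers inside one blob, one stranded between blobs.
bad_init = np.array([[0.0, 0.0], [0.1, 0.1], [7.5, 4.5]])
centers_bad, labels_bad = kmeans2(X, bad_init, minit='matrix')

# Sensible start: one center near each blob.
good_init = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
centers_good, labels_good = kmeans2(X, good_init, minit='matrix')

def distortion(X, centers, labels):
    # Mean distance from each point to its assigned center.
    return np.mean(np.linalg.norm(X - centers[labels], axis=1))

print("bad init :", distortion(X, centers_bad, labels_bad))
print("good init:", distortion(X, centers_good, labels_good))

With the bad start, the stranded center converges between two blobs and serves both of them, while the other two centers split a single blob between themselves; the distortion comes out several times higher than with the sensible start.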

To try things out for yourself, you can get started clustering your data with the k-means method by using either R's cluster package or Python's SciPy library.
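As a starting point in Python, here's a minimal end-to-end sketch using SciPy's scipy.cluster.vq module. The whiten step rescales each feature to unit variance, which SciPy's documentation recommends before calling kmeans; the synthetic data is a stand-in for your own, and the seed keyword assumes a reasonably recent SciPy version.

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

rng = np.random.default_rng(7)
# Stand-in for a real dataset: 2-D points around three centers.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([1, 1], [6, 2], [3, 7])])

# Normalize each feature to unit variance, as SciPy recommends.
X_w = whiten(X)

# Run k-means with k = 3; kmeans returns the centers (the "codebook")
# and the mean distance from the points to their nearest center.
centers, distortion = kmeans(X_w, 3, seed=rng)

# Assign every point to its nearest center.
labels, _ = vq(X_w, centers)
print("distortion:", round(distortion, 4))
print("cluster sizes:", np.bincount(labels))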