How to Evaluate an Unsupervised Learning Model with K-Means

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

After you’ve chosen your number of clusters for predictive analytics and have set up the algorithm to populate the clusters, you have a predictive model. You can make predictions based on new incoming data by calling the predict function of the K-means instance and passing in an array of observations.
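The listings that follow assume you already have a fitted K-means model named kmeans. If you need to reproduce that setup, here is a minimal sketch using scikit-learn and the Iris dataset; the three-cluster choice and the variable names are assumptions, and because K-means numbers its clusters arbitrarily, the cluster indices you get may differ from the ones shown here.

>>> # a minimal setup sketch (assumed, not part of the original listing)
>>> # load the Iris measurements and fit a three-cluster K-means model
>>> from sklearn.datasets import load_iris
>>> from sklearn.cluster import KMeans
>>> iris = load_iris()
>>> kmeans = KMeans(n_clusters=3).fit(iris.data)

With a fitted model in hand, calling the predict function with a single observation looks like this: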

>>> # to call the predict function with a single observation
>>> kmeans.predict([[ 5.1, 3.5, 1.4, 0.2 ]])
array([1])

The predict function finds the cluster center that the observation is closest to and outputs the index of that cluster center in the cluster_centers_ array. Python arrays are indexed from 0 (that is, the first item is item 0). Observations closest to a given cluster center are grouped into that cluster.

In this example, the K-means algorithm predicts that the observation belongs to Cluster 1 (Setosa in this case) — an easy prediction because the Setosa class is linearly separable and far away from the other two classes.

Also, this example uses just the first observation from the dataset, to make the prediction verifiable and easy to explain. You can see that the attributes of the observation being predicted are very close to the second cluster center (kmeans.cluster_centers_[1]).

To see the cluster centers, type the following code:

>>> kmeans.cluster_centers_
array([[ 5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
       [ 5.006     , 3.418     , 1.464     , 0.244     ],
       [ 6.85      , 3.07368421, 5.74210526, 2.07105263]])
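Given those centers, you can reproduce the nearest-center logic by hand and confirm why the first observation lands in Cluster 1, assuming the same cluster numbering shown above. The following is an illustrative sketch; the NumPy calls and variable names are assumptions, not part of the original listing.

>>> # compute the distance from the observation to each cluster center
>>> # and pick the index of the smallest one (illustrative sketch)
>>> import numpy as np
>>> observation = np.array([ 5.1, 3.5, 1.4, 0.2 ])
>>> distances = np.linalg.norm(kmeans.cluster_centers_ - observation, axis=1)
>>> int(np.argmin(distances))
1

The smallest distance belongs to the center at index 1, which matches the prediction you saw earlier.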

To see the cluster labels that the K-means algorithm produces, type the following code:

>>> kmeans.labels_
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
  0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
  2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2,
  0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2,
  2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])

You can also use the predict function on a whole set of observations at once, as shown here:

>>> # to call the predict method with a set of data points
>>> kmeans.predict([[ 5.1, 3.5, 1.4, 0.2 ],
                    [ 5.9, 3.0, 5.1, 1.8 ]])
array([1, 0])

Although you know that the three-cluster solution is technically correct, don't be surprised if the two-cluster solution intuitively looks best. If you increase the number of clusters beyond three, your predictions' success rate starts to break down. With a little luck (and some educated guessing), you'll choose the best number of clusters.
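One common way to take some of the guesswork out of choosing the number of clusters is the elbow method: fit K-means for several values of k and watch how the total within-cluster distance (exposed by scikit-learn as the inertia_ attribute) falls as k grows. The loop below is a minimal sketch of that idea; the range of k values, the fixed random seed, and the variable names are assumptions.

>>> # a sketch of the elbow method (illustrative, assumed values)
>>> # fit K-means for k = 1 through 6 and print each model's inertia,
>>> # the sum of squared distances from points to their nearest center
>>> for k in range(1, 7):
...     model = KMeans(n_clusters=k, n_init=10, random_state=1)
...     model.fit(iris.data)
...     print(k, model.inertia_)

You would look for the "elbow" where the printed values stop dropping sharply; on the Iris measurements that typically happens after two or three clusters, which is consistent with the intuition that the two-cluster solution looks tempting even though three classes exist.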

Consider the process as mixing a little bit of art with science. Even the algorithm itself relies on randomness when it selects the initial data points that seed each cluster. So even if you're guessing, you're in good company.
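That randomness also means two runs of K-means can settle on slightly different clusterings. If you want repeatable results while you experiment, one option in scikit-learn is to fix the seed with the random_state parameter and let n_init run the algorithm from several random starts, keeping the best one. A brief sketch, reusing the Iris data loaded earlier and with assumed parameter values:

>>> # a sketch of making the clustering repeatable (assumed values)
>>> # random_state fixes the seed; n_init tries ten random starts
>>> # and keeps the run with the lowest inertia
>>> kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(iris.data)

Fixing the seed doesn't make the cluster numbering meaningful; it only makes the same numbering come back each time you run the code.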

Evaluating the performance of an algorithm requires a label that represents the expected value and a predicted value to compare it with. Remember that when you build an unsupervised learning model with a clustering algorithm, you don't know what the expected values are, and you don't give labels to the clustering algorithm.

The algorithm puts data points into clusters on the basis of which data points are similar to one another; dissimilar data points end up in different clusters. For the Iris dataset, K-means has no concept of the Setosa, Versicolor, or Virginica classes; it only knows that it's supposed to split the data into three clusters, and it labels them arbitrarily from 0 to 2.
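That said, the Iris dataset happens to ship with the true species labels, so after the fact you can still check how well the unlabeled clusters line up with the known classes. Here's a minimal sketch of one way to do that with pandas; the crosstab call and the row and column names are assumptions, not part of the original listing.

>>> # compare cluster assignments with the known species (illustrative)
>>> import pandas as pd
>>> pd.crosstab(iris.target, kmeans.labels_, rownames=['species'], colnames=['cluster'])

Each row of the resulting table shows how one species is spread across the clusters; because the cluster numbers are arbitrary, what matters is whether each species falls mostly into a single cluster.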

The purpose of unsupervised learning with clustering is to find meaningful relationships in the data, preferably where you could not have seen them otherwise. It’s up to you to decide whether those relationships are a good basis for an actionable insight.