How to Create and Run an Unsupervised Learning Model to Make Predictions with K-Means

The K-means algorithm requires one initialization parameter from the user in order to create an instance for predictive analytics: the number of clusters, K, to use to perform its work.

Sepal Length   Sepal Width   Petal Length   Petal Width   Target Class/Label
5.1            3.5           1.4            0.2           Setosa (0)
7.0            3.2           4.7            1.4           Versicolor (1)
6.3            3.3           6.0            2.5           Virginica (2)

Since you’re using the Iris dataset, you already know that it has three clusters, one for each class of the Iris flower (Setosa, Versicolor, and Virginica). In general, when you set up an unsupervised learning task with a clustering algorithm, you won’t know how many clusters to specify.

Some algorithms are available that try to determine the best number of clusters, but their results can be dubious. One such method iterates over a range of candidate cluster counts and then selects the count that best fits its mathematical criterion. This approach requires heavy computation, may take a long time, and still may not produce the best K (number of clusters).
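As a rough illustration of iterating over a range of cluster counts, here is a minimal sketch. It assumes scikit-learn's bundled load_iris data and uses the model's inertia_ value (the within-cluster sum of squared distances) as the criterion. This is one common choice, often called the elbow method, and not necessarily the method the text has in mind. The range 1 to 6 and n_init=10 are arbitrary choices made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled copy of the Iris dataset.
iris = load_iris()

# Fit one model per candidate K and record its inertia
# (within-cluster sum of squared distances to the nearest center).
candidate_ks = range(1, 7)  # arbitrary range chosen for illustration
inertias = []
for k in candidate_ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=111)
    model.fit(iris.data)
    inertias.append(model.inertia_)

# Print K next to its inertia; look for the K where the drop levels off.
for k, inertia in zip(candidate_ks, inertias):
    print(k, round(inertia, 1))
```

Inertia generally keeps shrinking as K grows, so the smallest value is not the answer. You look for the point where adding another cluster stops helping much, which is itself a judgment call.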

The best way to get immediate results is to make an educated guess about the number of clusters to use — basing your estimate on features present in the data (whether one or multiple features), or on some other knowledge of the data you may have from the business domain expert.

This falling back on guesswork (even educated guesswork) is a major limitation of the K-means clustering algorithm.

To create an instance of the K-means clustering algorithm and run the data through it, type the following code in the interpreter. (The code assumes that the Iris dataset is already loaded into a variable named iris.)

>>> from sklearn.cluster import KMeans
>>> kmeans = KMeans(n_clusters=3, random_state=111)
>>> kmeans.fit(iris.data)

The first line of code imports the KMeans class from scikit-learn's cluster module into the session. The second line creates the model and stores it in a variable named kmeans. The model is created with the number of clusters set to 3. The third line fits the model to the Iris data.
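The three lines above rely on an iris variable that already exists in the session. For a fresh session, a self-contained sketch might look like the following. It assumes the data comes from scikit-learn's load_iris function and sets n_init=10 explicitly, which is an added choice here so that the sketch behaves the same way across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: load the Iris data from scikit-learn's bundled datasets.
iris = load_iris()

# Create the model with three clusters and a fixed seed, then fit it.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=111)
kmeans.fit(iris.data)

# One center per cluster, one coordinate per feature.
print(kmeans.cluster_centers_.shape)

# Cluster label assigned to each of the first ten observations.
print(kmeans.labels_[:10])
```

Because of the explicit n_init setting and differences between library versions, the label numbers you get may not match the output shown in this article, even though the groupings should be similar.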

Fitting the model is the core part of the algorithm: it partitions the given dataset into three clusters by finding three cluster centers and assigning each observation to the nearest one. To see the clusters that the algorithm produces, type the following code.

>>> kmeans.labels_

The output should look similar to this:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
  0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
  2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2,
  0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2,
  2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])

This is how the K-means algorithm labels the data as belonging to clusters, without input from the user about the target values. Here the only thing K-means knew was what you provided it: the number of clusters. This result shows how the algorithm viewed the data, and what it learned about the relationships of data items to each other — hence the term unsupervised learning.

You can see right away that some of the data points were mislabeled. You know, from the Iris dataset, what the target values should be:

  • The first 50 observations should be labeled the same (as 1s in this case).

    This range is known as the Setosa class.

  • Observations 51 to 100 should be labeled the same (as 0s in this case).

    This range is known as the Versicolor class.

  • Observations 101 to 150 should be labeled the same (as 2s in this case).

    This range is known as the Virginica class.

It doesn't matter whether K-means labeled each set of 50 with a 0, 1, or 2. As long as all 50 observations in a set share the same label, K-means grouped that class correctly. It’s up to you to give each cluster a name and to find meaning in each cluster.
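One way to check the clusters against the known classes without caring which number each cluster received is to count how the observations of each class are spread across the clusters. The sketch below is an illustration under assumptions: it loads the data with load_iris, builds the count table by hand with NumPy, and adds scikit-learn's adjusted_rand_score, a measure that ignores how the labels are numbered.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

# Assumption: use scikit-learn's bundled copy of the Iris dataset.
iris = load_iris()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=111).fit(iris.data)

# Rows are the true classes (0=Setosa, 1=Versicolor, 2=Virginica);
# columns are the cluster numbers that K-means assigned.
table = np.zeros((3, 3), dtype=int)
for true_class, cluster in zip(iris.target, kmeans.labels_):
    table[true_class, cluster] += 1
print(table)

# Agreement score that does not depend on how clusters are numbered;
# 1.0 would mean the grouping matches the true classes exactly.
ari = adjusted_rand_score(iris.target, kmeans.labels_)
print(round(ari, 3))
```

A row with all 50 observations in a single column means that class landed in one cluster. Counts spread over two columns show where K-means mixed two classes.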

If you run the K-means algorithm again, it may assign a different number to each set of 50, but the meaning would be the same for each set (class).

The algorithm depends on randomness to initialize the cluster centers. You can create a K-means model that generates the same output each time by passing the random_state parameter with a fixed seed value to the KMeans constructor.

Providing a fixed seed value makes that randomness repeatable. Doing so essentially tells K-means to select the same initial data points to initialize the cluster centers every time you run the algorithm. If you remove the random_state parameter from the constructor call, you may get a different outcome from one run to the next.
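To see the effect of the seed, you can fit two models with the same random_state and compare their labels. This is a minimal sketch that assumes the data is loaded with load_iris. The seed value 111 is reused from the earlier code, and n_init=10 is an added choice for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled copy of the Iris dataset.
iris = load_iris()

# Two separate models built with the same seed.
first = KMeans(n_clusters=3, n_init=10, random_state=111).fit(iris.data)
second = KMeans(n_clusters=3, n_init=10, random_state=111).fit(iris.data)

# With an identical seed, both runs should assign identical labels.
same_labels = np.array_equal(first.labels_, second.labels_)
print(same_labels)

# Without a seed, the cluster numbers may come out in a different order.
unseeded = KMeans(n_clusters=3, n_init=10).fit(iris.data)
print(unseeded.labels_[:10])
```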
