Hadoop For Dummies
Book image
Explore Book Buy On Amazon

K is an input to the algorithm for predictive analysis; it stands for the number of groupings that the algorithm must extract from a dataset, expressed algebraically as k. A K-means algorithm divides a given dataset into k clusters. The algorithm performs the following operations:

  1. Pick k random items from the dataset and label them as cluster representatives.

  2. Associate each remaining item in the dataset with the nearest cluster representative, using a Euclidean distance calculated by a similarity function.

  3. Recalculate the new clusters’ representatives.

  4. Repeat Steps 2 and 3 until the clusters do not change.

A representative of a cluster is the mathematical mean (average) of all items that belong to the same cluster. This representative is also called a cluster centroid. For instance, consider three items from the fruits dataset where

Type 1 corresponds to bananas.
Type 2 corresponds to apples.
Color 2 corresponds to yellow.
Color 3 corresponds to green.

Assuming that these items are assigned to the same cluster, the centroid of these three items is calculated.

Item Feature#1 Type Feature#2 Color Feature#3 Weight (Ounces)
1 1 2 5.33
2 2 3 9.33
3 1 2 2.1

Here are the calculations of a cluster representative of three items that belong to the same cluster. The cluster representative is a vector of three attributes. Its attributes are the average of the attributes of the items in the cluster in question.

Item Feature#1 Type Feature#2 Color Feature#3 Weight (Ounces)
1 1 2 5.33
2 2 3 9.33
3 1 2 2.1
Cluster Representative (Centroid Vector) (1+2+1)/3=1.33 (2+3+2)/3=2.33 (5.33 + 9.33 +32.1)/3=3

The dataset shown next consists of seven customers’ ratings of two products, A and B. The ranking represents the number of points (between 0 and 10) that each customer has given to a product — the more points given, the higher the product is ranked.

Using a K-means algorithm and assuming that k is equal to 2, the dataset will be partitioned into two groups. The rest of the procedure looks like this:

  1. Pick two random items from the dataset and label them as cluster representatives.

    The following shows the initial step of selecting random centroids from which the K-means clustering process begins. The initial centroids are selected randomly from the data that you are about to analyze. In this case, you’re looking for two clusters, so two data items are randomly selected: Customers 1 and 5.

    At first, the clustering process builds two clusters around those two initial (randomly selected) cluster representatives. Then the cluster representatives are recalculated; the calculation is based on the items in each cluster.

    Customer ID Customer Ratings of Product A Customer Ratings of Product B
    1 2 2
    2 3 4
    3 6 8
    4 7 10
    5 10 14
    6 9 10
    7 7 9
  2. Inspect every other item (customer) and assign it to the cluster representative to which it is most similar.

    Use the Euclidean distance to calculate how similar an item is to a group of items:

    Similarity of Item I to Cluster X = sqrt {{{left( {{f_1} - {x_1}} right)}^2} + {{left( {{f_2} - {x_2}} right)}^2} + cdots + {{left( {{f_n} - {x_n}} right)}^2}}

    The values {f_1},;{f_2},; ldots ,;{f_n} are the numerical values of the features that describe the item in question. The values {x_1},;{x_2},; ldots ,;{x_n} are the features (mean values) of the cluster representative (centroid), assuming that each item has n features.

    For instance, consider the item called Customer 2 (3, 4): The customer’s rating for Product A was 3 and the rating for Product B was 4. The cluster representative feature is (2, 2). The similarity of Customer 2 to Cluster 1 is calculated as follows:

    Similarity of Item 2 to Cluster 1 = sqrt {{{left( {3 - 2} right)}^2} + {{left( {4 - 2} right)}^2}} = 2.23

    Here’s what the same process looks like with Cluster 2:

    Similarity of Item 2 to Cluster 2 = sqrt {{{left( {3 - 10} right)}^2} + {{left( {4 - 14} right)}^2}} = 12.20

    Comparing these results, you assign Item 2 (that is, Customer 2) to Cluster 1 because the numbers say Item 2 is more similar to Cluster 1.

  3. Apply the same similarity analysis to every other item in the dataset.

    Every time a new member joins a cluster, you must recalculate the cluster representative.

    This depicts the results of the first iteration of K-mean algorithm. Notice that k equals 2, so you’re looking for two clusters, which divides a set of customers into two meaningful groups. Every customer is analyzed separately and is assigned to one of the clusters on the basis of the customer's similarity to each of the current cluster representatives.

  4. Iterate the dataset again, going through every element; compute the similarity between each element and its current cluster representative.

    Notice that Customer 3 has moved from Cluster 1 to Cluster 2. This is because Customer 3’s distance to the cluster representative of Cluster 2 is closer than to the cluster representative of Cluster 1.

    Cluster Representative (Centroid Vector)
    Cluster 1 Customer ID#1 (2, 2)
    Cluster 2 Customer ID#5 (10,14)
    Iteration#1 Customer Cluster 1 Customer Cluster 2
    Customer to be examined Customer IDs belonging to Cluster 1 Cluster Representative Customer IDs belonging to Cluster 1 Cluster Representative
    1 (2, 2) 5 (10, 14)
    2 1, 2 (2.4, 3) 5 (10, 14)
    3 1, 2, 3 (3.6, 4.6) 5 (10, 14)
    4 1, 2, 3 (3.6, 4.6) 4, 5 (8.4, 12)
    6 1, 2, 3 (3.6, 4.6) 4, 5, 6 (8.6, 11.4)
    7 1, 2, 3 (3.6, 4.6) 4, 5, 6, 7 (8.2, 10.8)

Here is a second iteration of K-means algorithm on customer data. Each customer is being re-analyzed. Customer 2 is being assigned to Cluster 1 because Customer 2 is closer to the representative of Cluster 1 than Cluster 2. The same scenario applies to Customer 4. Notice that a cluster representative is being recalculated each time a new member is assigned to a cluster.

Iteration#2 Customer Cluster 1 Customer Cluster 2
Customer to be examined Customer IDs belonging to Cluster 1 Cluster Representative Customer IDs belonging to Cluster 2 Cluster Representative
1 1 (3.6, 4.6) 5 (8.2, 10.8)
2 1, 2 (5.2, 3) 5 (8.2, 10.8)
3 1, 2 (5.2, 3) 5,3 (7.8, 10.2)
4 1, 2 (5.2, 3) 4, 5.3 (7.8, 10.2)
6 1, 2 (5.2, 3) 4, 5, 6.3 (7.8, 10.2)
7 1, 2 (5.2, 3) 3, 4, 5, 6, 7 (7.8, 10.2)

About This Article

This article can be found in the category: