How to Cluster by Nearest Neighbors in Predictive Analysis - dummies

# How to Cluster by Nearest Neighbors in Predictive Analysis

Nearest Neighbors is a simple algorithm widely used in predictive analysis to cluster data by assigning an item to a cluster by determining what other items are most similar to it. A typical use of the Nearest Neighbors algorithm follows these steps:

1. Derive a similarity matrix from the items in the dataset.

This matrix, referred to as the distance matrix, will hold the similarity values for each and every item in the data set. (These values are elaborated in detail in the next example.)

2. With the matrix in place, compare each item in the dataset to every other item and compute the similarity value.

3. Using the distance matrix, examine every item to see whether the distance to its neighbors is less than a value that you have defined.

This value is called the threshold.

4. The algorithm puts each element in a separate cluster, analyzes the items, and decides which items are similar, and adds similar items to the same cluster.

5. The algorithm stops when all items have been examined.

Consider, a dataset of eight geographical locations where individuals live. The purpose is to divide these individuals into groups based on their geographical locations, as determined by the Global Positioning System.

This chart shows a simple dataset of individuals’ geographic data. Assume that all the data collected about these eight individuals was collected at a specific point in time.

Individual ID GPS – Geographical Longitude GPS – Geographical Latitude
1 2 10
2 2 5
3 8 4
4 5 8
5 7 5
6 6 4
7 1 2
8 4 9

As with K-means, the first pre-step is to calculate the similarity values for every pair of individuals. One way to calculate a similarity between two items is to determine the Euclidean distance. The similarity value between two points is calculated as shown earlier.

Similarity between Item A and Item B =

√ (fa,1 – fb,1) 2 + (fa,2 – fb,2) 2+ …+ (fa,n – fb,n) 2

Here fa,1 is the first feature of Item A, fa,2 is the second feature of Item A, and corresponding values labeled b represent the features of Item B. The variable n is the number of features. In this example, n is 2. For instance, the similarity between Item 1 and Item 2 is calculated as follows:

Similarity between Item 1 and Item 2 = √(2-2)2 +(10-5) 2 = 5

On the basis of this measurement of similarity between items, you can use the Nearest Neighbor algorithm to extract clusters from the dataset of geographical locations.

The first step is to place the individual whose ID is 1, longitude is 2, and latitude is 10 in cluster C1. Then go through all remaining individuals computing how similar each one is to the individual in C1.

If the similarity between Individual 1 and another Individual x is less than 4.5, then Individual x will join C1; otherwise you create a new cluster to accommodate Individual x.

The following shows the similarities and numerical relationships between Individuals 1 through 8. The similarity of these data elements is calculated as a Euclidean distance.

Individuals with similarity values closer to 0 have greater similarity. Half the matrix is not filled because the matrix is symmetric.

 Individual #1 Individual #2 Individual #3 Individual #4 Individual #5 Individual #6 Individual #7 Individual #8 Individual #1 0 5 6 3.6 7.07 7.21 8.062 2.23 Individual #2 0 6.8 4.24 5 4.12 3.16 4.47 Individual #3 0 5 1.41 1.41 7.28 6.40 Individual #4 0 3.31 4.12 7.21 1.41 Individual #5 0 1.41 6.70 5 Individual #6 0 5.38 5.38 Individual #7 0 7.61 Individual #8 0

You have now assigned Individual 1 to the first cluster (C1). The similarity between Individual 1 and Individual 2 is equal to 5, which is greater than the threshold value 4.5. A new cluster is generated — and Individual 2 belongs to it. At this stage, you have two clusters of one item each: C1 = {Individual 1} and C2 = {Individual 2}.

Moving the focus to Individual 3, you find that the similarity between Individual 3 and Individual 2 & 1 is larger than the threshold value 4.5. Thus you assign Individual 3 to a new cluster containing one item: C3 = {Individual 3}.

Moving to Individual 4, you calculate how similar Individual 4 is to Individuals 1, 2, and 3. The nearest (most similar) to Individual 4 happens to be Individual 1. The similarity between 4 and 1 is approximately 3.6, which is less than the threshold value 4.5.

Individual 4 joins Individual 1 in Cluster C1.

Next is to examine Individual 5 and calculate how similar it is to Individuals 1, 2, 3, and 4. The item nearest in distance (most similar) to Individual 5 is Individual 3. The similarity is √2, which is less than the threshold value of 4.5. Thus Individual 5 joins C3.

When you examine Individual 6 and calculate how similar it is to Individuals 1, 2, 3, 4, and 5, you discover that Individual 3 is nearest (most similar) to Individual 6. Thus Individual 6 joins C3.

When you examine Individual 7 and calculate how similar it is to Individuals 1, 2, 3, 4, 5, and 6, you find that the nearest (most similar) item to Individual 7 is Individual 2. Thus Individual 7 joins C2.

When you examine Individual 8, and calculate its similarity to Individuals 1, 2, 3, 4, and 5, you find that the nearest (most similar) item to Individual 8 is Individual 4. Thus Individual 8 joins C1.

The clusters constructed so far, containing items most similar to each other, are

C1 = {Individual 1, Individual 4, Individual 8}

C2 = {Individual 2, Individual 7}

C3 = {Individual 3, Individual 5, Individual 6}