Nearest Neighbor Data Analysis

By Lillian Pierson

At its core, the purpose of a nearest neighbor analysis is to search for and locate either a nearest point in space or nearest numerical value, depending on the attribute you use for the basis of comparison.

Since the nearest neighbor technique is a classification method, you can use it to do things as scientific as deducing the molecular structure of a vital human protein or uncovering key biological evolutionary relationships, and as business-driven as designing recommendation engines for e-commerce sites or building predictive models for consumer transactions. The applications are limitless.

A good analogy for the nearest neighbor analysis concept is illustrated in GPS technology. Imagine you’re in desperate need of a Starbucks iced latte, but you have no idea where the nearest Starbucks is located. What do you do? One easy solution is simply to ask your smartphone where the nearest Starbucks is located.

When you do that, the system looks for businesses named Starbucks within a reasonable proximity of your current location. After generating a results listing, the system reports back to you with the address of the Starbucks coffeehouse closest to your current location — the Starbucks that is your nearest neighbor, in other words.

As the term nearest neighbor implies, the primary purpose of a nearest ­neighbor analysis is to examine your dataset and find the data point that’s quantitatively most similar to your observation data point. Note that similarity comparisons can be based on any quantitative attribute, whether that be distance, age, income, weight, or anything else that can describe the data point you’re investigating. The simplest comparative attribute is distance.

In the above Starbucks analogy, the x, y, z coordinates of the Starbucks reported to you by your smartphone are the most similar to the x, y, z coordinates of your current location. In other words, its location is closest in actual physical distance. The quantitative attribute being compared is distance, your current location is the observation data point, and the reported Starbucks coffeehouse is the most similar feature.

Modern nearest neighbor analyses are almost always performed using computational algorithms. The nearest neighbor algorithm is known as a single-link algorithm — an algorithm that merges clusters if the clusters share at least one connective edge (a shared boundary line, in other words) between them.