Classification by k-Nearest Neighbors for Machine Learning

By Nikhil Abraham

Whether the machine learning problem is to guess a number or a class, the idea behind the learning strategy of the k-Nearest Neighbors (kNN) algorithm is always the same: the algorithm finds the observations most similar to the one you have to predict, and from those neighbors derives a good guess of the answer, either by averaging the neighboring values (for regression) or by picking the most frequent class among them (for classification).
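Both strategies can be sketched in a few lines of Python (a toy illustration, not a production implementation; the function name and data are invented for the example):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    # Compute Euclidean distances from the new point to every stored observation
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest observations
    nearest = np.argsort(distances)[:k]
    if task == "regression":
        # Average the neighboring values
        return float(np.mean(y_train[nearest]))
    # Pick the most frequent class among the neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy data: two well-separated groups on a line
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 11.0, 9.0, 50.0, 52.0, 48.0])

print(knn_predict(X, y_class, np.array([1.1]), k=3))                     # → 0
print(knn_predict(X, y_value, np.array([5.1]), k=3, task="regression"))  # → 50.0
```

Note that "training" here amounts to nothing more than keeping `X_train` and `y_train` around; all the work happens when a prediction is requested.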

The learning strategy in kNN is more like memorization: it's like remembering what the answer should be when a question has certain characteristics (based on circumstances or past examples), rather than really knowing the answer because you understand the question by means of specific classification rules. In a sense, kNN is often called a lazy algorithm because no real learning happens at training time, just data recording.

Being a lazy algorithm implies that kNN is quite fast at training but quite slow at predicting, because most of the searching and the calculations on the neighbors happen at prediction time. It also implies that the algorithm is memory-intensive: you have to keep the entire data set in memory, which limits possible applications when dealing with big data.

Ideally, kNN can make a difference when you're working on classification and have many labels to deal with (for instance, when a software agent posts a tag on a social network or proposes a selling recommendation). kNN easily deals with hundreds of labels, whereas many other learning algorithms have to fit a separate model for each label.
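The multilabel case falls out of the same voting idea: each label gets its own vote among the neighbors. A minimal sketch, assuming the labels are encoded as a binary indicator matrix with one column per tag (the function name and toy data are invented for the example):

```python
import numpy as np

def knn_multilabel(X_train, Y_train, x_new, k=3):
    # Y_train is a binary indicator matrix: one column per possible label
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    # Predict a label when more than half of the k neighbors carry it
    votes = Y_train[nearest].mean(axis=0)
    return (votes > 0.5).astype(int)

# Toy data: 4 observations, 3 possible tags per observation
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
Y = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 1]])

print(knn_multilabel(X, Y, np.array([0.05, 0.05]), k=3))  # → [1 0 1]
```

A single pass over the neighbors answers all the labels at once, which is why adding labels to a kNN classifier costs almost nothing compared with training one model per label.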

Usually, kNN works out the neighbors of an observation using a distance measure such as Euclidean distance (the most common choice) or Manhattan distance (which tends to work better when your data has many redundant features). No absolute rule exists for which distance measure is best to use; it depends on the implementation and on the problem at hand. You have to test each distance as a distinct hypothesis and verify by cross-validation which measure works better with the problem you're solving.