Similarity Metrics Used in Data Science

By Lillian Pierson

Both clustering and classification rely on calculating how similar or different two data points are. If your dataset is numeric (composed only of numeric fields and values) and can be portrayed on an n-dimensional plot, you can use various geometric metrics to measure distances across your multidimensional data.

An n-dimensional plot is a multidimensional scatter plot that you can use to plot n dimensions of data.

Some popular geometric metrics for calculating distances between data points are the Euclidean, Manhattan, and Minkowski distance metrics. These metrics are simply different geometric functions for modeling the distance between points. The Euclidean metric measures the straight-line distance between two points plotted in Euclidean space.

The Manhattan metric measures the distance between two points as the sum of the absolute values of the differences between the two points' Cartesian coordinates. The Minkowski distance metric is a generalization of the Euclidean and Manhattan distance metrics: its order parameter p yields the Manhattan metric when p = 1 and the Euclidean metric when p = 2. Quite often, these metrics can be used interchangeably, although they generally produce different distance values for the same pair of points.
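To make the relationship among the three metrics concrete, here is a minimal Python sketch. The function names and the sample points are illustrative assumptions, not part of any particular library; the point is only that the Manhattan and Euclidean metrics fall out of the Minkowski formula for p = 1 and p = 2.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be at least 1")
    # Sum the p-th powers of the absolute coordinate differences,
    # then take the p-th root of that sum.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan_distance(a, b):
    """Manhattan distance: the Minkowski distance with p = 1."""
    return minkowski_distance(a, b, 1)


def euclidean_distance(a, b):
    """Euclidean distance: the Minkowski distance with p = 2."""
    return minkowski_distance(a, b, 2)


# Hypothetical sample points, chosen only for illustration.
point_a = (1.0, 2.0)
point_b = (4.0, 6.0)

manhattan = manhattan_distance(point_a, point_b)  # |1 - 4| + |2 - 6| = 7
euclidean = euclidean_distance(point_a, point_b)  # sqrt(3**2 + 4**2) = 5
```

For the same pair of points, the Manhattan distance (7) is larger than the Euclidean distance (5), which shows that the metrics agree on what is being measured but not on the resulting value.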

If your data is numeric but non-plottable (such as curves instead of points), you can generate similarity scores based on the differences between data values instead of on the actual values themselves.
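One possible reading of this idea, sketched below in Python, is to compare the step-to-step changes of two curves instead of their raw values. This is an assumption made for illustration: the helper names, the use of first differences, and the sample curves are all hypothetical, and other difference-based scores would fit the description equally well.

```python
import math


def first_differences(values):
    """Return the change between each pair of consecutive values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def shape_distance(curve_a, curve_b):
    """Euclidean distance between the first differences of two curves.

    Assumes both curves are sampled at the same, equally spaced positions.
    """
    diffs_a = first_differences(curve_a)
    diffs_b = first_differences(curve_b)
    if len(diffs_a) != len(diffs_b):
        raise ValueError("curves must have the same number of samples")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(diffs_a, diffs_b)))


# Hypothetical sample curves: curve_b is curve_a shifted upward by 10,
# while curve_c follows a different shape.
curve_a = [1, 2, 4, 7]
curve_b = [11, 12, 14, 17]
curve_c = [1, 3, 3, 2]

same_shape = shape_distance(curve_a, curve_b)       # 0.0: identical changes
different_shape = shape_distance(curve_a, curve_c)  # sqrt(21), about 4.58
```

Under this sketch, two curves with the same shape score as identical even when their raw values are far apart, which is the sense in which the score depends on differences and not on the values themselves.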

Lastly, for non-numeric data, you can use metrics like the Jaccard distance metric, which is based on an index that compares the number of features two data points have in common with the total number of distinct features between them. To illustrate a Jaccard distance, consider the following two text strings: Saint Louis de Ha-ha, Quebec and St-Louis de Ha!Ha!, QC.

What features do these text strings have in common? And what features are different between them? The Jaccard metric generates a numerical index value that quantifies the similarity between text strings.
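The two strings above can be compared with a short Python sketch. What counts as a "feature" is a modeling choice; here it is assumed, purely for illustration, that the features are the lowercase alphabetic words in each string. The function names are hypothetical.

```python
import re


def word_features(text):
    """Split text into a set of lowercase alphabetic words (an assumed feature choice)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_index(features_a, features_b):
    """Shared features divided by all distinct features across both sets."""
    union = features_a | features_b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(features_a & features_b) / len(union)


features_a = word_features("Saint Louis de Ha-ha, Quebec")
features_b = word_features("St-Louis de Ha!Ha!, QC")

# Shared words: louis, de, ha (3). Distinct words overall: 7.
similarity = jaccard_index(features_a, features_b)  # 3 / 7, about 0.43
distance = 1.0 - similarity                         # 4 / 7, about 0.57
```

With word-level features, "Saint" versus "St" and "Quebec" versus "QC" count as entirely different. Choosing other features, such as pairs of adjacent characters, would give these abbreviations partial credit and produce a different index value.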