Data Science: Performing Hierarchical Clustering with Python

By John Paul Mueller, Luca Massaron

You can use Python to perform hierarchical clustering in data science. Whereas the K-means algorithm is concerned with centroids, hierarchical (also known as agglomerative) clustering tries to link each data point, by a distance measure, to its nearest neighbor, creating a cluster. By repeating this merging step according to a linkage method, the algorithm gathers all the available points into a steadily diminishing number of clusters, until in the end all the points reunite into a single group.

The results, if visualized, closely resemble the biological classifications of living beings that you may have studied in school: an upside-down tree whose branches all converge into a trunk.

Such a figurative tree is a dendrogram, and you see it used in medical and biological research. The Scikit-learn implementation of agglomerative clustering does not offer a way to depict a dendrogram from your data, because such a visualization technique works fine with only a few cases, whereas you can expect to work on many examples.
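
If you do want to look at a dendrogram, you can draw one with SciPy on a small sample. The following is only a minimal sketch on invented data; it assumes that SciPy and matplotlib are installed, and the array X is made up purely for illustration.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Invented data used only for illustration: 20 points in 2 dimensions
X = np.random.RandomState(1).rand(20, 2)

# Build the merge hierarchy with Ward linkage (Euclidean distance)
Z = linkage(X, method='ward')

# Plot the upside-down tree: leaves are observations, joins show
# which clusters merge and at what distance
dendrogram(Z)
plt.title('Dendrogram of a small synthetic sample')
plt.show()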

Compared to K-means, agglomerative algorithms are more cumbersome and do not scale well to large datasets, which makes them more suitable for smaller, statistical studies. They do offer the advantage of creating a complete range of nested cluster solutions.

To use agglomerative clustering effectively, you have to know about the different linkage methods and the distance metrics. There are three linkage methods:

  • Ward: Tends to look for spherical clusters, very cohesive inside and extremely differentiated from other groups. Another nice characteristic is that the method tends to find clusters of similar size. It works only with the Euclidean distance.

  • Complete: Links clusters using their furthest observations, that is, their most dissimilar data points. Consequently, clusters created using this method tend to be composed of highly similar observations, making the resulting groups quite compact.

  • Average: Links clusters using the average distance between all pairs of their observations, ignoring their boundaries. The method creates larger groups than the complete method. In addition, the clusters can be of different sizes and shapes, unlike Ward’s solutions. Consequently, this average, multipurpose approach sees successful use in the field of biological sciences.

There are also three distance metrics:

  • Euclidean (euclidean or l2): As seen in K-means

  • Manhattan (manhattan or l1): Similar to Euclidean, but the distance is calculated by summing the absolute value of the difference between the dimensions. In a map, if the Euclidean distance is the shortest route between two points, the Manhattan distance implies moving straight, first along one axis and then along the other — as a car in the city would, reaching a destination by driving along city blocks.

  • Cosine (cosine): A good choice when there are many variables and you worry that some of them may not be significant. Cosine distance reduces noise by taking the shape of the variables, more than their values, into account. It tends to associate observations that have the same maximum and minimum variables, regardless of their actual values (the short sketch after this list contrasts the three metrics on a pair of toy vectors).
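
To get a feel for how the three metrics differ, you can compute them on a pair of toy vectors with scikit-learn’s pairwise_distances function. The vectors a and b below are invented purely for illustration; b has the same shape as a but larger values.

import numpy as np
from sklearn.metrics import pairwise_distances

# Two invented observations: b = 2 * a, so same pattern, larger magnitude
a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 4.0, 6.0]])

# Straight-line distance
print(pairwise_distances(a, b, metric='euclidean'))  # about 3.74
# Sum of absolute differences along each dimension
print(pairwise_distances(a, b, metric='manhattan'))  # 6.0
# Shape of the variables, not their magnitude: 0 means the same direction
print(pairwise_distances(a, b, metric='cosine'))     # about 0.0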

If your dataset doesn’t contain too many observations, it’s worth trying agglomerative clustering with all the combinations of linkage and distance and then comparing the results carefully. In clustering, you rarely know the right answers in advance, and agglomerative clustering can provide you with another useful potential solution. For example, you can recreate the previous analysis with K-means and handwritten digits, using Ward linkage and the Euclidean distance, as follows (a sketch that sweeps all the linkage and distance combinations appears right after this example):

import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Affinity = {"euclidean", "l1", "l2", "manhattan", "cosine"}
# Linkage = {"ward", "complete", "average"}
# Note: newer scikit-learn versions name the affinity parameter metric
Hclustering = AgglomerativeClustering(n_clusters=10,
 affinity='euclidean', linkage='ward')
Hclustering.fit(Cx)

# Compare the cluster labels with the known digit labels
ms = np.column_stack((ground_truth, Hclustering.labels_))
df = pd.DataFrame(ms,
 columns=['Ground truth', 'Clusters'])
pd.crosstab(df['Ground truth'], df['Clusters'],
 margins=True)
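
As mentioned earlier, when the dataset is small you can try every linkage and distance combination and compare the solutions. The following is only a minimal sketch of that idea; it assumes the Cx and ground_truth variables from the preceding example and scores each combination with the adjusted Rand index, which is one reasonable choice among several.

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Ward works only with Euclidean distance, so it is listed separately
combos = [('ward', 'euclidean')] + [
    (linkage, metric)
    for linkage in ('complete', 'average')
    for metric in ('euclidean', 'manhattan', 'cosine')]

for linkage, metric in combos:
    model = AgglomerativeClustering(n_clusters=10,
        affinity=metric, linkage=linkage)
    labels = model.fit_predict(Cx)
    # A higher adjusted Rand index means closer agreement with the true digits
    print(linkage, metric, adjusted_rand_score(ground_truth, labels))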

The results of the Ward and Euclidean run are comparable to K-means, although you may have noticed that completing the analysis using this approach takes longer than using K-means. When working with a large number of observations, the computations for a hierarchical cluster solution may take hours to complete, making this solution less feasible. You can get around the time issue by using two-phase clustering, which is faster and provides you with a hierarchical solution even when you are working with large datasets.

To implement the two-phase clustering solution, you process the original observations using K-means with a large number of clusters. A good rule of thumb is to take the square root of the number of observations and use that figure, but you always have to keep the number of clusters in the range of 100–200 for the second phase, based on hierarchical clustering, to work well. The following example uses 100 clusters.

from sklearn.cluster import KMeans
# Phase one: compress the data into 100 K-means clusters,
# roughly the square root of the number of observations
clustering = KMeans(n_clusters=100, n_init=10,
 random_state=1)
clustering.fit(Cx)

At this point, the tricky part is to keep track of which case has been assigned to which cluster derived from K-means. You can use a dictionary for that purpose.

# Phase one output: the 100 centroids become the new, compressed dataset
Kx = clustering.cluster_centers_
# Map each original case to the K-means cluster it belongs to
Kx_mapping = {case: cluster for case, cluster
              in enumerate(clustering.labels_)}

The new dataset is Kx, which is made up of the cluster centroids that the K-means algorithm has discovered. You can think of each centroid as a compact summary of the original observations in its cluster. If you cluster the summary now, the result will be almost the same as clustering the original data.

from sklearn.cluster import AgglomerativeClustering
# Phase two: hierarchically cluster the 100 centroids into 10 groups
Hclustering = AgglomerativeClustering(n_clusters=10,
 affinity='cosine', linkage='complete')
Hclustering.fit(Kx)

You now map the results back to the K-means centroids you started from, so that you can easily determine which K-means centroids make up each hierarchical cluster. Each hierarchical cluster then consists of all the original observations belonging to the K-means clusters with those centroids.

# Map each K-means centroid to its hierarchical cluster
H_mapping = {case: cluster for case, cluster
             in enumerate(Hclustering.labels_)}
# Chain the two mappings: case -> K-means cluster -> hierarchical cluster
final_mapping = {case: H_mapping[Kx_mapping[case]]
                 for case in Kx_mapping}
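
If you want to check which K-means centroids ended up in a particular hierarchical cluster, you can simply filter H_mapping. This short snippet only illustrates the idea, reusing the dictionaries built above.

# List the K-means centroids assigned to hierarchical cluster 0
centroids_in_first = [centroid for centroid, cluster
                      in H_mapping.items() if cluster == 0]
print(centroids_in_first)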

Now you can evaluate the solution you obtained by using a confusion matrix, just as you did before for both K-means and hierarchical clustering.

# Cross-tabulate the two-phase cluster labels against the true digits
ms = np.column_stack((ground_truth,
 [final_mapping[n] for n in range(max(final_mapping)+1)]))
df = pd.DataFrame(ms,
 columns=['Ground truth', 'Clusters'])
pd.crosstab(df['Ground truth'], df['Clusters'],
 margins=True)

The result shows that this approach is a viable method for handling large datasets, or even big data, by reducing them to a smaller representation and then applying clustering techniques that scale less well but are more varied and precise.
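
If you prefer a single agreement score to a crosstab, you can also compute the adjusted Rand index for the two-phase solution. This is just a suggested extra check, reusing the ground_truth and final_mapping objects from the code above.

from sklearn.metrics import adjusted_rand_score

# Rebuild the label vector in case order from the chained mapping
two_phase_labels = [final_mapping[n] for n in range(max(final_mapping) + 1)]
# 1.0 means perfect agreement with the true digit labels, 0 means chance level
print(adjusted_rand_score(ground_truth, two_phase_labels))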