Estimating Data Clusters with Kernel Density Estimation

By Lillian Pierson

One way to identify clusters in your data is to use a density smoothing function. Kernel density estimation (KDE) is just such a smoothing method; it works by placing a kernel — a weighting function that is useful for quantifying density — on each data point in the data set and then summing the kernels to generate a kernel density estimate for the overall region.

Areas where points are packed closely together sum to a high kernel density, while areas where points are sparse sum to a low kernel density.
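The kernel-summing idea can be sketched in a few lines of Python. This is only a minimal one-dimensional illustration, not the method used to produce the article's figure: the Gaussian kernel, the function names `gaussian_kernel` and `kde`, the sample values, and the bandwidth of 0.5 are all assumptions chosen for the example.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the weighting function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, points, bandwidth):
    """Sum one scaled kernel per data point and normalize to a density at x."""
    total = sum(gaussian_kernel((x - p) / bandwidth) for p in points)
    return total / (len(points) * bandwidth)

# Hypothetical 1-D sample: a tight group near 1.0 and a lone point at 5.0.
points = [0.9, 1.0, 1.1, 1.2, 5.0]
bandwidth = 0.5  # assumed smoothing width; larger values blur groups together

dense = kde(1.05, points, bandwidth)   # evaluated inside the tight group
sparse = kde(3.0, points, bandwidth)   # evaluated in the gap between groups
print(f"density near the group: {dense:.3f}")
print(f"density in the gap:     {sparse:.3f}")
```

The estimate inside the tight group comes out higher than the estimate in the gap, because four kernels overlap there while the gap receives only the faint tails of its neighbors.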

Because kernel smoothing methods don’t depend on placing cluster centers, they carry no risk of generating erroneous clusters by putting a center in an area of locally minimal density.

Whereas k-means algorithms draw hard boundaries between points in different clusters, KDE produces a plot of gradual density change across the data. For this reason, it’s a helpful aid when eyeballing clusters. The following figure shows what the World Bank Income and Education scatter plot looks like after a KDE has been applied.

[Figure: image0.jpg, the World Bank Income and Education scatter plot after a KDE has been applied]

You can see that the white spaces between clusters have been reduced. Looking at the figure, it’s fairly obvious that there are at least three clusters, and possibly more, if you want to allow for small clusters.
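Eyeballing can be backed up with a rough numerical check: evaluate the density on a grid and count its local maxima. The sketch below does this in one dimension with made-up values. It does not use the World Bank data, and the helper `density_peaks`, the grid range, and the bandwidth of 0.5 are assumptions for illustration. The number of peaks depends heavily on the bandwidth, so treat the count as a hint rather than a verdict.

```python
import math

def kde(x, points, bandwidth):
    """Gaussian kernel density estimate at x (one kernel per data point)."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

def density_peaks(points, bandwidth, lo, hi, steps=240):
    """Return grid positions where the estimated density is a local maximum."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ds = [kde(x, points, bandwidth) for x in xs]
    return [xs[i] for i in range(1, steps) if ds[i - 1] < ds[i] > ds[i + 1]]

# Hypothetical 1-D values forming three loose groups (not the World Bank data).
values = [0.8, 1.0, 1.1, 1.2, 4.9, 5.0, 5.1, 5.2, 8.8, 9.0, 9.3]

peaks = density_peaks(values, bandwidth=0.5, lo=-1.0, hi=11.0)
print("number of density peaks:", len(peaks))
print("peak locations:", [round(p, 2) for p in peaks])
```

A smaller bandwidth tends to split groups into extra peaks, which corresponds to allowing for small clusters, while a larger one merges neighboring groups.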