
# How K-Means Clustering Works for R Programming

To introduce k-means clustering for R programming, you start by working with the `iris` data frame, which ships with the base R installation. The data set consists of 50 flowers from each of three `iris` species (setosa, versicolor, and virginica). The data frame columns are `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`.

For this discussion, you’re concerned with only `Petal.Length`, `Petal.Width`, and `Species`. That way, you can visualize the data in two dimensions.

The image below plots the `iris` data frame with `Petal.Length` on the x-axis, `Petal.Width` on the y-axis, and `Species` as the color of the plotting character.
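A plot like the one described can be sketched with base R graphics (the original figure may have been produced with a different package, so treat this as one possible rendering):

```r
# Plot Petal.Length (x) against Petal.Width (y), coloring each point
# by its Species. iris is built into R, so no data loading is needed.
plot(iris$Petal.Length, iris$Petal.Width,
     col = as.integer(iris$Species), pch = 19,
     xlab = "Petal.Length", ylab = "Petal.Width",
     main = "iris petal measurements by species")

# Add a legend mapping colors back to species names
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)
```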

In k-means clustering, you first specify how many clusters you think the data fall into. Looking at the plot, a reasonable assumption is 3 — the number of species. The next step is to randomly assign each data point (corresponding to a row in the data frame) to a cluster. Then find the central point of each cluster. ML honchos refer to this center as the centroid. The x-value of the centroid is the mean of the x-values of the points in the cluster, and the y-value of the centroid is the mean of the y-values of the points in the cluster.
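In R, the built-in `kmeans()` function carries out this whole procedure. A minimal sketch for the two petal columns (the seed value here is arbitrary; k-means starts from random assignments, so setting a seed just makes the run reproducible):

```r
set.seed(123)  # k-means begins with random assignments, so fix the seed

# Cluster the 150 flowers into 3 clusters using only the petal measurements
km <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)

# One centroid per cluster: the mean Petal.Length and mean Petal.Width
# of the points assigned to that cluster
km$centers

# Cross-tabulate cluster membership against the actual species
table(km$cluster, iris$Species)
```

The cross-tabulation shows how well the three clusters line up with the three species.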

It’s also possible to calculate a centroid for the entire set of observations. Its x-coordinate is the average of every data point’s x-coordinate (`Petal.Length`, in this example), and its y-coordinate is the average of every data point’s y-coordinate (`Petal.Width`, in this example). The sum of squared distances from each point to this overall centroid is called the total sum of squares. The sum of squared distances from each cluster centroid to the overall centroid — with each distance weighted by the number of points in that cluster — is the between sum of squares.
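Both quantities can be computed by hand and checked against what `kmeans()` reports in its `totss` and `betweenss` components. A sketch, using the same two petal columns as before:

```r
# Work with the two petal columns as a numeric matrix
pts <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])

# Centroid of the entire set of observations
overall <- colMeans(pts)

# Total sum of squares: squared distances from every point
# to the overall centroid
tot_ss <- sum(sweep(pts, 2, overall)^2)

set.seed(123)  # arbitrary seed, just for a reproducible run
km <- kmeans(pts, centers = 3)

# Between sum of squares: squared distance from each cluster centroid
# to the overall centroid, weighted by that cluster's size
between_ss <- sum(km$size * rowSums(sweep(km$centers, 2, overall)^2))

all.equal(tot_ss, km$totss)         # TRUE: matches kmeans() output
all.equal(between_ss, km$betweenss) # TRUE
```

The ratio of between sum of squares to total sum of squares is a rough measure of how much of the variability in the data the clustering accounts for.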