How to Create Subgroups of Data in R
The cut() function in R divides the range of your data into bins of equal width (by default, and not necessarily with equal counts) and then classifies each element into its appropriate bin.
If this sounds like a mouthful, don’t worry. A few examples should make this come to life.
How to use cut to create a fixed number of subgroups
To illustrate the use of cut(), have a look at the built-in dataset state.x77, a numeric matrix with several columns and one row for each state in the United States:
> head(state.x77)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
You want to work with the column called Frost. To extract this column, try the following:
> frost <- state.x77[, "Frost"]
> head(frost, 5)
   Alabama     Alaska    Arizona   Arkansas California
        20        152         15         65         20
You now have a new object, frost, a named numeric vector. Now use cut() to create three bins in your data:
> cut(frost, 3, include.lowest=TRUE)
 [1] [-0.188,62.6] (125,188]     [-0.188,62.6] (62.6,125]
 [5] [-0.188,62.6] (125,188]     (125,188]     (62.6,125]
....
[45] (125,188]     (62.6,125]    [-0.188,62.6] (62.6,125]
[49] (125,188]     (125,188]
Levels: [-0.188,62.6] (62.6,125] (125,188]
The result is a factor with three levels. The names of the levels seem a bit complicated, but they tell you in mathematical set notation what the boundaries of your bins are: a square bracket means the boundary value is included, and a parenthesis means it is excluded. For example, the first bin contains those states that have between -0.188 and 62.6 days of frost.
In reality, of course, no state has a negative number of frost days. R is being mathematically conservative and pads the range slightly at both ends, so that the smallest and largest values are sure to fall inside a bin.
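The size of that padding can be checked by hand. The following is a minimal sketch, assuming the rule given in the cut() help page: when you ask for a number of bins, the outer limits are moved outward by 0.1% of the range of the data. The names rng and pad are illustrative only.

```r
frost <- state.x77[, "Frost"]

rng <- range(frost)       # smallest and largest number of frost days
pad <- diff(rng) / 1000   # assumed rule: 0.1% of the range of the data

rng[1] - pad              # lower limit of the first bin: -0.188
rng[2] + pad              # upper limit of the last bin, printed as 188 in the level name
```

The lower limit matches the -0.188 that appears in the first level name.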
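Note the argument include.lowest=TRUE to cut(). The default value for this argument is include.lowest=FALSE. With the default, the first bin is open on the left, so a value exactly equal to the lowest break falls outside every bin and comes back as NA. This matters most when you supply your own break points.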
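To see the effect of include.lowest, it helps to pass explicit break points whose lowest value coincides with the minimum of the data. This is a small sketch; the break points in my_breaks are hypothetical and chosen only for illustration.

```r
frost <- state.x77[, "Frost"]

# Hypothetical break points; the lowest break equals the minimum of frost (0).
my_breaks <- c(0, 60, 120, 188)

default_bins <- cut(frost, my_breaks)                         # include.lowest = FALSE
closed_bins  <- cut(frost, my_breaks, include.lowest = TRUE)

lowest <- which.min(frost)   # position of the state with the fewest frost days

default_bins[lowest]         # NA, because 0 is not inside (0,60]
closed_bins[lowest]          # [0,60], because the first bin is now closed on the left
sum(is.na(default_bins))     # number of values the default left unclassified
```

With include.lowest=TRUE, every value from the lowest break to the highest break lands in a bin.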
How to add labels to cut
The level names aren’t very user-friendly, so specify some better names with the labels argument, supplying one label per bin:
> cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))
[1] Low High Low Med Low High High Med Low Low Low
....
[45] High Med Low Med High High
Levels: Low Med High
Now you have a factor that classifies states into low, medium, and high, depending on the number of days of frost they get.
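Two related options of cut() are worth knowing, sketched here under the assumption that you want either plain bin numbers or an explicit ordering. Setting labels=FALSE returns integer codes instead of a factor, and ordered_result=TRUE makes the factor ordered so that comparisons between levels work. The object names codes and ranked are illustrative.

```r
frost <- state.x77[, "Frost"]

# labels = FALSE returns the bin number of each element instead of a factor.
codes <- cut(frost, 3, include.lowest = TRUE, labels = FALSE)
head(codes)            # 1 3 1 2 1 3

# ordered_result = TRUE yields an ordered factor: Low < Med < High.
ranked <- cut(frost, 3, include.lowest = TRUE,
              labels = c("Low", "Med", "High"), ordered_result = TRUE)
levels(ranked)         # "Low" "Med" "High"
ranked[1] < ranked[2]  # TRUE: Alabama (Low) ranks below Alaska (High)
```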
How to use table to count the number of observations
One interesting piece of analysis is to count how many states are in each bracket. You can do this with the table() function, which simply counts the number of observations in each level of your factor:
> x <- cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))
> table(x)
x
 Low  Med High
  11   19   20
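Once the factor exists, you can use it to form the subgroups themselves, not just count them. The following sketch uses split(), prop.table(), and tapply() from base R; the name states_by_group is illustrative. The factor returned by cut() does not carry the state names, so they are taken from the original frost vector.

```r
frost <- state.x77[, "Frost"]
x <- cut(frost, 3, include.lowest = TRUE, labels = c("Low", "Med", "High"))

# Group the state names by bin; names come from frost, not from x.
states_by_group <- split(names(frost), x)
states_by_group$Low        # the states in the "Low" bin

# Share of states in each bin, and the mean number of frost days per bin.
prop.table(table(x))
tapply(frost, x, mean)
```

The counts differ across bins because cut() makes the bins equally wide, not equally populated.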









