How to Create Subgroups of Data in R - dummies

By Andrie de Vries, Joris Meys

The cut() function in R divides the range of your data into bins of equal width (when you ask for a number of bins) and then classifies each element into its appropriate bin.

If this sounds like a mouthful, don’t worry. A few examples should make this come to life.
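
As a minimal sketch of the idea before turning to real data, the following asks cut() for two bins on a small made-up vector. The name ages and its values are hypothetical and are not part of the dataset used in the rest of this article.

```r
# Hypothetical toy data, used only for illustration
ages <- c(1, 4, 6, 9)

# Ask for two bins of equal width across the range of the data
age_bins <- cut(ages, 2)

# The result is a factor with one entry per element of ages
is.factor(age_bins)    # TRUE
nlevels(age_bins)      # 2
as.integer(age_bins)   # 1 1 2 2: the bin number of each element
```

The values 1 and 4 land in the lower half of the range and 6 and 9 in the upper half, which is all that "classifying each element into its bin" means.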

How to use cut to create a fixed number of subgroups

To illustrate the use of cut(), have a look at the built-in dataset state.x77, a matrix with eight columns and one row for each state in the United States:

> head(state.x77)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
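
If you want to confirm the shape of the dataset before extracting anything, a short check along these lines may help. It uses only base R functions and assumes nothing beyond state.x77 being available, as it is in a standard R installation.

```r
# state.x77 ships with base R (package datasets)
is.matrix(state.x77)   # TRUE: a numeric matrix, not a data frame
dim(state.x77)         # 50 rows (states) and 8 columns
colnames(state.x77)    # the column names, including "Frost"
```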

You want to work with the column called Frost. To extract this column, try the following:

> frost <- state.x77[, "Frost"]
> head(frost, 5)
   Alabama     Alaska    Arizona   Arkansas California
        20        152         15         65         20

You now have a new object, frost, a named numeric vector. Now use cut() to create three bins in your data:

> cut(frost, 3, include.lowest=TRUE)
 [1] [-0.188,62.6] (125,188]   [-0.188,62.6] (62.6,125]
 [5] [-0.188,62.6] (125,188]   (125,188]   (62.6,125]
[45] (125,188]   (62.6,125]  [-0.188,62.6] (62.6,125]
[49] (125,188]   (125,188]
Levels: [-0.188,62.6] (62.6,125] (125,188]

The result is a factor with three levels. The level names look a bit complicated, but they tell you in mathematical interval notation what the boundaries of your bins are: a square bracket means the endpoint is included, and a parenthesis means it is excluded. For example, the first bin contains those states that have between -0.188 and 62.6 days of frost.

In reality, of course, no state can have a negative number of frost days. R extends the range of the data by 0.1 percent at each end, so that the smallest and largest values fall safely inside the outer bins.
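
To see where the value -0.188 comes from, the sketch below recomputes it by hand. It relies on the behavior described on the cut() help page, where the outer limits are moved away by 0.1 percent of the range. The names rng and pad are chosen here for illustration only, and the inner boundaries printed by levels() may differ slightly between R versions.

```r
frost <- state.x77[, "Frost"]

rng <- range(frost)       # smallest and largest number of frost days
pad <- diff(rng) / 1000   # 0.1 percent of the range

rng[1] - pad              # lower edge of the first bin: -0.188

# Compare with the boundaries that cut() reports
levels(cut(frost, 3, include.lowest = TRUE))
```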

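Note the argument include.lowest=TRUE to cut(). The default value for this argument is include.lowest=FALSE. With the default, each bin excludes its lower boundary, so a value exactly equal to the lowest break point is not assigned to any bin and becomes NA. This matters most when you supply the break points yourself.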
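
The effect of include.lowest is easiest to see with explicit break points. In this sketch the vector v and the break points b are made up for illustration; the value 0 sits exactly on the lowest break.

```r
v <- c(0, 5, 10)   # hypothetical values
b <- c(0, 5, 10)   # explicit break points

# Default: bins are (0,5] and (5,10], so 0 belongs to neither and becomes NA
cut(v, breaks = b)

# With include.lowest = TRUE the first bin is [0,5], so 0 is kept
cut(v, breaks = b, include.lowest = TRUE)
```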

How to add labels to cut

The level names aren’t very user-friendly, so specify some better names with the labels argument, giving one label per bin in order from lowest to highest:

> cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))
 [1] Low High Low Med Low High High Med Low Low Low
[45] High Med Low Med High High
Levels: Low Med High

Now you have a factor that classifies states into low, medium, and high, depending on the number of days of frost they get.
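
As a quick sanity check, you can confirm that adding labels only renames the levels and does not change which bin a state falls into. The object names bins and named below are illustrative. Note that cut() does not keep the names of the input vector, so the sketch looks states up through names(frost).

```r
frost <- state.x77[, "Frost"]

bins  <- cut(frost, 3, include.lowest = TRUE)
named <- cut(frost, 3, include.lowest = TRUE,
             labels = c("Low", "Med", "High"))

# Labels are applied in bin order, so the underlying bin numbers are identical
identical(as.integer(bins), as.integer(named))   # TRUE

# Look up individual states by position
named[names(frost) == "Hawaii"]   # Low  (0 days of frost)
named[names(frost) == "Alaska"]   # High (152 days of frost)
```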

How to use table to count the number of observations

One interesting piece of analysis is to count how many states are in each bracket. You can do this with the table() function, which simply counts the number of observations in each level of your factor.

> x <- cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))
> table(x)
x
 Low  Med High
  11   19   20
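
Once you have the factor, the same grouping can feed other base R functions. The following sketch, with the illustrative names counts and by_bin, turns the counts into proportions and lists the states in one of the subgroups.

```r
frost <- state.x77[, "Frost"]
x <- cut(frost, 3, include.lowest = TRUE, labels = c("Low", "Med", "High"))

counts <- table(x)
sum(counts)                    # 50: every state lands in exactly one bin
round(prop.table(counts), 2)   # share of states in each bin

# State names grouped by bin
by_bin <- split(names(frost), x)
by_bin$Low                     # the 11 states with the fewest frost days
```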