How to Work with Factors and Numeric Vectors in R Models and Calculations - dummies

If you work with factors in R that have numeric values as levels, you have to be extra careful when using these factors in models and other calculations. For example, you convert the number of cylinders in the built-in dataset mtcars to a factor like this:

> cyl.factor <- as.factor(mtcars$cyl)

If you want to know the median number of cylinders, you may be tempted to do the following:

> median(as.numeric(cyl.factor))
[1] 2

This result is bogus, because the minimum number of cylinders is four. R converts the internal integer codes of the factor to numbers, not the labels. So, you get numbers running from one to the number of levels instead of the original values.
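To see where the bogus value comes from, you can compare the labels of the factor with its internal codes. The following is a small sketch on the same mtcars data; the variable name codes is just an illustrative choice:

```r
# Recreate the factor from the built-in mtcars dataset
cyl.factor <- as.factor(mtcars$cyl)

# The labels that you see when you print the factor
levels(cyl.factor)
# [1] "4" "6" "8"

# The internal integer codes that as.numeric() actually returns
codes <- sort(unique(as.numeric(cyl.factor)))
codes
# [1] 1 2 3
```

The median of 2 is simply the code of the second level, which carries the label 6.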

To correctly transform a factor back to its original numeric values, you can first transform the factor to character and then to numeric. But on very big data, this is done faster with the following construct:

> as.numeric(levels(cyl.factor))[cyl.factor]

With this code, you create a short vector with the levels as numeric values, and then use the internal integer representation of the factor to select the correct value.
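As a quick check, you can run both conversions side by side and confirm that they agree. This is only a sketch; the names via.char and via.levels are made up for the illustration:

```r
cyl.factor <- as.factor(mtcars$cyl)

# Route 1: factor -> character -> numeric
via.char <- as.numeric(as.character(cyl.factor))

# Route 2: convert the few levels once, then index them with the factor
via.levels <- as.numeric(levels(cyl.factor))[cyl.factor]

# Both routes recover the original values
identical(via.char, via.levels)
# [1] TRUE

# The median now refers to real cylinder counts
median(via.levels)
# [1] 6
```

The second route converts only the three levels from text to numbers instead of every element of the vector, which is where the speed gain on big data comes from.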

Although R often converts a numeric vector to a factor automatically when necessary, it doesn’t do so if both numeric vectors and factors can be used. If you want to model, for example, the mileage of a car as a function of the number of cylinders, you get a different model depending on whether you use the number of cylinders as a numeric vector or as a factor.

The interpretation of both models is completely different, and a lot depends on what exactly you want to do. But you have to be aware of the difference, or you may end up interpreting the wrong model.
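One way to see the difference is to fit both versions and compare the coefficients. The sketch below assumes a simple linear model with lm(), which is only one of the models you might use; the names fit.numeric and fit.factor are illustrative:

```r
# Cylinders as a numeric vector: one slope for the whole range
fit.numeric <- lm(mpg ~ cyl, data = mtcars)
names(coef(fit.numeric))
# [1] "(Intercept)" "cyl"

# Cylinders as a factor: one coefficient per level beyond the first
fit.factor <- lm(mpg ~ factor(cyl), data = mtcars)
names(coef(fit.factor))
# [1] "(Intercept)"  "factor(cyl)6" "factor(cyl)8"
```

In the numeric version, the cyl coefficient is the change in mileage per extra cylinder. In the factor version, each coefficient is the difference in mean mileage between that group and the four-cylinder cars.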