How to Count Unique Data Values in R
To figure out what data can be factored when working in R, let’s take a look at the dataset mtcars. This built-in dataset describes fuel consumption and ten aspects of automobile design and performance for 32 cars from the 1973-74 model years. It contains 11 variables in total, all of them numeric.
Although you can work with the data frame as is, some variables could be converted to factors because they take only a limited number of distinct values.
If you don’t know how many different values a variable has, you can get this information in two simple steps:
Get the unique values of the variable using unique().
Get the length of the resulting vector using length().
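As a minimal sketch of these two steps on a single variable, the code below uses the cyl column of mtcars. The names cyl_values and n_cyl are chosen only for illustration.

```r
# Step 1: get the distinct values of cyl (number of cylinders)
cyl_values <- unique(mtcars$cyl)

# Step 2: count how many distinct values there are
n_cyl <- length(cyl_values)

print(cyl_values)
print(n_cyl)
```

Note that unique() returns the values in the order in which they first appear in the data, not in sorted order.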
Using the sapply() function, you can do this for the whole data frame at once. You apply an anonymous function that combines both steps to every variable in the data frame, like this:
> sapply(mtcars, function(x) length(unique(x)))
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
  25    3   27   22   22   29   30    2    2    3    6
So, it looks like the variables cyl, vs, am, gear, and carb can benefit from a conversion to factor.
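One possible way to carry out that conversion is sketched below. It works on a copy so that the built-in dataset stays untouched. The names mtcars_f and factor_vars are illustrative assumptions, not part of the original example.

```r
# Work on a copy so the original mtcars stays numeric
mtcars_f <- mtcars

# The variables with only a few distinct values
factor_vars <- c("cyl", "vs", "am", "gear", "carb")

# Convert each selected column to a factor
mtcars_f[factor_vars] <- lapply(mtcars_f[factor_vars], factor)

# Inspect the result: these columns are now factors
str(mtcars_f[factor_vars])
```

Using lapply() with a vector of column names keeps the conversion in one place, so adding or removing a variable only means editing factor_vars.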
The dataset has 32 observations, and the largest count above is 30 (for qsec), so no variable consists of unique values only.
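If you would rather check this in code than by eye, a short sketch is to compare each count with the number of rows. The names n_unique and all_unique are illustrative.

```r
# Number of distinct values per variable
n_unique <- sapply(mtcars, function(x) length(unique(x)))

# TRUE for a variable whose values are all different
all_unique <- n_unique == nrow(mtcars)

# Is there any variable with unique values only?
print(any(all_unique))
```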
When to treat a variable like a factor depends a bit on the situation, but, as a general rule, avoid more than ten different levels in a factor and try to have at least five observations per level.
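To check a variable against this rule of thumb, you can count the observations per value with table(). The sketch below does this for carb. The threshold of five comes from the rule above, and the names level_counts and small_levels are illustrative.

```r
# Count how many observations fall into each value of carb
level_counts <- table(mtcars$carb)
print(level_counts)

# Values with fewer than five observations
small_levels <- names(level_counts)[as.vector(level_counts) < 5]
print(small_levels)
```

Here carb stays under ten levels, but some of its values occur fewer than five times, so whether to convert it is a judgment call.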