How to Use Factors or Numeric Data in R - dummies

By Andrie de Vries, Joris Meys

Before you attempt to describe your data in R, you have to make sure your data is in the right format. This means:

  • Making sure all your data is contained in a data frame (or in a vector if it’s a single variable)

  • Ensuring that all the variables are of the correct type

  • Checking that the values are all processed correctly
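A quick way to run through these three checks is sketched below. The data frame `people` and its columns are made up for illustration; only the base R functions are real.

```r
# Hypothetical example data; names and values are invented for illustration
people <- data.frame(
  age    = c(15, 16, 17, 15),
  gender = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# 1. Is the data contained in a data frame?
in_data_frame <- is.data.frame(people)

# 2. Does every variable have the type you expect?
var_types <- sapply(people, class)
print(var_types)

# 3. Were the values read in correctly? str() and summary() give a quick overview
str(people)
summary(people)
```

Here `gender` is still a character vector, which is a hint that it may be a candidate for conversion to a factor.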

Some data can have only a limited number of different values. For example, people can be either male or female, and you can describe most hair with only a few colors.

Sometimes more values are theoretically possible but not realistic. For example, cars can have more than 16 cylinders in their engines, but you won’t find many of them. In one way or another, all this data can be seen as categorical. By this definition, categorical data also includes ordinal data.
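In R, categorical data of this kind is typically stored as a factor, and ordinal data as an ordered factor. The following is a minimal sketch; the vectors `hair` and `size` and their values are assumptions made up for illustration.

```r
# Unordered categorical data: hair color (made-up values)
hair <- factor(c("brown", "blond", "black", "brown", "red"))
levels(hair)   # the distinct categories R found
table(hair)    # how often each category occurs

# Ordinal data: the categories have a natural order (made-up values)
size <- factor(
  c("small", "large", "medium", "small"),
  levels  = c("small", "medium", "large"),
  ordered = TRUE
)

# Comparisons are only meaningful for ordered factors
smaller_than_large <- size < "large"
print(smaller_than_large)
```

Specifying `levels` explicitly fixes the order of the categories instead of leaving it to R's default sorting.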

On the other hand, you have data that can have an unlimited number of possible values. This doesn’t necessarily mean that the values can be any value you like. For example, the mileage of a car is expressed in miles per gallon, often rounded to the nearest whole number. Yet the real value will be slightly different for every car.

The only thing that defines how many possible values you allow is the precision with which you express the data. Data that can be expressed with any chosen level of precision is continuous. Both interval-scaled data and ratio-scaled data are usually continuous data.
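The effect of precision on the number of possible values can be illustrated with a few lines of R. The mileage figures in `mpg_values` are invented for this sketch.

```r
# Hypothetical mileage measurements in miles per gallon
mpg_values <- c(21.4, 21.0, 22.8, 18.7)

# The same data expressed with less precision
mpg_rounded <- round(mpg_values)

# Fewer distinct values remain after rounding
n_distinct_raw     <- length(unique(mpg_values))
n_distinct_rounded <- length(unique(mpg_rounded))
print(c(raw = n_distinct_raw, rounded = n_distinct_rounded))
```

The underlying quantity is still continuous; only the way it is recorded limits the number of values you see.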

The distinction between categorical and continuous data isn’t always clear, though. Age is, in essence, a continuous variable, but it’s often expressed as the number of whole years since birth.

You still have a lot of possible values if you do that, but what happens if you look at the age of the kids at your local high school? Suddenly you have only five, maybe six, different values in your data. At that point, you may get more out of your analysis if you treat that data as categorical.
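As a rough sketch of the high school example, the vector `ages` below holds made-up ages. Counting the distinct values shows how few there are, and converting the vector to a factor lets you tabulate students per age.

```r
# Hypothetical ages of ten high school students
ages <- c(14, 15, 15, 16, 17, 16, 18, 15, 14, 17)

# How many different values are there really?
n_ages <- length(unique(ages))
print(n_ages)

# Treat age as categorical and count students per age
age_group  <- factor(ages)
age_counts <- table(age_group)
print(age_counts)
```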

When describing your data, you need to make the distinction between data that benefits from being converted to a factor and data that needs to stay numeric. If you can view your data as categorical, converting it to a factor helps with analyzing it.
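One possible illustration uses the `mtcars` dataset that ships with R. Choosing this dataset is an assumption of this sketch, not something the text above prescribes. The number of cylinders (`cyl`) is converted to a factor, while the mileage (`mpg`) stays numeric.

```r
# Work on a copy of two columns from the built-in mtcars dataset
car_data <- mtcars[, c("mpg", "cyl")]

# Cylinders take only a few values, so treat them as categorical
car_data$cyl <- factor(car_data$cyl)

# summary() adapts to the type: quartiles for numeric data, counts for factors
summary(car_data$mpg)
summary(car_data$cyl)

# A factor also makes it easy to describe a numeric variable per category
mean_mpg_by_cyl <- tapply(car_data$mpg, car_data$cyl, mean)
print(mean_mpg_by_cyl)
```

Keeping `mpg` numeric preserves the ability to compute means and quantiles, while the factor `cyl` provides the grouping.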