How to Prepare the Data in R Regression for Predictive Analytics - dummies

# How to Prepare the Data in R Regression for Predictive Analytics

You have to get the data into a form that the algorithm can use to build a predictive analytical model. To do so, take some time to understand the data and its structure. Type in the `str` function to find out the structure of the data. The command and its output look like this:

```
> str(autos)
'data.frame':    398 obs. of 9 variables:
 $ V1: num 18 15 18 16 17 15 14 14 14 15 ...
 $ V2: int 8 8 8 8 8 8 8 8 8 8 ...
 $ V3: num 307 350 318 304 302 429 454 440 455 390 ...
 $ V4: chr "130.0" "165.0" "150.0" "150.0" ...
 $ V5: num 3504 3693 3436 3433 3449 ...
 $ V6: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ V7: int 70 70 70 70 70 70 70 70 70 70 ...
 $ V8: int 1 1 1 1 1 1 1 1 1 1 ...
 $ V9: Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...
```

From looking at the structure, you can tell that there is some data preparation and cleanup to do. Here’s a list of the needed tasks:

• Rename the columns.

This is not strictly necessary, but for the purposes of this example, it’s better to use column names you can understand and remember.

• Change the data type of V4 (horsepower) to a numeric data type.

In this example, horsepower is a continuous numerical value and not a character data type.

• Handle missing values.

Here horsepower has six missing values.

• Change the attributes that have discrete values to factors.

Here cylinders, model year, and origin have discrete values.

• Discard the V9 (car name) attribute.

Here car name doesn’t add value to the model that you’re creating. If the origin attribute weren’t given, you could have derived the origin from the car name attribute.
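One way to confirm the items on this list before changing anything is to count the placeholder values and the distinct values in each column. The sketch below is only an illustration: `toy` is a made-up four-row data frame (not the real `autos` data) that borrows three of the default column names, and its values are invented.

```r
# Hypothetical miniature stand-in for the autos data (values invented)
toy <- data.frame(
  V2 = c(8L, 4L, 6L, 4L),              # cylinders
  V4 = c("130.0", "?", "150.0", "?"),  # horsepower, read in as text
  V8 = c(1L, 3L, 1L, 2L),              # origin
  stringsAsFactors = FALSE
)

# Count the "?" placeholders that stand for missing horsepower values
missing_hp <- sum(toy$V4 == "?")

# Count distinct values per column; a small count hints at a discrete attribute
distinct_counts <- sapply(toy, function(col) length(unique(col)))

print(missing_hp)
print(distinct_counts)
```

Running the same two checks against the real data frame would be one way to see how many placeholder values horsepower contains and how few distinct values cylinders, model year, and origin take.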

To rename the columns, type in the following code:

`> colnames(autos) <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","modelYear","origin","carName")`

Next, change the data type of horsepower to numeric with the following code:

`> autos$horsepower <- as.numeric(autos$horsepower)`

R will issue a warning (`NAs introduced by coercion`) because not all the values in horsepower were string representations of numbers. Some missing values were represented as the “?” character. That’s fine for now because R converts each instance of ? into NA.
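To see the coercion in isolation, here is a minimal sketch on a made-up character vector. The name `hp_raw` and its three values are assumptions for illustration, not rows from the data set.

```r
# Hypothetical sample of horsepower strings, including one "?" placeholder
hp_raw <- c("130.0", "?", "150.0")

# as.numeric() parses the numeric strings and turns "?" into NA;
# R emits a coercion warning while doing so
hp_num <- as.numeric(hp_raw)

print(hp_num)
print(sum(is.na(hp_num)))  # number of values that could not be parsed
```

This behavior depends on the column being a character vector, as the `str` output shows for V4. If the column had been read in as a factor instead, `as.numeric` would return the internal level codes, so converting with `as.character` first would be the safer route.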

A common way to handle the missing values of continuous variables is to replace each missing value with the mean of the entire column. The following line of code does that:

`> autos$horsepower[is.na(autos$horsepower)] <- mean(autos$horsepower,na.rm=TRUE)`

It’s important to have `na.rm=TRUE` in the mean function. It tells the function to ignore NA values in its computation. Without it, the function returns NA.
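The effect of `na.rm` is easy to check on a small vector. The following sketch uses an invented vector named `hp` (an assumption for illustration) and then applies the same mean-imputation idiom to it.

```r
# Hypothetical horsepower values with one missing entry
hp <- c(130, NA, 150)

mean_default <- mean(hp)                # NA: the missing value propagates
mean_no_na   <- mean(hp, na.rm = TRUE)  # 140: the NA is dropped before averaging

# Replace each NA with the mean of the observed values
hp[is.na(hp)] <- mean_no_na

print(hp)
```

Mean imputation keeps every row and leaves the column mean unchanged, but it also shrinks the column's variance a little, which is worth keeping in mind when many values are missing.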

Next, change the attributes with discrete values to factors. Three attributes have been identified as discrete. The following three lines of code change the attributes.

```
> autos$origin <- factor(autos$origin)
> autos$modelYear <- factor(autos$modelYear)
> autos$cylinders <- factor(autos$cylinders)
```
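As a small illustration of what the conversion does, the sketch below turns an invented integer vector (`cyl`, an assumption and not the real column) into a factor and inspects the result. Once a column is a factor, R's modeling functions treat its values as categories rather than as points on a numeric scale.

```r
# Hypothetical cylinder counts stored as integers
cyl <- c(8L, 4L, 6L, 8L, 4L)

cyl_factor <- factor(cyl)

print(levels(cyl_factor))      # the distinct values, sorted, as character labels
print(table(cyl_factor))       # how many entries fall into each level
print(is.numeric(cyl_factor))  # FALSE: a factor is no longer treated as a number
```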

Finally, remove the carName attribute from the data frame with this line of code:

`> autos$carName <- NULL`

At this point, you’ve finished preparing the data for the modeling process. The following is a view of the structure after the data-preparation process:

```
> str(autos)
'data.frame': 398 obs. of 8 variables:
 $ mpg         : num 18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : num 130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : num 3504 3693 3436 3433 3449 ...
 $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ modelYear   : Factor w/ 13 levels "70","71","72",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ origin      : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
```
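If you expect to repeat the preparation, one option is to collect the steps in a single function. The sketch below is an assumption-laden illustration: `prepare_autos` is a hypothetical helper name, and `raw` is an invented four-row data frame standing in for the freshly loaded data, so the printed structure will not match the real output above.

```r
# Hypothetical four-row stand-in for the raw autos data (values invented)
raw <- data.frame(
  V1 = c(18, 15, 25, 30),
  V2 = c(8L, 8L, 4L, 4L),
  V3 = c(307, 350, 98, 85),
  V4 = c("130.0", "165.0", "?", "65.0"),
  V5 = c(3504, 3693, 2046, 1975),
  V6 = c(12, 11.5, 19, 19.4),
  V7 = c(70L, 70L, 71L, 81L),
  V8 = c(1L, 1L, 1L, 3L),
  V9 = c("car a", "car b", "car c", "car d"),
  stringsAsFactors = FALSE
)

# Hypothetical helper that applies the preparation steps in order
prepare_autos <- function(df) {
  colnames(df) <- c("mpg", "cylinders", "displacement", "horsepower",
                    "weight", "acceleration", "modelYear", "origin", "carName")
  # "?" becomes NA; the coercion warning is expected, so it is silenced here
  df$horsepower <- suppressWarnings(as.numeric(df$horsepower))
  # Fill missing horsepower values with the mean of the observed ones
  df$horsepower[is.na(df$horsepower)] <- mean(df$horsepower, na.rm = TRUE)
  # Treat the discrete attributes as categories
  for (col in c("cylinders", "modelYear", "origin")) {
    df[[col]] <- factor(df[[col]])
  }
  df$carName <- NULL  # drop the car name column
  df
}

clean <- prepare_autos(raw)
str(clean)
```

A quick sanity check afterward, such as `any(is.na(clean))`, can confirm that no missing values remain before modeling.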