How to Prepare the Data in an R Classification Predictive Analytics Model

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

To run a predictive analysis, you have to get the data into a form that the algorithm can use to build a model. That means taking some time to understand the data and its structure. Type in the str function to display the structure of the data. Here’s what it looks like:

> str(seeds)
'data.frame':   210 obs. of 8 variables:
 $ V1: num  15.3 14.9 14.3 13.8 16.1 ...
 $ V2: num  14.8 14.6 14.1 13.9 15 ...
 $ V3: num  0.871 0.881 0.905 0.895 0.903 ...
 $ V4: num  5.76 5.55 5.29 5.32 5.66 ...
 $ V5: num  3.31 3.33 3.34 3.38 3.56 ...
 $ V6: num  2.22 1.02 2.7 2.26 1.35 ...
 $ V7: num  5.22 4.96 4.83 4.8 5.17 ...
 $ V8: int  1 1 1 1 1 1 1 1 1 1 ...

From looking at the structure, you can tell that the data needs one pre-processing step and one convenience step:

  • Rename the columns. This is not strictly necessary, but for the purposes of this example, it’s more convenient to use column names you can understand and remember.

  • Change the attribute with categorical values to a factor. The label (the last column, V8) is stored as an integer, but it has three possible categories.
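
One way to check which column holds category labels rather than measurements is to look at each column's class and its number of distinct values. The following is a minimal sketch on a hypothetical toy data frame (the name toy and its values are made up for illustration; it is not the real seeds data):

```r
# Hypothetical toy data frame (not the real seeds data): two measurement
# columns and an integer label column, mirroring V1, V2 and V8.
toy <- data.frame(
  V1 = c(15.3, 14.9, 14.3, 13.8, 16.1, 17.2),
  V2 = c(14.8, 14.6, 14.1, 13.9, 15.0, 15.4),
  V8 = c(1L, 1L, 2L, 2L, 3L, 3L)
)

# Storage class of each column.
col_classes <- sapply(toy, class)
print(col_classes)

# Number of distinct values per column. A small count on an integer column
# suggests a category label rather than a measurement.
distinct_counts <- sapply(toy, function(col) length(unique(col)))
print(distinct_counts)
```

On the real data, the same two calls applied to seeds should point to V8 as the only integer column with just a few distinct values.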

To rename the columns, type in the following code:

> colnames(seeds) <-
    c("area", "perimeter", "compactness", "length",
      "width", "asymmetry", "length2", "seedType")
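
Because colnames is assigned a plain character vector, a miscounted list of names is an easy mistake to make. The sketch below adds a length check before renaming. It assumes a small stand-in data frame built from the first five rows shown in the str output (the real data has 210 rows), and the new_names variable is introduced here only for illustration:

```r
# Stand-in for the seeds data frame: only the first five rows shown in the
# str() output (assumption: enough to illustrate the renaming step).
seeds <- data.frame(
  V1 = c(15.3, 14.9, 14.3, 13.8, 16.1),
  V2 = c(14.8, 14.6, 14.1, 13.9, 15.0),
  V3 = c(0.871, 0.881, 0.905, 0.895, 0.903),
  V4 = c(5.76, 5.55, 5.29, 5.32, 5.66),
  V5 = c(3.31, 3.33, 3.34, 3.38, 3.56),
  V6 = c(2.22, 1.02, 2.70, 2.26, 1.35),
  V7 = c(5.22, 4.96, 4.83, 4.80, 5.17),
  V8 = c(1L, 1L, 1L, 1L, 1L)
)

# new_names is a hypothetical helper variable, not part of the original code.
new_names <- c("area", "perimeter", "compactness", "length",
               "width", "asymmetry", "length2", "seedType")

# Stop early if the number of names does not match the number of columns.
stopifnot(length(new_names) == ncol(seeds))
colnames(seeds) <- new_names

print(colnames(seeds))
```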

Next, change the attribute that has categorical values, seedType, from an integer to a factor:

> seeds$seedType <- factor(seeds$seedType)
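
The conversion matters because many R modeling functions treat a numeric label as a quantity to predict (regression) and a factor label as a class to predict (classification). As a quick check of what factor does, here is a minimal sketch on a hypothetical label vector; the variable names and the values in it are made up for illustration:

```r
# Hypothetical integer labels standing in for the three seed types.
seedType <- c(1L, 1L, 2L, 3L, 3L, 2L)

# Convert the integer codes to a factor, as done for seeds$seedType.
seedTypeFactor <- factor(seedType)

print(is.factor(seedTypeFactor))  # TRUE
print(levels(seedTypeFactor))     # "1" "2" "3"
print(table(seedTypeFactor))      # number of rows in each class
```

Running is.factor and levels on seeds$seedType after the conversion gives the same kind of confirmation on the real data.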

This command finishes the preparation of the data for the modeling process. The following is a view of the structure after the data-preparation process:

> str(seeds)
'data.frame':   210 obs. of 8 variables:
 $ area       : num  15.3 14.9 14.3 13.8 16.1 ...
 $ perimeter  : num  14.8 14.6 14.1 13.9 15 ...
 $ compactness: num  0.871 0.881 0.905 0.895 0.903 ...
 $ length     : num  5.76 5.55 5.29 5.32 5.66 ...
 $ width      : num  3.31 3.33 3.34 3.38 3.56 ...
 $ asymmetry  : num  2.22 1.02 2.7 2.26 1.35 ...
 $ length2    : num  5.22 4.96 4.83 4.8 5.17 ...
 $ seedType   : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...