How to Load the Data in an R Classification Predictive Analytics Model

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

The dataset analyzed here is the Seeds dataset, which is available from the UCI Machine Learning Repository. It has 210 observations, each with 7 attributes plus a label. The label is the expected outcome; it is used to train the predictive model and to evaluate its accuracy.

The outcome you’re trying to predict is the seed type (attribute 8), given the values of the seven attributes. The three possible seed types are labeled 1, 2, and 3, and represent the Kama, Rosa, and Canadian varieties of wheat.

The attributes, in the column order in which they are provided, are:

  1. area

  2. perimeter

  3. compactness

  4. length of kernel

  5. width of kernel

  6. asymmetry coefficient

  7. length of kernel groove

  8. class of wheat

To get the dataset from the UCI repository and load it into memory, type the following command into the console:

> seeds <-
read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", header=FALSE, sep="\t", as.is=TRUE)
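A tab-separated read treats every single tab as a field boundary, so a line that happens to contain two tabs in a row produces an empty field and shifts the remaining values. The sketch below illustrates the difference between that behavior and read.table() with its default separator, which treats any run of whitespace as one boundary. The two sample rows and the doubled tab are illustrative assumptions, not a claim about the actual file; check the real data with anyNA() before deciding which reader to use.

```r
# Illustrative sample only: two rows in the 8-column layout,
# the first with a doubled tab between the second and third fields.
sample_text <- paste(
  "15.26\t14.84\t\t0.8710\t5.763\t3.312\t2.221\t5.220\t1",
  "14.88\t14.57\t0.8811\t5.554\t3.333\t1.018\t4.956\t1",
  sep = "\n"
)

# sep = "\t": each tab is a boundary, so the doubled tab yields an empty field.
strict <- read.csv(text = sample_text, header = FALSE, sep = "\t", as.is = TRUE)

# Default sep = "": any run of spaces or tabs counts as a single separator.
lenient <- read.table(text = sample_text, header = FALSE, as.is = TRUE)

anyNA(strict)    # TRUE here: the empty field became a missing value
anyNA(lenient)   # FALSE here: every row has exactly eight values
ncol(lenient)    # 8
```

If anyNA(seeds) returns TRUE after the load, rereading the file with read.table() is one option to try.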

You can confirm that the dataset was loaded into memory as the data frame variable seeds by looking at the workspace pane (top right). Click the seeds variable to see the data values in the source pane (top left). This is how the data looks in the source pane:

[Figure: the seeds data frame displayed in the source pane]
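The same check can be made from the console instead of the panes. The following is a minimal sketch that uses a three-row stand-in for the data frame so that it runs without a network connection; the sample values are illustrative only, and with the real data loaded you would skip the first statement.

```r
# Illustrative stand-in for the loaded data: three sample rows in the
# same 8-column layout (seven attributes plus the class label).
seeds <- read.table(text = "
15.26 14.84 0.8710 5.763 3.312 2.221 5.220 1
17.63 15.98 0.8673 6.191 3.561 4.076 6.060 2
13.07 13.92 0.8480 5.472 2.994 5.304 5.395 3
", header = FALSE)

dim(seeds)    # number of rows and columns
str(seeds)    # storage type of each column
head(seeds)   # the first rows (up to six)
```

For the full dataset, dim(seeds) should report 210 rows and 8 columns if the load went as described.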

You can find more information about the data you just loaded by using the summary() function.

> summary(seeds)  
      V1              V2              V3   
Min.   :10.59   Min.   :12.41   Min.   :0.8081 
1st Qu.:12.27   1st Qu.:13.45   1st Qu.:0.8569 
Median :14.36   Median :14.32   Median :0.8734 
Mean   :14.85   Mean   :14.56   Mean   :0.8710 
3rd Qu.:17.30   3rd Qu.:15.71   3rd Qu.:0.8878 
Max.   :21.18   Max.   :17.25   Max.   :0.9183 
...
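The names V1, V2, and so on are defaults that R assigns because the file was read with no header row. As a sketch, the columns can be renamed to match the attribute list, and the label can be converted to a factor so that R treats it as a category rather than a number. The column spellings below (kernel_length, groove_length, and so on) are illustrative choices, and the three sample rows are a stand-in for the real data.

```r
# Illustrative stand-in for the loaded data (three sample rows).
seeds <- read.table(text = "
15.26 14.84 0.8710 5.763 3.312 2.221 5.220 1
17.63 15.98 0.8673 6.191 3.561 4.076 6.060 2
13.07 13.92 0.8480 5.472 2.994 5.304 5.395 3
", header = FALSE)

# Rename the columns in the order the attributes are provided.
names(seeds) <- c("area", "perimeter", "compactness", "kernel_length",
                  "kernel_width", "asymmetry", "groove_length", "class")

# Map the numeric labels 1, 2, 3 to the wheat varieties they represent.
seeds$class <- factor(seeds$class, levels = c(1, 2, 3),
                      labels = c("Kama", "Rosa", "Canadian"))

summary(seeds$area)   # same statistics as before, under a readable name
table(seeds$class)    # number of observations per variety
```

With the label stored as a factor, summary(seeds) reports counts per variety for that column instead of a mean and quartiles.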