How to Load the Data in an R Classification Predictive Analytics Model
The dataset used for this prediction example is the Seeds dataset, which can be found at the UCI machine-learning repository. This dataset has 210 observations, each with 7 attributes plus the label. The label is the expected outcome; it is used to train the predictive model and to evaluate its accuracy.
The outcome that you're trying to predict is the seed type (attribute 8), given the values of the seven attributes. The three possible values for the seed type are labeled 1, 2, and 3, and represent the Kama, Rosa, and Canadian varieties of wheat, respectively.
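Because the labels 1, 2, and 3 are codes rather than quantities, it can help to treat them as categories in R. The following is a minimal sketch, assuming a hypothetical vector `labels` that stands in for attribute 8; the name `variety` is also just an illustration.

```r
# Hypothetical label codes standing in for attribute 8 of the dataset
labels <- c(1, 1, 2, 3, 3, 2)

# Map the numeric codes to the variety names given in the text
variety <- factor(labels, levels = c(1, 2, 3),
                  labels = c("Kama", "Rosa", "Canadian"))

print(variety)
print(table(variety))  # number of observations per variety
```

Storing the label as a factor signals to R that the values are classes, so they are not treated as ordinary numbers.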
The attributes, in the column order in which they are provided, are:
area
perimeter
compactness
length of kernel
width of kernel
asymmetry coefficient
length of kernel groove
class of wheat
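The file itself carries no header row, so R assigns generic column names (V1 through V8) on load. One way to make the columns easier to read is to name them yourself. This is a sketch only: `seeds_demo` is a stand-in data frame with illustrative values, and the column names are hypothetical choices that follow the column order of the UCI dataset description.

```r
# Stand-in data frame with 8 unnamed columns, shaped like the loaded data
# (the values are illustrative, not taken from the real file)
seeds_demo <- as.data.frame(matrix(c(15.3, 14.8, 0.87, 5.8, 3.3, 2.2, 5.2, 1,
                                     17.6, 15.9, 0.88, 6.0, 3.9, 3.0, 5.9, 2),
                                   nrow = 2, byrow = TRUE))

# Hypothetical column names, one per attribute plus the label
names(seeds_demo) <- c("area", "perimeter", "compactness", "kernel_length",
                       "kernel_width", "asymmetry", "groove_length", "class")

print(names(seeds_demo))
```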
To get the dataset from the UCI repository and load it into memory, type the following command into the console:
> seeds <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", header=FALSE, sep="", as.is=TRUE)
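To see what the header and sep arguments do without touching the network, you can try the same call on a small in-memory sample. This sketch assumes the file is whitespace-delimited and that a row might contain a repeated tab; the two rows in `raw` are made up for the demonstration, and `raw` and `demo` are hypothetical names.

```r
# Two illustrative rows shaped like the file: 8 values separated by tabs,
# with a doubled tab in the second row (values are made up for the demo)
raw <- paste0("15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1\n",
              "14.88\t14.57\t\t0.8811\t5.554\t3.333\t1.018\t4.956\t1\n")

# header = FALSE: the first row is data, so R names the columns V1..V8
# sep = "": split on any run of whitespace, so the doubled tab is harmless
demo <- read.csv(text = raw, header = FALSE, sep = "", as.is = TRUE)

print(dim(demo))    # rows, then columns
print(names(demo))
```

With an explicit single-tab separator, a doubled tab would be read as an empty field, which is why whitespace splitting is the more forgiving choice here.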
You can confirm that the dataset was loaded into memory as the data frame variable seeds by looking at the workspace pane (top-right). Click the seeds variable to see the data values in the source pane (top-left).
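If you prefer the console to the panes, the same check can be sketched with base R functions. Here `seeds_demo` is a hypothetical stand-in with illustrative values; on your machine you would pass the seeds data frame instead.

```r
# Stand-in for the loaded data frame (illustrative values only)
seeds_demo <- data.frame(V1 = c(15.3, 17.6, 11.2),
                         V2 = c(14.8, 15.9, 13.1),
                         V8 = c(1L, 2L, 3L))

print(class(seeds_demo))    # type of object
print(dim(seeds_demo))      # rows, then columns
str(seeds_demo)             # column types and first values
print(head(seeds_demo, 2))  # first two rows
```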
You can find more information about the data you just loaded by using the summary() function.
> summary(seeds)
       V1              V2              V3
 Min.   :10.59   Min.   :12.41   Min.   :0.8081
 1st Qu.:12.27   1st Qu.:13.45   1st Qu.:0.8569
 Median :14.36   Median :14.32   Median :0.8734
 Mean   :14.85   Mean   :14.56   Mean   :0.8710
 3rd Qu.:17.30   3rd Qu.:15.71   3rd Qu.:0.8878
 Max.   :21.18   Max.   :17.25   Max.   :0.9183
...
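Because the label column is stored as a number, summary() reports quartiles for it, which say little about class codes. Counting the observations per class is usually more informative. A minimal sketch, assuming a hypothetical stand-in `seeds_demo` with illustrative values:

```r
# Stand-in with one attribute column and the label column (illustrative values)
seeds_demo <- data.frame(V1 = c(15.3, 17.6, 11.2, 14.1),
                         V8 = c(1L, 2L, 3L, 1L))

print(summary(seeds_demo$V1))  # summary statistics for a single column
print(table(seeds_demo$V8))    # number of observations per class label
```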