How to Introduce the Data in R Regression for Predictive Analytics
The dataset you will use in this example is the Auto-MPG dataset, which can be found in the UCI repository. This dataset has 398 observations and 8 attributes plus the label.
The label is the expected outcome; it’s used to train and evaluate the accuracy of the predictive model. The outcome that we’re trying to predict is the expected mpg (attribute 1) of an automobile when given the values of the eight attributes.
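Here are the attributes in the column order in which they are provided:
1. mpg
2. cylinders
3. displacement
4. horsepower
5. weight
6. acceleration
7. model year
8. origin
9. car name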
To get the dataset from the UCI repository and load it into memory, type the following command into the console:
> autos <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
header=FALSE, sep="", as.is=TRUE)
You’ll see that the dataset was loaded into memory as the data frame variable autos, by looking at your workspace pane (the top-right pane). Click the autos variable to see the data values in the source pane (the top-left pane).
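Because the file is read with header=FALSE, R assigns the default column names V1 through V9. The following is a minimal sketch of giving the columns readable names. It assumes two illustrative rows written in the same whitespace-separated layout as auto-mpg.data, inlined so the code runs without a network connection. The data frame name autos_sample and the column labels are our own choices, not something the dataset file provides.

```r
# Two illustrative rows in the same whitespace-separated layout as auto-mpg.data,
# inlined so the sketch runs offline (assumption: stand-in for the real file).
sample_lines <- c(
  '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"',
  '15.0   8   350.0      165.0      3693.      11.5   70  1\t"buick skylark 320"'
)

# Same arguments as the download command: no header row, whitespace separator,
# and strings kept as character rather than converted to factors.
autos_sample <- read.csv(text = sample_lines, header = FALSE, sep = "", as.is = TRUE)

# Column labels are our own (an assumption), listed in the dataset's documented order.
colnames(autos_sample) <- c("mpg", "cylinders", "displacement", "horsepower",
                            "weight", "acceleration", "model_year", "origin",
                            "car_name")

# Show each column's name and type.
str(autos_sample)
```

The same colnames assignment should work on the full autos data frame once it has been loaded, since it also has nine columns.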
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
The head and tail functions are handy when you just want to see the first and last five rows of the data. This is also a quick way to verify that you loaded the correct file and that it was read correctly. The summary function gives you basic statistics on each column of the data.
You can copy and paste the following three lines of code into the source pane and have the output shown in the console:
head(autos,5)
tail(autos,5)
summary(autos)
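A programmatic check can complement the visual inspection. The sketch below uses mtcars, a data frame built into R, as a stand-in so that it runs without downloading anything. The helper check_shape is hypothetical (it is not part of base R) and simply stops with an error if the row or column count is not what you expect.

```r
# Hypothetical helper (not part of base R): stop with an error if the
# data frame does not have the expected number of rows and columns.
check_shape <- function(df, n_rows, n_cols) {
  stopifnot(is.data.frame(df), nrow(df) == n_rows, ncol(df) == n_cols)
  invisible(TRUE)
}

# mtcars is a built-in data frame with 32 rows and 11 columns,
# used here only as a stand-in for autos.
check_shape(mtcars, 32, 11)

nrow(head(mtcars))     # head() returns 6 rows when n is omitted
nrow(head(mtcars, 5))  # passing 5 limits the preview to five rows
sapply(mtcars, class)  # column types, a compact companion to summary()
```

With the Auto-MPG data loaded, the equivalent call would be check_shape(autos, 398, 9), based on the 398 observations and nine columns described earlier.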