How to Download a UCI Dataset for R Programming
Many (but not all) of the UCI datasets you will use in R programming are in comma-separated value (CSV) format: The data are in text files with a comma between successive values. A typical line in this kind of file looks like this:
5.1,3.5,1.4,0.2,Iris-setosa
This is the first line from a well-known dataset called iris. The rows are measurements of 150 iris flowers, 50 each of three species of iris. The species are called setosa, versicolor, and virginica. Each row holds sepal length, sepal width, petal length, petal width, and species. One typical ML project is to develop a mechanism that can learn to use an individual flower’s measurements to identify that flower’s species.
What’s a sepal? On a plant that’s in bloom, a sepal supports a petal. On an iris, sepals look something like larger petals underneath the actual petals. In that first line of the dataset, notice that the first two values (sepal length and width) are larger than the second two (petal length and width).
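To see how R treats one comma-separated line, here is a minimal sketch that parses a record from an in-memory string rather than a file. The string is the first record of iris.data; the variable names (line, row, sepal, petal) are illustrative only.

```r
# One CSV record, held in a string instead of a file.
line <- "5.1,3.5,1.4,0.2,Iris-setosa"

# text= lets read.csv() parse a string; header=FALSE treats the line as data.
row <- read.csv(text = line, header = FALSE)

# read.csv() names unnamed columns V1, V2, ... by default.
print(row)

# Sepal length and width (V1, V2) versus petal length and width (V3, V4).
sepal <- c(row$V1, row$V2)
petal <- c(row$V3, row$V4)

# TRUE when both sepal values exceed both petal values.
print(min(sepal) > max(petal))
```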
You can find iris in numerous places, including the datasets package in base R. The point of this exercise, however, is to show you how to get and use a dataset from UCI.
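For comparison, a quick look at the copy that ships with R might go like this. The sketch uses only the built-in iris data frame, whose columns are named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

```r
# iris ships with the datasets package, which R attaches by default.
data(iris)

# Rows (flowers) and columns (four measurements plus species).
print(dim(iris))

# Number of flowers recorded for each species.
print(table(iris$Species))

# First few rows, handy for comparing against the UCI file later.
print(head(iris, 3))
```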
Go to the UCI ML repository to retrieve the data.
Click on the Data Set Description link. This opens a page of valuable information about the data set, including source material, publications that use the data, column names, and more. In this case, this page is particularly valuable because it tells you about some errors in the data.
Returning to the previous page, click on the Data Folder link. On the page that opens, click the iris.data link. This opens the page that holds the dataset in CSV format.
To download the dataset, you use the read.csv() function. You can do this in several ways. To accomplish everything at once (that is, to use just one function call to read the file into R as a data frame complete with column names), use this code:
iris.uci <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header=FALSE, col.names = c("sepal.length","sepal.width","petal.length","petal.width", "species"))
The first argument is the web address of the dataset. The second indicates that the first row of the dataset is a row of data and does not provide the names of the columns. The third argument is a vector that assigns the column names. The column names come from the Data Set Description web page. That page gives class as the name for the last column, but species seems the better fit. (And that’s the name in the iris dataset in the datasets package.)
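To see what header=FALSE and col.names actually change, here is a small offline sketch. It assumes a two-record string (sample_text, an illustrative name) in the same layout as iris.data in place of the download, and reads it once with the default header setting and once with the arguments described above.

```r
# Two records in the iris.data layout, held in a string so the
# sketch runs without a network connection.
sample_text <- "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa"

# Default header=TRUE: the first record is consumed as column names.
with_header <- read.csv(text = sample_text)

# header=FALSE plus col.names: both records are kept and the columns are named.
no_header <- read.csv(text = sample_text, header = FALSE,
                      col.names = c("sepal.length", "sepal.width",
                                    "petal.length", "petal.width", "species"))

print(nrow(with_header))   # rows left when the first record becomes the header
print(nrow(no_header))     # rows left when every record is treated as data
print(names(no_header))
```

Leaving out header=FALSE would silently cost you the first flower in the file, which is why the argument matters here.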
If you think that’s a little too much to put in one function, here’s another way:
iris.uci <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header=FALSE)
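Read this way, the columns get R's default names, V1 through V5, so naming them becomes a separate step. A minimal sketch of that step follows, using a two-record inline sample (iris.sample and sample_text are illustrative names) in place of the download so that it runs offline. The same colnames() assignment should work on iris.uci.

```r
# Stand-in for the downloaded file: two records in the iris.data layout.
sample_text <- "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa"

# Step 1: read the data without column names.
iris.sample <- read.csv(text = sample_text, header = FALSE)
print(names(iris.sample))   # default names assigned by read.csv()

# Step 2: assign the column names afterward.
colnames(iris.sample) <- c("sepal.length", "sepal.width",
                           "petal.length", "petal.width", "species")
print(names(iris.sample))
```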
You can do this still another way. With the dataset web page open, you press Ctrl+A to select everything on the page, and you press Ctrl+C to put all the data on the clipboard. Then
iris.uci <- read.csv("clipboard", header=FALSE, col.names= c("sepal.length","sepal.width","petal.length","petal.width","species"))
gets the job done. This way, you don’t have to deal with the web address.