R Project: Identifying Mushrooms

By Joseph Schmuller

Give this project a try to test out your R skills. If you’re the outdoorsy type, you’ve probably encountered mushrooms growing in the wild. As you might know, some mushrooms are edible, and others most definitely are not!

The UCI ML repository has a dataset of mushrooms with 8,124 instances and 22 attributes. The target variable, which brings the total to 23 columns, indicates whether the mushroom is edible (e) or poisonous (p).

You create an R data frame by navigating to the Data Folder, finding the .csv data file, and then pressing Ctrl+A to select all the data and Ctrl+C to copy it to the clipboard. The file has no header row, so set header=FALSE. This line does the trick:

mushroom.uci <- read.csv("clipboard", header=FALSE)

A word of advice: The attribute names are long and involved, so for this project only, don’t bother naming the columns unless you really and truly want to. Instead, use the default V1, V2, and so on that R provides. Also, and this is important, after you put the data into Rattle, you’ll see that Rattle makes a guess about the target variable. Its guess, V23, is wrong. The real target variable is V1. So click the appropriate radio buttons to make the changes.
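
If you’d like to see what those default names look like before you open Rattle, here’s a minimal sketch in base R. The rows in sample.lines are made up for illustration and have only 4 columns rather than 23, and the names sample.lines and mushroom.sample are hypothetical:

```r
# A few made-up rows in the same comma-separated, header-less layout
# as the mushroom file (hypothetical values, 4 columns instead of 23).
sample.lines <- "p,x,s,n
e,x,s,y
e,b,s,w
p,x,y,w"

# header = FALSE tells R the first row is data, not column names.
# stringsAsFactors = TRUE turns every column into a factor.
mushroom.sample <- read.csv(text = sample.lines, header = FALSE,
                            stringsAsFactors = TRUE)

names(mushroom.sample)     # the default column names R assigns
table(mushroom.sample$V1)  # counts of edible (e) and poisonous (p)
```

Running the same two checks on mushroom.uci is a quick way to confirm that V1 really holds the e and p labels.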

Finally, unlike the datasets you may have used before, this one has missing values. They’re all in V12 (2,480 of them), denoted by a question mark. To deal with this, select the Rattle Transform tab and click the radio button for Impute and the radio button for Zero/Missing. Click V12 and then Execute. This substitutes Missing for the question mark. (Spoiler alert: With this data frame, it doesn’t make much difference whether you do this or not.)
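
If you’re curious about what that substitution amounts to, here’s a rough base R sketch of the same idea. It isn’t Rattle’s own code, and mushroom.toy is a hypothetical five-row stand-in for the real data frame:

```r
# Hypothetical stand-in for the real data frame: V1 is the target,
# and V12 is the column that contains question marks.
mushroom.toy <- data.frame(
  V1  = c("e", "p", "e", "p", "e"),
  V12 = c("b", "?", "c", "?", "e"),
  stringsAsFactors = FALSE
)

# Count the question marks in every column
question.marks <- colSums(mushroom.toy == "?")
question.marks

# Replace each question mark in V12 with the label "Missing",
# then store the column as a factor
mushroom.toy$V12[mushroom.toy$V12 == "?"] <- "Missing"
mushroom.toy$V12 <- factor(mushroom.toy$V12)

table(mushroom.toy$V12)
```

Applied to mushroom.uci, the colSums() line is one way to check that the question marks really are confined to V12.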

When you create the random forest, you should have a confusion matrix with just two rows and two columns: one of each for edible (e) and poisonous (p). You’ll be pleasantly surprised by the OOB (out-of-bag) error rate!
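
To see where a forest’s confusion matrix and OOB error rate live outside Rattle, here’s a minimal sketch that calls the randomForest package directly. It assumes that package is installed. The data frame toy is made up (its target depends entirely on V2), and the seed is arbitrary, so the numbers it produces say nothing about the mushroom data:

```r
library(randomForest)

set.seed(810)  # arbitrary seed, only so that reruns match

# Hypothetical toy data: V1 is the two-class target (e or p),
# V2 and V3 are categorical attributes.
V2 <- factor(rep(c("a", "b", "c"), length.out = 150))
V3 <- factor(rep(c("x", "y"), length.out = 150))
V1 <- factor(ifelse(V2 == "a", "p", "e"))
toy <- data.frame(V1, V2, V3)

# Grow the forest with V1 as the target and every other column as a predictor
toy.forest <- randomForest(V1 ~ ., data = toy, ntree = 500)

# OOB confusion matrix: one row per actual class, one column per
# predicted class, plus a class.error column
toy.forest$confusion

# OOB error rate after the final tree
oob.error <- toy.forest$err.rate[nrow(toy.forest$err.rate), "OOB"]
oob.error
```

With the real data, the formula stays the same as long as V1 is a factor and is the target.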