# R Projects For Dummies

**Published:** 02-13-2018

**Make the most of R’s extensive toolset**

*R Projects For Dummies* offers a unique learn-by-doing approach. You will increase the depth and breadth of your R skillset by completing a wide variety of projects. By using R’s graphics, interactive, and machine learning tools, you’ll learn to apply R’s extensive capabilities in an array of scenarios. The depth of the project experience is unmatched by any other content online or in print. And you just might increase your statistics knowledge along the way, too!

R is a free tool, and it's the basis of a huge amount of work in data science. It's taking the place of costly statistical software that can take a long time to learn. One reason is that you can use just a few R commands to create sophisticated analyses. Another is that easy-to-learn R graphics enable you to make the results of those analyses available to a wide audience.

This book will help you sharpen your skills by applying them in the context of projects with R, including dashboards, image processing, data reduction, mapping, and more.

- Appropriate for R users at all levels
- Helps R programmers plan and complete their own projects
- Focuses on R functions and packages
- Shows how to carry out complex analyses by just entering a few commands

If you’re brand new to R or just want to brush up on your skills, *R Projects For Dummies* will help you complete your projects with ease.

## Articles From R Projects For Dummies


Cheat Sheet / Updated 05-02-2022

To complete any project using R, you work with functions that live in packages designed for specific areas. This cheat sheet provides some information about these functions.

Article / Updated 04-11-2018

If you've been working with images, animated images, and combined stationary images in R, it may be time to take the next step: combining an image with an animated image. The end product of this project is the plot of the iris data set with comedy icons Laurel and Hardy positioned in front of the plot legend. When you open this combined image in the Viewer, you see Stan and Ollie dancing their little derbies off. (The derbies don't actually come off in the animation, but you get the drift.)

**Getting Stan and Ollie**

Check out the Laurel and Hardy GIF. Right-click the image and select Save Image As from the pop-up menu that appears. Save it as animated-dancing-image-0243 in your Documents folder. Then read it into R:

```
l_and_h <- image_read("animated-dancing-image-0243.gif")
```

Applying the length() function to l_and_h

```
> length(l_and_h)
[1] 10
```

indicates that this GIF consists of ten frames. To add a coolness factor, make the background of the GIF transparent before image_read() works with it. This free online image editor does the job quite nicely.

**Combining the boys with the background**

If you use the image combination technique, the code looks like this:

```
image_composite(image=background, composite_image=l_and_h, offset = "+510+200")
```

The picture it produces looks like the image above, but with one problem: The boys aren't dancing. Why is that? The reason is that image_composite() combined the background with just the first frame of l_and_h, not with all ten. It's exactly the same as if you had run

```
image_composite(image=background, composite_image=l_and_h[1], offset = "+510+200")
```

The length() function verifies this:

```
> length(image_composite(image=background, composite_image=l_and_h, offset = "+510+200"))
[1] 1
```

If all ten frames were involved, the length() function would have returned 10. To get this done properly, you have to use a magick function called image_apply().
**Explaining image_apply()**

So that you fully understand how this important function works, consider an analogous function called lapply(). If you want to apply a function (like mean()) to the variables of a data frame, like iris, one way to do that is with a for loop: Start with the first column and calculate its mean, go to the next column and calculate its mean, and so on until you've calculated all the column means. For technical reasons, it's faster and more efficient to use lapply() to apply mean() to all the variables:

```
> lapply(iris, mean)
$Sepal.Length
[1] 5.843333

$Sepal.Width
[1] 3.057333

$Petal.Length
[1] 3.758

$Petal.Width
[1] 1.199333

$Species
[1] NA
```

A warning message comes with that last one, but that's okay. Another way to write lapply(iris, mean) is lapply(iris, function(x){mean(x)}). This second way comes in handy when the function becomes more complicated. If, for some reason, you want to square the value of each score in the data set, multiply the result by three, and then calculate the mean of each column, here's how to code it:

```
lapply(iris, function(x){mean(3*(x^2))})
```

In a similar way, image_apply() applies a function to every frame in an animated GIF.
In this project, the function that gets applied to every frame is image_composite():

```
function(frame){image_composite(image=background, composite_image=frame, offset = "+510+200")}
```

So, within image_apply(), that's

```
frames <- image_apply(image=l_and_h, function(frame) {
  image_composite(image=background, composite_image=frame, offset = "+510+200")
})
```

After you run that code, length(frames) verifies the ten frames:

```
> length(frames)
[1] 10
```

**Getting back to the animation**

The image_animate() function puts it all in motion at ten frames per second:

```
animation <- image_animate(frames, fps = 10)
```

To put the show on the screen, it's

```
print(animation)
```

All together now:

```
l_and_h <- image_read("animated-dancing-image-0243.gif")
background <- image_background(iris_plot, "white")
frames <- image_apply(image=l_and_h, function(frame) {
  image_composite(image=background, composite_image=frame, offset = "+510+200")
})
animation <- image_animate(frames, fps = 10)
print(animation)
```

And that's the code for the image above. One more thing: The image_write() function saves the animation as a handy little reusable GIF:

```
image_write(animation, "LHirises.gif")
```

Article / Updated 04-11-2018

Here, you learn about books and websites that help you learn more about R programming. Without further ado. . .

**Interacting with users**

If you want to delve deeper into R applications that interact with users, start with this tutorial by shiny guiding force Garrett Grolemund. For a helpful book on the subject, consider Chris Beeley's Web Application Development with R Using Shiny, 2nd Edition (Packt Publishing, 2016).

**Machine learning**

For the lowdown on all things Rattle, go directly to the source: Rattle creator Graham Williams has written Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery (Springer, 2011). Check out the companion website.

The University of California-Irvine Machine Learning Repository plays a huge role in the R programming world. Here's how its creator prefers that you cite the material: Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Thank you, UCI Anteaters!

If machine learning interests you, take a comprehensive look at the field (under its other name, "statistical learning"): Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani's An Introduction to Statistical Learning with Applications in R (Springer, 2017).

An Introduction to Neural Networks, by Ben Krose and Patrick van der Smagt, is a little dated, but you can get it for the low, low price of nothing. After you download a large PDF, it's a good idea to upload it into an ebook app, like Google Play Books. That turns the PDF into an ebook and makes it easier to navigate on a tablet.

**Databases**

The R-bloggers website has a nice article on working with databases. Of course, R-bloggers has terrific articles on a lot of R-related topics! You can learn quite a bit about RFM (Recency Frequency Money) analysis and customer segmentation at www.putler.com/rfm-analysis.

**Maps and images**

The area of maps is a fascinating one.
You might be interested in something at a higher level. If so, read Introduction to Visualising Spatial Data in R, by Robin Lovelace, James Cheshire, and Rachel Oldroyd (and others). David Kahle and Hadley Wickham's ggmap: Spatial Visualization with ggplot2 is also at a higher level. Fascinated by magick? The best place to go is the primary source. Check it out.

Article / Updated 04-11-2018

Try out this R project to see how one variable might affect an outcome. It's conceivable that weather conditions could influence flight delays. How do you incorporate weather information into the assessment of delay? One nycflights13 data frame called weather provides the weather data for every day and hour at each of the three origin airports. Here's a glimpse of exactly what it has:

```
> glimpse(weather,60)
Observations: 26,130
Variables: 15
$ origin     "EWR", "EWR", "EWR", "EWR", "EWR", "...
$ year       2013, 2013, 2013, 2013, 2013, 2013, ...
$ month      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ day        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ hour       0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 1...
$ temp       37.04, 37.04, 37.94, 37.94, 37.94, 3...
$ dewp       21.92, 21.92, 21.92, 23.00, 24.08, 2...
$ humid      53.97, 53.97, 52.09, 54.51, 57.04, 5...
$ wind_dir   230, 230, 230, 230, 240, 270, 250, 2...
$ wind_speed 10.35702, 13.80936, 12.65858, 13.809...
$ wind_gust  11.918651, 15.891535, 14.567241, 15....
$ precip     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ pressure   1013.9, 1013.0, 1012.6, 1012.7, 1012...
$ visib      10, 10, 10, 10, 10, 10, 10, 10, 10, ...
$ time_hour  2012-12-31 19:00:00, 2012-12-31 20:...
```

So the variables it has in common with flites_day are the first six and the last one. To join the two data frames, use this code:

```
flites_day_weather <- flites_day %>%
  inner_join(weather, by = c("origin","year","month","day","hour","time_hour"))
```

Now you can use flites_day_weather to start answering questions about departure delay and the weather. What questions will you ask? How will you answer them? What plots will you draw? What regression lines will you create? Will scale() help? And, when you're all done, take a look at arrival delay (arr_delay).
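As a starting point, here's a minimal sketch of one such question: Does departure delay rise with wind speed? The sketch assumes flites_day has been built earlier (here it's stood in for by a hypothetical filter on January flights); dep_delay and wind_speed are real nycflights13 variables.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Hypothetical stand-in for the flites_day frame built earlier in the book
flites_day <- flights %>% filter(month == 1)

flites_day_weather <- flites_day %>%
  inner_join(weather, by = c("origin","year","month","day","hour","time_hour"))

# Scatterplot with a fitted regression line
ggplot(flites_day_weather, aes(x = wind_speed, y = dep_delay)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")

# A quick linear model for the same question
summary(lm(dep_delay ~ wind_speed, data = flites_day_weather))
```

From here you could swap in precip, visib, or pressure as predictors, or repeat the exercise with arr_delay.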

Article / Updated 04-11-2018

If you're interested in trying out your RFM analysis skills on another set of data, this R project is for you. The CDNOW data set consists of almost 70,000 rows. It's a record of sales at CDNOW from the beginning of January 1997 through the end of June 1998.

Press Ctrl+A to highlight all the data, and press Ctrl+C to copy it to the clipboard. Then use the read.csv() function to read the data into R:

```
cdNOW <- read.csv("clipboard", header=FALSE, sep = "")
```

Here's how to name the columns:

```
colnames(cdNOW) <- c("CustomerID","InvoiceDate","Quantity","Amount")
```

The data should look like this:

```
> head(cdNOW)
  CustomerID InvoiceDate Quantity Amount
1          1    19970101        1  11.77
2          2    19970112        1  12.00
3          2    19970112        5  77.00
4          3    19970102        2  20.76
5          3    19970330        2  20.76
6          3    19970402        2  19.54
```

It's less complicated than the Online Retail project because Amount is the total amount of the transaction. So each row is a transaction, and aggregation is not necessary. The Quantity column is irrelevant for our purposes.

Here's a hint about reformatting InvoiceDate: The easiest way to get it into R date format is to download and install the lubridate package and use its ymd() function:

```
cdNOW$InvoiceDate <- ymd(cdNOW$InvoiceDate)
```

After that change, here's how the first six rows look:

```
> head(cdNOW)
  CustomerID InvoiceDate Quantity Amount
1          1  1997-01-01        1  11.77
2          2  1997-01-12        1  12.00
3          2  1997-01-12        5  77.00
4          3  1997-01-02        2  20.76
5          3  1997-03-30        2  20.76
6          3  1997-04-02        2  19.54
```

Almost there. What's missing for findRFM()? An invoice number. So you have to use a little trick to make one up: Use each row identifier in the row-identifier column as the invoice number.
To turn the row-identifier column into a data frame column, download and install the tibble package and use its rownames_to_column() function:

```
cdNOW <- rownames_to_column(cdNOW, "InvoiceNumber")
```

Here's the data:

```
> head(cdNOW)
  InvoiceNumber CustomerID InvoiceDate Quantity Amount
1             1          1  1997-01-01        1  11.77
2             2          2  1997-01-12        1  12.00
3             3          2  1997-01-12        5  77.00
4             4          3  1997-01-02        2  20.76
5             5          3  1997-03-30        2  20.76
6             6          3  1997-04-02        2  19.54
```

Now create a data frame with everything but that Quantity column and you're ready. See how much of the Online Retail project you can accomplish in this one. Happy analyzing!
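Here's a sketch of that final data-prep step, plus one way to compute the RFM quantities yourself with dplyr rather than the book's findRFM(). The analysis-date choice (the day after the last transaction) and the column names Recency, Frequency, and Money are assumptions for illustration:

```r
library(dplyr)

# Drop the irrelevant Quantity column
cdNOW_rfm <- cdNOW %>% select(-Quantity)

# Recency is measured from the day after the last date in the data set
analysis_date <- max(cdNOW_rfm$InvoiceDate) + 1

# A do-it-yourself RFM summary, one row per customer
rfm_table <- cdNOW_rfm %>%
  group_by(CustomerID) %>%
  summarize(
    Recency   = as.numeric(analysis_date - max(InvoiceDate)),  # days since last purchase
    Frequency = n(),                                           # number of transactions
    Money     = sum(Amount)                                    # total amount spent
  )
```

A customer with low Recency, high Frequency, and high Money is the kind of customer the RFM framework flags as most valuable.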

Article / Updated 04-11-2018

One benefit of Rattle is that it allows you to easily experiment with whatever it helps you create with R. Here's a little project for you to try. You'll learn more about neural networks if you can see how the network error rate decreases with the number of iterations through the training set. So the objective is to plot the error rate for the banknote.uci network as a function of the number of iterations through the training data. You should expect to see a decline as the number of iterations increases.

The measure of error for this little project is root mean square error (RMSE), which is the standard deviation of the residuals. Each residual is the difference between the network's decision and the correct answer. You'll create a vector that holds the RMSE for each number of iterations and then plot the vector against the number of iterations. So the first line of code is

```
rmse <- NULL
```

Next, click the Rattle Log tab and scroll down to find the R code that creates the neural network:

```
crs$nnet <- nnet(as.factor(Class) ~ .,
    data=crs$dataset[crs$sample, c(crs$input, crs$target)],
    size=3, skip=TRUE, MaxNWts=10000, trace=FALSE, maxit=100)
```

The values in the data argument are based on Data tab selections. The skip argument allows for the possibility of creating skip layers (layers whose connections skip over the succeeding layer). The argument of most interest here is maxit, which specifies the maximum number of iterations.

Copy this code into RStudio. Set maxit to i, and put the code into a for loop in which i goes from 2 to 90. The residuals are stored in crs$nnet$residuals, so the RMSE is sd(crs$nnet$residuals). Use that to update rmse:

```
rmse <- append(rmse, sd(crs$nnet$residuals))
```

So the general outline for the for loop is

```
for (i in 2:90){
  crs$nnet <- # create the neural net with maxit=i
  # update the rmse vector
}
```

(This for loop might take a few more seconds to run than you're accustomed to.)
Finally, use the plot() function to plot RMSE on the y-axis and iterations on the x-axis:

```
plot(x=2:90, y=rmse, type="b", xlab="Iterations", ylab="Root Mean Square")
```

Your plot should look like this one.

Here's one more suggested project: Take another look at the code for creating crs$nnet. Does anything suggest itself as something of interest that relates to RMSE? Something you could vary in a for loop while holding maxit constant? And then plot RMSE against that thing? Go for it!
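Filling in that outline, the complete loop might look like the following sketch. It assumes you've already read the data in through Rattle, so that crs$dataset, crs$sample, crs$input, and crs$target exist in your session:

```r
library(nnet)

rmse <- NULL
for (i in 2:90) {
  # Re-train the network, capping training at i iterations
  crs$nnet <- nnet(as.factor(Class) ~ .,
      data=crs$dataset[crs$sample, c(crs$input, crs$target)],
      size=3, skip=TRUE, MaxNWts=10000, trace=FALSE, maxit=i)

  # RMSE for this run is the standard deviation of the residuals
  rmse <- append(rmse, sd(crs$nnet$residuals))
}

plot(x=2:90, y=rmse, type="b", xlab="Iterations", ylab="Root Mean Square")
```

Because nnet() starts each run from random weights, your curve will wobble a bit from run to run, but the overall downward trend should be clear.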

Article / Updated 04-11-2018

Discovering exactly how neurons process inputs and send messages has sometimes been the basis for winning the Nobel prize. Now, take a look at artificial neural networks to understand how machine learning works in R programming.

**Overview**

An ML neural network consists of simulated neurons, often called units, or nodes, that work with data. Like the neurons in the nervous system, each unit receives input, performs some computation, and passes its result as a message to the next unit. At the output end, the network makes a decision based on its inputs.

Imagine a neural network that uses physical measurements of flowers, like irises, to identify the flower's species. The network takes data like the petal length and petal width of an iris and learns to classify an iris as either setosa, versicolor, or virginica. In effect, the network learns the relationship between the inputs (the petal variables) and the outputs (the species).

The image below shows an artificial neural network that classifies irises. It consists of an input layer, a hidden layer, and an output layer. Each unit connects with every unit in the next layer. Numerical values called weights are on each connection. Weights can be positive or negative. To keep the image from getting cluttered, you see only the weights on the connections from the input layer to the hidden layer.

**Input layer and hidden layer**

The data points are represented in the input layer. This one has one input unit (I1) that holds the value of petal length and another (I2) that holds the value of petal width. The input units send messages to another layer of four units, called a hidden layer. The number of units in the hidden layer is arbitrary, and picking that number is part of the art of neural network creation.

Each message to a hidden layer unit is the product of a data point and a connection weight. For example, H1 receives I1 multiplied by w1 along with I2 multiplied by w2. H1 processes what it receives.
What does "processes what it receives" mean? H1 adds the product of I1 and w1 to the product of I2 and w2. H1 then has to send a message to O1, O2, and O3. What is the message it sends? It's a number in a restricted range, produced by H1's activation function.

Three activation functions are common. They have exotic, math-y names: hyperbolic tangent, sigmoid, and rectified linear unit. Without going into the math, here's what they do: The hyperbolic tangent (known as tanh) takes a number and turns it into a number between –1 and 1. Sigmoid turns its input into a number between 0 and 1. Rectified linear unit (ReLU) replaces negative values with 0.

By restricting the range of the output, activation functions set up a nonlinear relationship between the inputs and the outputs. Why is this important? In most real-world situations, you don't find a nice, neat linear relationship between what you try to predict (the output) and the data you use to predict it (the inputs).

One more item gets added into the activation function. It's called bias. Bias is a constant that the network adds to each number coming out of the units in a layer. The best way to think about bias is that it improves the network's accuracy. Bias is much like the intercept in a linear regression equation: Without the intercept, a regression line would pass through (0,0) and might miss many of the points it's supposed to fit.

To summarize: A hidden unit like H1 takes the data sent to it by I1 (petal length) and I2 (petal width), multiplies each one by the weight on its connection (I1 × w1 and I2 × w2), adds the products, adds the bias, and applies its activation function. Then it sends the result to all units in the output layer.

**Output layer**

The output layer consists of one unit (O1) for setosa, another (O2) for virginica, and another (O3) for versicolor. Based on the messages they receive from the hidden layer, the output units do their computations just as the hidden units do theirs.
Their results determine the network's decision about the species for the iris with the given petal length and petal width. The flow from input layer to hidden layer to output layer is called feedforward.

**How it all works**

Where do the interunit connection weights come from? They start out as numbers randomly assigned to the interunit connections. The network trains on a data set of petal lengths, petal widths, and the associated species. On each trial, the network receives a petal length and a petal width and makes a decision, which it then compares with the correct answer. Because the initial weights are random, the initial decisions are guesses.

Each time the network's decision is incorrect, the weights change based on how wrong the decision was (on the amount of error, in other words). The adjustment (which also includes changing the bias for each unit) constitutes "learning." One way of proceeding is to adjust the weights from the output layer back to the hidden layer and then from the hidden layer back to the input layer. This is called backpropagation because the amount of error "backpropagates" through the layers.

A network trains until it reaches a certain level of accuracy or a preset number of iterations through the training set. In the evaluation phase, the trained network tackles a new set of data.

This three-layer structure is just one way of building a neural network. Other types of networks are possible.
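To make the hidden-unit arithmetic concrete, here's a minimal sketch of what a unit like H1 computes. The weights, bias, and input values are made up for illustration; they are not from any trained network:

```r
# Activation functions: tanh() is built into R; define the other two
sigmoid <- function(x) { 1 / (1 + exp(-x)) }
relu    <- function(x) { pmax(0, x) }

# Illustrative inputs and parameters
I1 <- 4.7    # petal length
I2 <- 1.4    # petal width
w1 <- 0.8    # weight on the I1 -> H1 connection
w2 <- -0.5   # weight on the I2 -> H1 connection
bias <- 0.1  # constant added to the weighted sum

# H1's computation: weighted sum, plus bias, through an activation function
net_input <- I1 * w1 + I2 * w2 + bias
H1_output <- tanh(net_input)  # a number between -1 and 1
```

Swap tanh() for sigmoid() or relu() in the last line to see how each activation function restricts the range of the message H1 sends on to the output layer.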

Article / Updated 04-11-2018

To introduce k-means clustering for R programming, you start by working with the iris data frame that's in the base R installation. Fifty flowers in each of three iris species (setosa, versicolor, and virginica) make up the data set. The data frame columns are Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. For this discussion, you're concerned with only Petal.Length, Petal.Width, and Species. That way, you can visualize the data in two dimensions. The image below plots the iris data frame with Petal.Length on the x-axis, Petal.Width on the y-axis, and Species as the color of the plotting character.

In k-means clustering, you first specify how many clusters you think the data fall into. In the image below, a reasonable assumption is 3, the number of species. The next step is to randomly assign each data point (corresponding to a row in the data frame) to a cluster. Then find the central point of each cluster. ML honchos refer to this center as the centroid. The x-value of the centroid is the mean of the x-values of the points in the cluster, and the y-value of the centroid is the mean of the y-values of the points in the cluster.

The next order of business is to calculate the distance between each point and its centroid, square that distance, and add up the squared distances. This sum of squared distances within a cluster is better known as the within sum of squares. Finally, and this is the crucial part, the process repeats until the within sum of squares for each cluster is as small as possible: in other words, until each data point is in the cluster with the closest centroid.

It's also possible to calculate a centroid for the entire set of observations. Its x-coordinate is the average of every data point's x-coordinate (Petal.Length, in this example), and its y-coordinate is the average of every data point's y-coordinate (Petal.Width, in this example).
The sum of squared distances from each point to this overall centroid is called the total sum of squares. The sum of squared distances from each cluster centroid to the overall centroid is the between sum of squares. The ratio (between sum of squares)/(total sum of squares) is a measure of how well the k-means clusters fit the data. A higher number is better.

If these sums of squares ring a bell, you've most likely heard of a statistical analysis technique called analysis of variance. If the ratio of those two sums of squares sounds familiar, you might remember that, in another context, that ratio's square root is called the correlation coefficient.
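Here's a minimal sketch of these ideas with base R's kmeans() function; set.seed() is there only to make the random starting assignments reproducible:

```r
set.seed(123)  # k-means starts from random cluster assignments

# Cluster on the two petal variables, asking for 3 clusters
km <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)

km$centers    # the centroid of each cluster
km$withinss   # within sum of squares, one value per cluster
km$betweenss  # between sum of squares
km$totss      # total sum of squares

# Fit measure: between SS as a proportion of total SS (higher is better)
km$betweenss / km$totss
```

Printing km directly shows this same between-SS/total-SS figure as a percentage, along with the cluster sizes and centers.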

Article / Updated 04-11-2018

SVMs work well when you have to use R to classify individuals on the basis of many features — usually, way more than in the iris data frame. Here, you learn how to create an SVM that identifies the party affiliations of members of the 1984 U.S. House of Representatives. The target variable is whether the congressperson is a Republican or a Democrat, based on their votes on 16 issues of that time. The issues range from water-project cost sharing to education spending. Nine votes are possible, but they are aggregated into three classes: y (yea), n (nay), or ? (vote not registered). (Usually, a question mark signifies missing data, but not in this case.)

Here are a couple of cautions to bear in mind: The name of each issue does not provide enough information to understand the entirety of the issue. Sometimes the associated bill has such convoluted wording that it's hard to tell what a y or n vote means. Nothing here is intended as an endorsement or a disparagement of any position or of either party. This is just a machine learning exercise.

You'll find the Congressional Voting Records data set in the UCI ML Repository. From this page, navigate to the Data Folder and then to the data. Press Ctrl+A to highlight all the data, and then press Ctrl+C to copy it all to the clipboard. Then this code

```
house <- read.csv("clipboard", header=FALSE)
```

turns the data into a data frame. At this point, the first six rows of the data frame are

```
> head(house)
          V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1 republican  n  y  n  y  y  y  n  n  n   y   ?   y   y   y   n   y
2 republican  n  y  n  y  y  y  n  n  n   n   n   y   y   y   n   ?
3   democrat  ?  y  y  ?  y  y  n  n  n   n   y   n   y   y   n   n
4   democrat  n  y  y  n  ?  y  n  n  n   n   y   n   y   n   n   y
5   democrat  y  y  y  n  y  y  n  n  n   n   y   ?   y   y   y   y
6   democrat  n  y  y  n  y  y  n  n  n   n   n   n   y   y   y   y
```

A look at the variable names (in the data set description) shows that most of them are pretty long (like anti-satellite-test-ban).
Typing them takes a lot of time, and assigning them short abbreviations might not be much more informative than V15 or V16. So just change V1 to Party:

```
colnames(house)[1] = "Party"
```

You can use the kernlab package to create the SVM. More specifically, you can use the rattle package, which provides a GUI to kernlab.

**Reading in the data**

With the rattle package installed, rattle() opens the Data tab. To read in the data, follow these steps:

1. Click the R Dataset radio button to open the Data Name box.
2. Click that box's down arrow and select house from the menu that appears.
3. Click to select the check box next to Partition, and then click the Execute button in the upper left corner of the window.
4. Click the Target radio button for Party and the Input radio button for V17, and then click the Execute icon again.

The Rattle Data tab should now look like this.

**Exploring the data**

Next, you'll want to explore the data. The first thing to look at is a distribution of party affiliation. Here's how:

1. On the Explore tab, click the Distributions radio button and the check box next to Party.
2. In the Group By box, select blank (the first choice) so that this box is empty.
3. Click Execute.

This image shows what the Explore tab looks like before that last step, which produces what you see below: the distribution of Republicans and Democrats in the data frame.

**Creating the SVM**

On to the SVM. Follow these steps:

1. On the Model tab, click the SVM radio button.
2. In the Kernel box, click the down arrow and then select Linear (vanilladot) from the menu that appears.
3. Click the Execute icon.

This image shows the Model tab after these choices are made. Clicking Execute changes the screen to look like what you see below, showing the results of the SVM. The machine found 34 support vectors and produced a training error of 0.016447.

**Evaluating the SVM**

To evaluate the SVM against the Testing set, click to select the Evaluate tab.
Then, for Type, click the Error Matrix radio button; for Data, click the Testing radio button. Click Execute to produce the screen shown below.

The SVM incorrectly classifies 2 of the 40 Democrats as Republicans, for an overall error rate of 3 percent (2 out of 66 errors) and an average class error rate of 2.5 percent (the average of 5 percent and 0 percent). Pretty impressive.
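If you'd rather skip the Rattle GUI, roughly the same model can be sketched directly with kernlab. The 70/30 train/test split and the seed are assumptions for illustration; Rattle's default partition is similar:

```r
library(kernlab)

# Assumes house has been read in and V1 renamed to Party, as above.
# Make sure every column is a factor (needed for classification).
house[] <- lapply(house, as.factor)

set.seed(42)
train_rows <- sample(nrow(house), size = round(0.7 * nrow(house)))
train <- house[train_rows, ]
test  <- house[-train_rows, ]

# Linear-kernel SVM, matching Rattle's vanilladot choice
svm_fit <- ksvm(Party ~ ., data = train, kernel = "vanilladot")

# Error matrix on the held-out test set
preds <- predict(svm_fit, test)
table(Actual = test$Party, Predicted = preds)
```

The cross-tabulation at the end is the same kind of error matrix the Evaluate tab produces.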

Article / Updated 04-11-2018

How many data sets are perfectly linearly separable, like set.vers? R programmers know the answer: not many. In fact, here's vers.virg, the two-thirds of the irises that aren't setosa:

```
vers.virg <- subset(iris, Species != "setosa")
```

This image shows the plot of Petal.Width versus Petal.Length for this data frame. You can clearly see the slight overlap between species, and the resulting nonlinear separability.

How can a classifier deal with overlap? One way is to permit some misclassification — some data points on the wrong side of the separation boundary. You may have eyeballed a separation boundary with the versicolor on the left and (most) virginica on the right. The image shows five virginica to the left of the boundary. This is called soft margin classification. As you eyeball the boundary, you should try to minimize the misclassifications. As you examine the data points, perhaps you can see a different separation boundary that works better — one that has fewer misclassifications, in other words. An SVM would find the boundary by working with a parameter called C, which specifies the number of misclassifications the SVM is willing to allow.

Soft margin classification and linear separability, though, don't always work with real data, where you can have all kinds of overlap. Sometimes you find clusters of data points from one category inside a large group of data points from another category. When that happens, it's often necessary to have multiple nonlinear separation boundaries, as shown below. Those nonlinear boundaries define a kernel. An SVM function typically offers a choice of several ways to find a kernel. These choices have names like "linear," "radial," "polynomial," and "sigmoid."

The underlying mathematics is pretty complicated, but here's an intuitive way to think about kernels: Imagine the first image above as a page torn from this book and lying flat on the table.
Suppose that you could separate the data points by moving them in a third dimension above and below the page — say, the versicolor above and the virginica below. Then it would be easy to find a separation boundary, wouldn’t it? Think of kerneling as the process of moving the data into the third dimension. (How far to move each point in the third dimension? That’s where the complicated mathematics comes in.) And the separation boundary would then be a plane, not a line.
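Here's a minimal sketch of these ideas in code, fitting a soft-margin SVM with a radial kernel to the vers.virg frame. The choice of the e1071 package and the cost value are assumptions for illustration; kernlab's ksvm() would work just as well:

```r
library(e1071)

# Two species that overlap slightly, so they aren't linearly separable
vers.virg <- subset(iris, Species != "setosa")
vers.virg$Species <- droplevels(vers.virg$Species)

# A soft-margin SVM with a radial kernel; cost plays the role of C,
# trading off margin width against misclassifications
svm_radial <- svm(Species ~ Petal.Length + Petal.Width,
                  data = vers.virg, kernel = "radial", cost = 1)

# How many training points does it misclassify?
table(Actual = vers.virg$Species, Predicted = predict(svm_radial))
```

Try re-fitting with kernel = "linear" and with different cost values to see how the boundary, and the number of misclassifications, changes.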
