# R Articles

R is a favorite of data scientists and statisticians everywhere, with its ability to crunch large datasets and deal with scientific information.

## Articles From R


Cheat Sheet / Updated 05-02-2022

To complete any project using R, you work with functions that live in packages designed for specific areas. This cheat sheet provides some information about these functions.

Cheat Sheet / Updated 02-14-2022

R is more than just a statistical programming language. It’s also a powerful tool for all kinds of data processing and manipulation, used by a community of programmers and users, academics, and practitioners. To get the most out of R, you need to know how to access the R Help files and find help from other sources. To represent data in R, you need to be able to succinctly and correctly specify subsets of your data. Finally, R has many functions that allow you to import data from other applications.
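As a small illustration of the subsetting and importing ideas mentioned above, here is a minimal sketch that uses only base R. The inline CSV text and its column names are made up for the example; in practice you would pass a file path to read.csv().

```r
# Import: read.csv() normally takes a file path. Here the CSV is supplied
# inline through the text argument so the sketch is self-contained.
# The data is hypothetical.
csv_text <- "name,score,group
Ann,82,A
Bob,67,B
Cara,91,A
Dan,74,B"
scores <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Subset by position: first two rows, first two columns
scores[1:2, 1:2]

# Subset by name: a single column as a vector
scores$score

# Subset by logical condition: rows where score exceeds 80
high <- scores[scores$score > 80, ]
high

# Help files: uncomment these in an interactive session
# ?read.csv
# help.search("subset")
```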

Cheat Sheet / Updated 01-26-2022

R provides a wide array of functions to help you with statistical analysis, from simple statistics to complex analyses. Several statistical functions are built into R and R packages. R statistical functions fall into several categories, including central tendency and variability, relative standing, t-tests, analysis of variance, and regression analysis.
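To make those categories concrete, the following sketch calls one or two base R functions from each of them. The choice of the built-in mtcars data set and of these particular variables is an assumption made for illustration, not something taken from the cheat sheet itself.

```r
# Built-in mtcars data: mpg (miles per gallon), am (0 = automatic,
# 1 = manual), cyl (number of cylinders), wt (weight)
mpg <- mtcars$mpg

# Central tendency and variability
mean(mpg)
median(mpg)
var(mpg)
sd(mpg)

# Relative standing: z-scores and quartiles
z <- as.vector(scale(mpg))
quantile(mpg, c(0.25, 0.5, 0.75))

# t-test: compare mpg between automatic and manual cars
tt <- t.test(mpg ~ am, data = mtcars)

# Analysis of variance: mpg across cylinder counts
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit_aov)

# Regression analysis: mpg predicted from weight
fit_lm <- lm(mpg ~ wt, data = mtcars)
coef(fit_lm)
```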

Article / Updated 11-04-2021

The rbind() function in the R programming language conveniently adds the names of the vectors to the rows of the matrix. You name the values in a vector, and you can do something very similar with rows and columns in a matrix. For that, you have the functions rownames() and colnames(). Guess which one does what? Both functions work much like the names() function you use when naming vector values.

#### Changing the row and column names

The matrix baskets.team already has some row names. It would be better if the names of the rows would just read "Granny" and "Geraldine". You can easily change these row names like this:

    > rownames(baskets.team) <- c("Granny", "Geraldine")

You can look at the matrix to check whether this did what it’s supposed to do, or you can take a look at the row names themselves like this:

    > rownames(baskets.team)
    [1] "Granny"    "Geraldine"

The colnames() function works exactly the same. You can, for example, add the number of the game as a column name using the following code:

    > colnames(baskets.team) <- c("1st", "2nd", "3th", "4th", "5th", "6th")

This gives you the following matrix:

    > baskets.team
              1st 2nd 3th 4th 5th 6th
    Granny     12   4   5   6   9   3
    Geraldine   5   4   2   4  12   9

This is almost like you want it, but the third column name contains an annoying writing mistake. No problem there; R allows you to easily correct that mistake. Just as with the names() function, you can use indices to extract or to change a specific row or column name. You can correct the mistake in the column names like this:

    > colnames(baskets.team)[3] <- "3rd"

If you want to get rid of either column names or row names, the only thing you need to do is set their value to NULL. This also works for vector names, by the way. You can try that out yourself on a copy of the matrix baskets.team like this:

    > baskets.copy <- baskets.team
    > colnames(baskets.copy) <- NULL
    > baskets.copy
              [,1] [,2] [,3] [,4] [,5] [,6]
    Granny      12    4    5    6    9    3
    Geraldine    5    4    2    4   12    9

R stores the row and column names in an attribute called dimnames. Use the dimnames() function to extract or set those values.

#### Using names as indices

These row and column names can be used just like you use names for values in a vector. You can use these names instead of the index number to select values from a vector. This works for matrices as well, using the row and column names. Say you want to select the second and the fifth game for both ladies; try:

    > baskets.team[, c("2nd", "5th")]
              2nd 5th
    Granny      4   9
    Geraldine   4  12

Exactly as before, you get all rows if you don’t specify which ones you want. Alternatively, you can extract all the results for Granny like this:

    > baskets.team["Granny", ]
    1st 2nd 3rd 4th 5th 6th
     12   4   5   6   9   3

That’s the result, indeed, but the row name is gone now. R tries to simplify the matrix to a vector, if that’s possible. In this case, a single row is returned so, by default, this result is transformed to a vector. If a one-row matrix is simplified to a vector, the column names are used as names for the values. If a one-column matrix is simplified to a vector, the row names are used as names for the vector. If you want to keep all names, you must set the argument drop to FALSE to avoid conversion to a vector.
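The effect of the drop argument and of dimnames() can be checked with a short sketch. It rebuilds baskets.team from the values printed above; constructing the matrix with named rbind() arguments is an assumption about how the original was created.

```r
# Rebuild the matrix from the values shown in the article.
# Naming the arguments to rbind() sets the row names directly.
baskets.team <- rbind(Granny    = c(12, 4, 5, 6, 9, 3),
                      Geraldine = c(5, 4, 2, 4, 12, 9))
colnames(baskets.team) <- c("1st", "2nd", "3rd", "4th", "5th", "6th")

# Default behavior: a single row is simplified to a named vector,
# so the row name is lost and dim() is NULL
granny.vec <- baskets.team["Granny", ]
dim(granny.vec)

# drop = FALSE keeps the result as a one-row matrix with both sets of names
granny.mat <- baskets.team["Granny", , drop = FALSE]
dim(granny.mat)
rownames(granny.mat)

# dimnames() returns both sets of names as a list: rows first, then columns
dimnames(baskets.team)
```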

Article / Updated 10-28-2021

In the R programming language, a conversion from a matrix to a data frame can’t be used to construct a data frame with different types of values. If you combine both numeric and character data in a matrix, for example, everything will be converted to character. You can construct a data frame from scratch, though, using the data.frame() function. Once a data frame is created, you can add observations to it.

#### Make a data frame from vectors in R

So, let’s make a little data frame with the names, salaries, and starting dates of a few imaginary co-workers. First, you create three vectors that contain the necessary information like this:

    > employee <- c('John Doe','Peter Gynn','Jolie Hope')
    > salary <- c(21000, 23400, 26800)
    > startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))

Now you have three different vectors in your workspace:

- A character vector called employee, containing the names
- A numeric vector called salary, containing the yearly salaries
- A date vector called startdate, containing the dates on which the co-workers started

Next, you combine the three vectors into a data frame using the following code:

    > employ.data <- data.frame(employee, salary, startdate)

The result of this is a data frame, employ.data, with the following structure:

    > str(employ.data)
    'data.frame':   3 obs. of  3 variables:
     $ employee : Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
     $ salary   : num  21000 23400 26800
     $ startdate: Date, format: "2010-11-01" "2008-03-25" ...

To combine a number of vectors into a data frame, you simply add all vectors as arguments to the data.frame() function, separated by commas. R will create a data frame with the variables that are named the same as the vectors used.

#### Keep characters as characters in R

You may have noticed something odd when looking at the structure of employ.data. Whereas the vector employee is a character vector, R made the variable employee in the data frame a factor.

R versions before 4.0.0 do this by default; from R 4.0.0 on, character vectors stay character unless you ask otherwise. Either way, the data.frame() function has an extra argument that controls this behavior, namely stringsAsFactors. In the employ.data example, you can prevent the transformation to a factor of the employee variable by using the following code:

    > employ.data <- data.frame(employee, salary, startdate, stringsAsFactors=FALSE)

If you look at the structure of the data frame now, you see that the variable employee is a character vector, as shown in the following output:

    > str(employ.data)
    'data.frame':   3 obs. of  3 variables:
     $ employee : chr  "John Doe" "Peter Gynn" "Jolie Hope"
     $ salary   : num  21000 23400 26800
     $ startdate: Date, format: "2010-11-01" "2008-03-25" ...

In R versions before 4.0.0, character vectors are transformed to factors by default when you create a data frame with character vectors or convert a character matrix to a data frame. This can be a nasty cause of errors if your code runs on more than one R version and you’re not aware of it. If you make it a habit to always specify the stringsAsFactors argument, you can avoid a lot of frustration.
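A short sketch can confirm what stringsAsFactors does and show one way to add an observation afterward. It sets the argument explicitly both ways, so the result should not depend on the R version. The extra employee "Ann Example" and her values are hypothetical, and using rbind() to append a row is one common approach rather than the only one.

```r
employee  <- c("John Doe", "Peter Gynn", "Jolie Hope")
salary    <- c(21000, 23400, 26800)
startdate <- as.Date(c("2010-11-1", "2008-3-25", "2007-3-14"))

# Setting the argument explicitly makes the result independent of the default
as_factor <- data.frame(employee, salary, startdate, stringsAsFactors = TRUE)
as_char   <- data.frame(employee, salary, startdate, stringsAsFactors = FALSE)

class(as_factor$employee)  # factor
class(as_char$employee)    # character

# Adding an observation: build a one-row data frame with the same
# column names and bind it to the existing one (hypothetical values)
new_row <- data.frame(employee  = "Ann Example",
                      salary    = 25000,
                      startdate = as.Date("2012-01-15"),
                      stringsAsFactors = FALSE)
employ.data <- rbind(as_char, new_row)
str(employ.data)
```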

Article / Updated 09-30-2019

After you calculate the variance of a set of numbers, you have a value whose units are different from your original measurements. For example, if your original measurements are in inches, their variance is in square inches. This is because you square the deviations before you average them. So the variance in the five-score population in the preceding example is 6.8 square inches.

It might be hard to grasp what that means. Often, it's more intuitive if the variation statistic is in the same units as the original measurements. It's easy to turn variance into that kind of statistic. All you have to do is take the square root of the variance. Like the variance, this square root is so important that it has a special name: standard deviation.

#### Population standard deviation

The standard deviation of a population is the square root of the population variance. The symbol for the population standard deviation is σ (sigma). Its formula is

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

For this 5-score population of measurements (in inches):

50, 47, 52, 46, and 45

the population variance is 6.8 square inches, and the population standard deviation is 2.61 inches (rounded off).

#### Sample standard deviation

The standard deviation of a sample, which is an estimate of the standard deviation of a population, is the square root of the sample variance. Its symbol is s and its formula is

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$

For this sample of measurements (in inches):

50, 47, 52, 46, and 45

the estimated population variance is 8.5 square inches, and the estimated population standard deviation is 2.92 inches (rounded off).

#### Using R to compute standard deviation

As is the case with variance, using R to compute the standard deviation is easy: You use the sd() function. And like its variance counterpart, sd() calculates s, not σ:

    > sd(heights)
    [1] 2.915476

For σ (treating the five numbers as a self-contained population, in other words) you have to multiply the sd() result by the square root of (N-1)/N:

    > sd(heights)*(sqrt((length(heights)-1)/length(heights)))
    [1] 2.607681

Again, if you're going to use this one frequently, defining a function is a good idea:

    sd.p <- function(x) {sd(x) * sqrt((length(x) - 1) / length(x))}

And here's how you use this function:

    > sd.p(heights)
    [1] 2.607681
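The two formulas can also be worked through step by step in R. This sketch assumes that heights holds the five scores listed above (the vector is used but not defined in this excerpt) and compares the hand-computed values with sd().

```r
# Assumption: heights holds the five measurements from the article
heights <- c(50, 47, 52, 46, 45)
N <- length(heights)

# Deviations from the mean, squared and summed
deviations <- heights - mean(heights)
sum_sq <- sum(deviations^2)

# Population variance divides by N; sample variance divides by N - 1
pop_var  <- sum_sq / N
samp_var <- sum_sq / (N - 1)

# Standard deviations are the square roots of the variances
pop_sd  <- sqrt(pop_var)
samp_sd <- sqrt(samp_var)

# sd() matches the sample version; the (N - 1)/N correction gives
# the population version
all.equal(samp_sd, sd(heights))
all.equal(pop_sd, sd(heights) * sqrt((N - 1) / N))
```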

Article / Updated 04-11-2018

If you’ve been working with images, animated images, and combined stationary images in R, it may be time to take the next step. This project walks you through the next step: Combine an image with an animated image.

This image shows the end product: the plot of the iris data set with comedy icons Laurel and Hardy positioned in front of the plot legend. When you open this combined image in the Viewer, you see Stan and Ollie dancing their little derbies off. (The derbies don’t actually come off in the animation, but you get the drift.)

#### Getting Stan and Ollie

Check out the Laurel and Hardy GIF. Right-click the image and select Save Image As from the pop-up menu that appears. Save it as animated-dancing-image-0243 in your Documents folder. Then read it into R:

    l_and_h <- image_read("animated-dancing-image-0243.gif")

Applying the length() function to l_and_h indicates that this GIF consists of ten frames:

    > length(l_and_h)
    [1] 10

To add a coolness factor, make the background of the GIF transparent before image_read() works with it. This free online image editor does the job quite nicely.

#### Combining the boys with the background

If you use the image combination technique, the code looks like this:

    image_composite(image=background, composite_image=l_and_h, offset = "+510+200")

The picture it produces looks like the image above but with one problem: The boys aren’t dancing. Why is that? The reason is that image_composite() combined the background with just the first frame of l_and_h, not with all ten. It’s exactly the same as if you had run

    image_composite(image=background, composite_image=l_and_h[1], offset = "+510+200")

The length() function verifies this:

    > length(image_composite(image=background, composite_image=l_and_h, offset = "+510+200"))
    [1] 1

If all ten frames were involved, the length() function would have returned 10. To get this done properly, you have to use a magick function called image_apply().
#### Explaining image_apply()

So that you fully understand how this important function works, let's describe an analogous function called lapply(). If you want to apply a function (like mean()) to the variables of a data frame, like iris, one way to do that is with a for loop: Start with the first column and calculate its mean, go to the next column and calculate its mean, and so on until you calculate all the column means. For technical reasons, it’s faster and more efficient to use lapply() to apply mean() to all the variables:

    > lapply(iris, mean)
    $Sepal.Length
    [1] 5.843333

    $Sepal.Width
    [1] 3.057333

    $Petal.Length
    [1] 3.758

    $Petal.Width
    [1] 1.199333

    $Species
    [1] NA

A warning message comes with that last one, but that’s okay. Another way to write lapply(iris, mean) is lapply(iris, function(x){mean(x)}). This second way comes in handy when the function becomes more complicated. If, for some reason, you want to square the value of each score in the data set and then multiply the result by three, and then calculate the mean of each column, here’s how to code it:

    lapply(iris, function(x){mean(3*(x^2))})

In a similar way, image_apply() applies a function to every frame in an animated GIF.
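The for-loop and lapply() approaches described above can be compared side by side in a small sketch. Restricting iris to its numeric columns first is an added step, not part of the original walkthrough; it avoids the NA and the warning that the Species column produces.

```r
# Keep only the numeric columns so that mean() applies cleanly
# (this filtering step is an addition for illustration)
numeric_iris <- iris[sapply(iris, is.numeric)]

# for-loop version: visit each column and store its mean
col_means_loop <- list()
for (col in names(numeric_iris)) {
  col_means_loop[[col]] <- mean(numeric_iris[[col]])
}

# lapply() version: same result in one call
col_means_apply <- lapply(numeric_iris, mean)

# Both approaches agree
all.equal(col_means_loop, col_means_apply)

# An anonymous function works the same way for more involved calculations
scaled_means <- lapply(numeric_iris, function(x) mean(3 * x^2))
scaled_means
```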
In this project, the function that gets applied to every frame is image_composite():

    function(frame){image_composite(image=background, composite_image=frame, offset = "+510+200")}

So, within image_apply(), that’s

    frames <- image_apply(image=l_and_h, function(frame) {
      image_composite(image=background, composite_image=frame, offset = "+510+200")
    })

After you run that code, length(frames) verifies the ten frames:

    > length(frames)
    [1] 10

#### Getting back to the animation

The image_animate() function puts it all in motion at ten frames per second:

    animation <- image_animate(frames, fps = 10)

To put the show on the screen, it’s

    print(animation)

All together now:

    library(magick)

    l_and_h <- image_read("animated-dancing-image-0243.gif")
    background <- image_background(iris_plot, "white")
    frames <- image_apply(image=l_and_h, function(frame) {
      image_composite(image=background, composite_image=frame, offset = "+510+200")
    })
    animation <- image_animate(frames, fps = 10)
    print(animation)

And that’s the code for the image above.

One more thing. The image_write() function saves the animation as a handy little reusable GIF:

    image_write(animation, "LHirises.gif")

Article / Updated 04-11-2018

Here, you learn about books and websites that help you learn more about R programming. Without further ado. . .

#### Interacting with users

If you want to delve deeper into R applications that interact with users, start with this tutorial by shiny guiding force Garrett Grolemund. For a helpful book on the subject, consider Chris Beeley’s Web Application Development with R Using Shiny, 2nd Edition (Packt Publishing, 2016).

#### Machine learning

For the lowdown on all things Rattle, go directly to the source: Rattle creator Graham Williams has written Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery (Springer, 2011). Check out the companion website.

The University of California-Irvine Machine Learning Repository plays a huge role in the R programming world. Here’s how its creator prefers that you cite the material:

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Thank you, UCI Anteaters!

If machine learning interests you, take a comprehensive look at the field (under its other name, “statistical learning”): Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani’s An Introduction to Statistical Learning with Applications in R (Springer, 2017).

An Introduction to Neural Networks, by Ben Krose and Patrick van der Smagt, is a little dated, but you can get it for the low, low price of nothing.

After you download a large PDF, it’s a good idea to upload it into an ebook app, like Google Play Books. That turns the PDF into an ebook and makes it easier to navigate on a tablet.

#### Databases

The R-bloggers website has a nice article on working with databases. Of course, R-bloggers has terrific articles on a lot of R-related topics!

You can learn quite a bit about RFM (Recency Frequency Money) analysis and customer segmentation at www.putler.com/rfm-analysis.

#### Maps and images

The area of maps is a fascinating one. You might be interested in something at a higher level. If so, read Introduction to visualising spatial data in R by Robin Lovelace, James Cheshire, Rachel Oldroyd (and others).

David Kahle and Hadley Wickham’s ggmap: Spatial Visualization with ggplot2 is also at a higher level.

Fascinated by magick? The best place to go is the primary source. Check it out.

Article / Updated 04-11-2018

Try out this R project to see how one variable might affect an outcome. It’s conceivable that weather conditions could influence flight delays. How do you incorporate weather information into the assessment of delay? One nycflights13 data frame called weather provides the weather data for every day and hour at each of the three origin airports. Here’s a glimpse of exactly what it has:

    > glimpse(weather,60)
    Observations: 26,130
    Variables: 15
    $ origin     "EWR", "EWR", "EWR", "EWR", "EWR", "...
    $ year       2013, 2013, 2013, 2013, 2013, 2013, ...
    $ month      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    $ day        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    $ hour       0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 1...
    $ temp       37.04, 37.04, 37.94, 37.94, 37.94, 3...
    $ dewp       21.92, 21.92, 21.92, 23.00, 24.08, 2...
    $ humid      53.97, 53.97, 52.09, 54.51, 57.04, 5...
    $ wind_dir   230, 230, 230, 230, 240, 270, 250, 2...
    $ wind_speed 10.35702, 13.80936, 12.65858, 13.809...
    $ wind_gust  11.918651, 15.891535, 14.567241, 15....
    $ precip     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
    $ pressure   1013.9, 1013.0, 1012.6, 1012.7, 1012...
    $ visib      10, 10, 10, 10, 10, 10, 10, 10, 10, ...
    $ time_hour  2012-12-31 19:00:00, 2012-12-31 20:...

So the variables it has in common with flites_name_day are the first five (origin, year, month, day, and hour) and the last one (time_hour). To join the two data frames, use this code:

    flites_day_weather <- flites_day %>%
      inner_join(weather, by = c("origin","year","month","day","hour","time_hour"))

Now you can use flites_day_weather to start answering questions about departure delay and the weather. What questions will you ask? How will you answer them? What plots will you draw? What regression lines will you create? Will scale() help?

And, when you’re all done, take a look at arrival delay (arr_delay).
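To see what an inner join on several key columns does, here is a minimal sketch on two tiny made-up data frames. It uses base R's merge(), which performs an inner join by default, as a stand-in for dplyr's inner_join() so that the example runs without extra packages. The data frames flights_toy and weather_toy and their values are hypothetical.

```r
# Hypothetical miniature versions of the flights and weather tables
flights_toy <- data.frame(
  origin    = c("EWR", "EWR", "JFK", "LGA"),
  hour      = c(6, 7, 6, 9),
  dep_delay = c(4, -2, 15, 0)
)
weather_toy <- data.frame(
  origin     = c("EWR", "EWR", "JFK"),
  hour       = c(6, 7, 6),
  wind_speed = c(10.4, 13.8, 8.1)
)

# merge() keeps only the rows whose key values appear in both data frames,
# which is what an inner join does
joined <- merge(flights_toy, weather_toy, by = c("origin", "hour"))
joined

# The LGA flight has no matching weather row, so it is dropped
nrow(joined)
```

With the real data, a row count that shrinks after the join is a signal that some flights have no matching weather record.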

Article / Updated 04-11-2018

If you’re interested in trying out your RFM analysis skills on another set of data, this R project is for you. The CDNOW data set consists of almost 70,000 rows. It’s a record of sales at CDNOW from the beginning of January 1997 through the end of June 1998.

Press Ctrl+A to highlight all the data, and press Ctrl+C to copy to the clipboard. Then use the read.csv() function to read the data into R:

    cdNOW <- read.csv("clipboard", header=FALSE, sep = "")

Here’s how to name the columns:

    colnames(cdNOW) <- c("CustomerID","InvoiceDate","Quantity","Amount")

The data should look like this:

    > head(cdNOW)
      CustomerID InvoiceDate Quantity Amount
    1          1    19970101        1  11.77
    2          2    19970112        1  12.00
    3          2    19970112        5  77.00
    4          3    19970102        2  20.76
    5          3    19970330        2  20.76
    6          3    19970402        2  19.54

It’s less complicated than the Online Retail project because Amount is the total amount of the transaction. So each row is a transaction, and aggregation is not necessary. The Quantity column is irrelevant for our purposes.

Here’s a hint about reformatting the InvoiceDate: The easiest way to get it into R date format is to download and install the lubridate package and use its ymd() function:

    cdNOW$InvoiceDate <- ymd(cdNOW$InvoiceDate)

After that change, here’s how the first six rows look:

    > head(cdNOW)
      CustomerID InvoiceDate Quantity Amount
    1          1  1997-01-01        1  11.77
    2          2  1997-01-12        1  12.00
    3          2  1997-01-12        5  77.00
    4          3  1997-01-02        2  20.76
    5          3  1997-03-30        2  20.76
    6          3  1997-04-02        2  19.54

Almost there. What’s missing for findRFM()? An invoice number. So you have to use a little trick to make one up. The trick is to use each row identifier in the row-identifier column as the invoice number. To turn the row-identifier column into a data frame column, download and install the tibble package and use its rownames_to_column() function:

    cdNOW <- rownames_to_column(cdNOW, "InvoiceNumber")

Here’s the data:

    > head(cdNOW)
      InvoiceNumber CustomerID InvoiceDate Quantity Amount
    1             1          1  1997-01-01        1  11.77
    2             2          2  1997-01-12        1  12.00
    3             3          2  1997-01-12        5  77.00
    4             4          3  1997-01-02        2  20.76
    5             5          3  1997-03-30        2  20.76
    6             6          3  1997-04-02        2  19.54

Now create a data frame with everything but that Quantity column and you’re ready. See how much of the Online Retail project you can accomplish in this one. Happy analyzing!
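The preparation steps can also be sketched with base R alone, which may help if you want to check the logic before installing lubridate and tibble. Here as.Date() and rownames() stand in for ymd() and rownames_to_column(); the small data frame cd_toy reuses the first four rows shown above, and the name cd_rfm is made up for the example.

```r
# First four rows of the data as shown in the article
cd_toy <- data.frame(
  CustomerID  = c(1, 2, 2, 3),
  InvoiceDate = c(19970101, 19970112, 19970112, 19970102),
  Quantity    = c(1, 1, 5, 2),
  Amount      = c(11.77, 12.00, 77.00, 20.76)
)

# Base R stand-in for lubridate::ymd(): convert yyyymmdd numbers to dates
cd_toy$InvoiceDate <- as.Date(as.character(cd_toy$InvoiceDate),
                              format = "%Y%m%d")

# Base R stand-in for tibble::rownames_to_column(): use the row
# identifiers as made-up invoice numbers
cd_toy$InvoiceNumber <- rownames(cd_toy)

# Keep everything except Quantity, with InvoiceNumber first
cd_rfm <- cd_toy[, c("InvoiceNumber", "CustomerID", "InvoiceDate", "Amount")]
cd_rfm
```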
