R Project for RFM Analysis: Another Data Set

By Joseph Schmuller

If you’re interested in trying out your RFM analysis skills on another set of data, this R project is for you. The CDNOW data set consists of almost 70,000 rows. It’s a record of sales at CDNOW from the beginning of January 1997 through the end of June 1998.

With the data open, press Ctrl+A to select all of it, and then press Ctrl+C to copy it to the clipboard. Then use the read.csv() function to read the data into R. Setting sep = "" tells R that the fields are separated by whitespace:

cdNOW <- read.csv("clipboard", header = FALSE, sep = "")

Here’s how to name the columns:

colnames(cdNOW) <- c("CustomerID", "InvoiceDate", "Quantity", "Amount")

The data should look like this:

> head(cdNOW)
  CustomerID InvoiceDate Quantity Amount
1          1    19970101        1  11.77
2          2    19970112        1  12.00
3          2    19970112        5  77.00
4          3    19970102        2  20.76
5          3    19970330        2  20.76
6          3    19970402        2  19.54
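
If reading from the clipboard isn’t convenient on your system, you can try the same call on a few rows supplied as text. Here’s a minimal sketch, assuming the six rows shown above. The names sample_rows and cdNOW_sample are made up for this illustration.

```r
# Six rows copied from the head() output above, one transaction per line
sample_rows <- c(
  "1 19970101 1 11.77",
  "2 19970112 1 12.00",
  "2 19970112 5 77.00",
  "3 19970102 2 20.76",
  "3 19970330 2 20.76",
  "3 19970402 2 19.54"
)

# sep = "" splits on any run of whitespace; header = FALSE keeps row 1 as data
cdNOW_sample <- read.csv(text = sample_rows, header = FALSE, sep = "")
colnames(cdNOW_sample) <- c("CustomerID", "InvoiceDate", "Quantity", "Amount")

# Check the column types: InvoiceDate is still a plain integer at this point
str(cdNOW_sample)
```

Notice that InvoiceDate comes in as an integer, not as a date, which is why it needs reformatting later.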

This project is less complicated than the Online Retail project because Amount is already the total amount of the transaction. Each row is a transaction, so aggregation is not necessary. The Quantity column is irrelevant for our purposes.

Here’s a hint about reformatting the InvoiceDate: The easiest way to get it into R date format is to download and install the lubridate package and use its ymd() function:

library(lubridate)
cdNOW$InvoiceDate <- ymd(cdNOW$InvoiceDate)

After that change, here’s how the first six rows look:

> head(cdNOW)
  CustomerID InvoiceDate Quantity Amount
1          1  1997-01-01        1  11.77
2          2  1997-01-12        1  12.00
3          2  1997-01-12        5  77.00
4          3  1997-01-02        2  20.76
5          3  1997-03-30        2  20.76
6          3  1997-04-02        2  19.54
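
If you’d rather not install a package for this step, base R can do a similar conversion. This is a sketch of an alternative, not the method used in this project. It assumes the dates arrive as integers in yyyymmdd form, as in the data above, and the names invoice_raw and invoice_dates are made up for this illustration.

```r
# InvoiceDate values as read.csv() delivers them: integers in yyyymmdd form
invoice_raw <- c(19970101L, 19970112L, 19970330L)

# as.Date() needs character input plus an explicit format string
invoice_dates <- as.Date(as.character(invoice_raw), format = "%Y%m%d")

print(invoice_dates)
class(invoice_dates)
```

Unlike ymd(), as.Date() has to be told the format explicitly, so the as.character() step and the "%Y%m%d" format string are both required here.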

Almost there. What’s missing for findRFM()? An invoice number. So you have to use a little trick to make one up. The trick is to use each row’s identifier (its row name) as the invoice number. To turn the row identifiers into a data frame column, download and install the tibble package and use its rownames_to_column() function:

library(tibble)
cdNOW <- rownames_to_column(cdNOW, "InvoiceNumber")

Here’s the data:

> head(cdNOW)
  InvoiceNumber CustomerID InvoiceDate Quantity Amount
1             1          1  1997-01-01        1  11.77
2             2          2  1997-01-12        1  12.00
3             3          2  1997-01-12        5  77.00
4             4          3  1997-01-02        2  20.76
5             5          3  1997-03-30        2  20.76
6             6          3  1997-04-02        2  19.54
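
To see what the trick relies on, here’s a base R sketch on a three-row toy data frame. It’s a stand-in for illustration, not the tibble function itself, and the names toy and toy_numbered are made up.

```r
toy <- data.frame(CustomerID = c(1L, 2L, 2L), Amount = c(11.77, 12.00, 77.00))

# A data frame created without explicit row names gets "1", "2", "3", ...
rownames(toy)

# Copy the row names into a first column; they arrive as character strings
toy_numbered <- data.frame(InvoiceNumber = rownames(toy), toy,
                           stringsAsFactors = FALSE)
str(toy_numbered)
```

Because the row names are unique, each transaction ends up with its own invoice number. In this sketch the new column holds text, not numbers.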

Now create a data frame with everything but the Quantity column, and you’re ready.
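
Here’s one way to do that, sketched in base R on the six rows shown above. The names cdNOW_toy and cdNOW_rfm are made up for this illustration, and the column order is an assumption, so arrange the columns in whatever order your findRFM() call expects.

```r
# Toy copy of the six rows shown above, including the made-up invoice numbers
cdNOW_toy <- data.frame(
  InvoiceNumber = as.character(1:6),
  CustomerID    = c(1L, 2L, 2L, 3L, 3L, 3L),
  InvoiceDate   = as.Date(c("1997-01-01", "1997-01-12", "1997-01-12",
                            "1997-01-02", "1997-03-30", "1997-04-02")),
  Quantity      = c(1L, 1L, 5L, 2L, 2L, 2L),
  Amount        = c(11.77, 12.00, 77.00, 20.76, 20.76, 19.54),
  stringsAsFactors = FALSE
)

# Keep the four columns of interest by name, which leaves Quantity behind
cdNOW_rfm <- cdNOW_toy[, c("InvoiceNumber", "CustomerID", "InvoiceDate", "Amount")]

head(cdNOW_rfm)
```

Selecting columns by name keeps working if the columns are ever rearranged. Dropping by position, as in cdNOW_toy[, -4], would not.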

See how much of the Online Retail project you can accomplish in this one.

Happy analyzing!