Associations between Binary Variables

By Jeff Sauro

Very often in customer analytics, you encounter binary data that takes the form of yes/no, purchase/didn’t purchase, agree/disagree, and so forth. You need to understand the association between binary variables just as you need to understand the association between continuous variables. The principle of correlation is the same with binary data, but the computations are different.

One of the most famous and visible examples of predictive analytics with binary data is the Amazon recommendation engine.

image0.jpg

While the exact algorithm Amazon uses is proprietary, much of it is known to rest on associations: how likely a person who purchases one book is to also purchase another. The recommendations are based on binary variables. To generate a recommendation, Amazon computes the proportion of customers who purchase one book and, among those same customers, the proportion who purchase each of any number of other books.

Books with the highest association are recommended first, the next-highest associations next, and so forth. The following figure shows transactions from 15 customers across four books. These could just as likely be software, groceries, songs in a playlist, TV shows, or any products or services customers can select from.

image1.jpg

If the customer purchased the book, there’s a 1 in the row; if she didn’t, there’s a 0. For example, Customer 1 purchased Book A and Book B, but not C or D. Customer 2 purchased only Book B.

To compute the association between any two book purchases, follow these steps:

  1. Count the number of customers who purchased each of these combinations of books:

    • Neither book

    • Both books

    • Only Book A

    • Only Book B

  2. Put the totals in a table, like this:

                  Book B
    Book A        Y     N
    Y             6     2
    N             3     4

    For example, six customers bought both Books A and B.

  3. Label the table cells a to d, like this:

                  Book B
    Book A        Y     N
    Y             a     b
    N             c     d

  4. Use the formula for the correlation between binary variables:

    image2.jpg

  5. Fill in the values for the books to find the correlation between binary variables, like this:

    image3.jpg

    In this case, the correlation between customers who purchase Book A and Book B is .327.

    The correlation between two binary variables is called phi and is represented by the Greek letter

    image4.jpg
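For readers working from the text alone, the calculation in steps 4 and 5 can be written out. What follows is the standard form of the phi coefficient, assumed here to match the formula in the figures because it reproduces the .327 result. The letters a, b, c, and d are the cell labels from step 3, and the numbers are the counts from step 2.

```latex
\begin{align*}
\phi &= \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \\
     &= \frac{(6)(4) - (2)(3)}{\sqrt{(6+2)(3+4)(6+3)(2+4)}} \\
     &= \frac{18}{\sqrt{3024}} \approx \frac{18}{54.99} \approx .327
\end{align*}
```

The numerator compares the customers who agree (bought both or neither) with those who disagree (bought only one), and the denominator scales the result by the row and column totals so that phi falls between -1 and 1.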

You can interpret phi the same way as the Pearson correlation r. In fact, phi is a shortcut method for computing r on binary data. If you code each purchase as 1 and each non-purchase as 0 and apply the Excel Pearson formula to each pair of columns, you get the same results.
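The equivalence between phi and r can be checked with a short Python sketch. The two 0/1 lists below are rebuilt from the Book A and Book B counts in step 2 (6 both, 2 only A, 3 only B, 4 neither), so the row order is arbitrary and is not the customer order in the figure. The function names are illustrative, not part of any library.

```python
import math

def cell_counts(x, y):
    """Count the four cells of the 2x2 table for two 0/1 lists."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # both
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # only first
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # only second
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # neither
    return a, b, c, d

def phi_from_counts(a, b, c, d):
    """Phi coefficient from the 2x2 cell counts."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denom

def pearson_r(x, y):
    """Plain Pearson correlation, written out so no extra library is needed."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# 15 customers rebuilt from the table counts (order is an assumption).
book_a = [1] * 6 + [1] * 2 + [0] * 3 + [0] * 4
book_b = [1] * 6 + [0] * 2 + [1] * 3 + [0] * 4

counts = cell_counts(book_a, book_b)   # (6, 2, 3, 4)
phi = phi_from_counts(*counts)
r = pearson_r(book_a, book_b)
print(counts, round(phi, 3), round(r, 3))  # both values round to 0.327
```

Because the two numbers agree, any tool that computes Pearson's r on 0/1 columns can stand in for the hand calculation.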

The following figure shows the data setup in Excel. The correlation between all pairs of books was computed using the =PEARSON() Excel function.

image5.jpg

Then a matrix of correlations was created for each pair of books, as shown here:

image6.jpg

Confirming the earlier result, the correlation between Book A and Book B is .33 (.327 rounded to two places). The second-highest correlation is between Book A and Book D at .25.

The correlation between Book B and Book C is -.48. This negative correlation means that customers who purchase Book B are less likely to purchase Book C.

So if a customer is viewing and considering purchasing Book A, it would make sense to recommend Books B and D as well (and possibly offer that customer an incentive to purchase them), but not Book C.
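The ranking logic described here can be sketched in a few lines of Python. The purchase data below is made up purely for illustration (eight hypothetical customers and four hypothetical items named W to Z); it is not the 15-customer data set from the figures, and `recommend` is an illustrative name. Dropping items with a zero or negative correlation is one reasonable reading of the advice above, not a rule taken from any particular recommendation engine.

```python
import math

def pearson_r(x, y):
    """Pearson correlation for two equal-length lists of 0/1 values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined if everyone or no one bought an item")
    return cov / math.sqrt(var_x * var_y)

def recommend(purchases, viewed):
    """Return (item, correlation) pairs for the other items, highest first,
    keeping only items positively correlated with the viewed item."""
    scores = [
        (item, pearson_r(purchases[viewed], column))
        for item, column in purchases.items()
        if item != viewed
    ]
    positive = [pair for pair in scores if pair[1] > 0]
    return sorted(positive, key=lambda pair: pair[1], reverse=True)

# Hypothetical 0/1 purchase records, one list position per customer.
purchases = {
    "W": [1, 1, 1, 1, 0, 0, 0, 0],
    "X": [1, 1, 1, 0, 1, 0, 0, 0],
    "Y": [0, 0, 0, 1, 1, 1, 1, 0],
    "Z": [1, 1, 0, 0, 1, 0, 0, 0],
}

recs = recommend(purchases, "W")
for item, score in recs:
    print(item, round(score, 2))  # X first, then Z; Y is negative and dropped
```

In this toy data, item Y plays the role Book C plays above: its correlation with the viewed item is negative, so it never appears in the recommendations.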

You may hear the terms basket analysis or affinity analysis. Both are just other names for finding associations and correlations between variables. The idea is like examining customers’ shopping baskets in a grocery store to see what items are purchased together.