How to Work with Lookup Tables in R - dummies

By Andrie de Vries, Joris Meys

Sometimes doing a full merge of the data in R isn’t exactly what you want. In these cases, it may be more appropriate to match values in a lookup table. To do this, you can use the match() or %in% function.

How to find a match

The match() function returns the matching positions of two vectors or, more specifically, the positions of first matches of one vector in the second vector. For example, to find which large states also occur in the data frame cold.states, you can do the following:

> index <- match(cold.states$Name, large.states$Name)
> index
 [1] 1 4 NA NA 5 6 NA NA NA NA NA

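As you see, the result is a vector with one element for each name in cold.states. It indicates that matches were found at positions one, four, five, and six of large.states$Name; the NA values mark cold states that have no match. You can use this result as an index to find all the large states that are also cold states.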
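The two data frames are not defined in this section. The following is a minimal sketch of one way they could be built so that the example can be run; it assumes they come from R's built-in state.x77 data set, and the helper name all.states and both cut-off values (more than 150 frost days, at least 100,000 square miles) are assumptions chosen to be consistent with the output shown here.

```r
# Assumption: both data frames are subsets of the built-in state.x77 matrix.
all.states <- as.data.frame(state.x77)
all.states$Name <- rownames(state.x77)
rownames(all.states) <- NULL   # use plain row numbers 1..50 as row names

# Assumed cut-offs, not stated in this section.
cold.states  <- all.states[all.states$Frost > 150, c("Name", "Frost")]
large.states <- all.states[all.states$Area >= 100000, c("Name", "Area")]

# Position in large.states$Name of each name in cold.states$Name (NA = no match)
index <- match(cold.states$Name, large.states$Name)
print(index)
```

Because match() reports only the first match, it works best when the names in the second vector are unique, as state names are.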

Keep in mind that you need to remove the NA values first, using na.omit(); otherwise, every NA in the index produces a row filled with NA in the result:

> large.states[na.omit(index), ]
       Name   Area
2    Alaska 566432
6  Colorado 103766
26  Montana 145587
28   Nevada 109889

How to make sense of %in%

A very convenient alternative to match() is the function %in%, which returns a logical vector indicating whether there is a match.

The %in% function is a special type of function called a binary operator. This means you use it by placing it between two vectors, unlike most other functions where the arguments are in parentheses:

> index <- cold.states$Name %in% large.states$Name
> index
 [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

If you compare this to the result of match(), you see that you have a TRUE value for every non-missing value in the result of match(). Or, to put it in R code, the operator %in% does the same as the following code:

> !is.na(match(cold.states$Name, large.states$Name))

The match() function returns the indices of the matches in the second argument for the values in the first argument. On the other hand, %in% returns TRUE for every value in the first argument that matches a value in the second argument. The order of the arguments is important here.
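To see why the order of the arguments matters, here is a small sketch with two made-up character vectors (x and lookup are hypothetical names, not objects from the example above):

```r
x      <- c("b", "z", "a")
lookup <- c("a", "b", "c", "d")

# One result per element of the FIRST argument
pos.x      <- match(x, lookup)   # positions in lookup: 2 NA 1
pos.lookup <- match(lookup, x)   # positions in x:      3 1 NA NA

in.x      <- x %in% lookup       # TRUE FALSE TRUE
in.lookup <- lookup %in% x       # TRUE TRUE FALSE FALSE

# %in% gives the same answer as testing match() for non-missing values
same <- identical(in.x, !is.na(match(x, lookup)))
print(same)
```

In both cases the result has the same length as the first argument, so swapping the arguments changes both the length and the meaning of the result.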

Because %in% returns a logical vector, you can use it directly as an index, for example to select the matching rows of a data frame:

> cold.states[index, ]
       Name Frost
2    Alaska   152
6  Colorado   166
26  Montana   155
28   Nevada   188

As mentioned earlier, the %in% function is an example of a binary operator in R. This means that you use the function by putting it between two values, as you would for other operators, such as + (plus) and - (minus). At the same time, %in% is an infix operator. In R, an infix operator of this kind is identifiable by the percent signs around the function name.
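Because operators are ordinary functions in R, you can call them by name if you wrap the name in backticks, and you can define your own operators with percent signs. The following is a small sketch; the operator %notin% is a hypothetical name made up for this illustration and is not part of base R.

```r
# Operators can be called like any other function by quoting the name in backticks
three   <- `+`(1, 2)                   # same as 1 + 2
found   <- `%in%`("a", c("a", "b"))    # same as "a" %in% c("a", "b")

# Hypothetical user-defined infix operator: the opposite of %in%
`%notin%` <- function(x, table) !(x %in% table)

missing <- c("a", "z") %notin% c("a", "b")
print(missing)
```

Any function of two arguments whose name starts and ends with a percent sign can be used between its arguments in this way.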

If you want to know how %in% is defined, look at the Details section of its Help page. But note that you have to place quotation marks around the function name to get the Help page, like this: ?"%in%".