How to Use the merge() Function with Data Sets in R - dummies

How to Use the merge() Function with Data Sets in R

By Andrie de Vries, Joris Meys

In R you use the merge() function to combine data frames. This powerful function tries to identify columns or rows that are common between the two different data frames.

How to use merge to find the intersection of data

The simplest form of merge() finds the intersection between two different sets of data. In other words, to create a data frame that consists of those states that are cold as well as large, use the default version of merge():

> merge(cold.states, large.states)
   Name Frost  Area
1  Alaska  152 566432
2 Colorado  166 103766
3 Montana  155 145587
4  Nevada  188 109889

If you’re familiar with a database language such as SQL, you may have guessed that merge() is very similar to a database join. This is, indeed, the case and the different arguments to merge() allow you to perform natural joins, as well as left, right, and full outer joins.

The merge() function takes quite a large number of arguments. These arguments can look quite intimidating until you realize that they form a smaller number of related arguments:

  • x: A data frame.

  • y: A data frame.

  • by, by.x, by.y: The names of the columns that are common to both x and y. The default is to use the columns with common names between the two data frames.

  • all, all.x, all.y: Logical values that specify the type of merge. The default value is all=FALSE (meaning that only the matching rows are returned).

That last group of arguments — all, all.x and all.y — deserves some explanation. These arguments determine the type of merge that will happen.

How to understand the different types of merge

The merge() function allows four ways of combining data:

  • Natural join: To keep only rows that match from the data frames, specify the argument all=FALSE.

  • Full outer join: To keep all rows from both data frames, specify all=TRUE.

  • Left outer join: To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE.

  • Right outer join: To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE.


How to find the union (full outer join)

Returning to the examples of U.S. states, to perform a complete merge of cold and large states, use merge and specify all=TRUE:

> merge(cold.states, large.states, all=TRUE)
      Name Frost  Area
1     Alaska  152 566432
2    Arizona  NA 113417
3   California  NA 156361
13     Texas  NA 262134
14    Vermont  168   NA
15    Wyoming  173   NA

Both data frames have a variable Name, so R matches the cases based on the names of the states. The variable Frost comes from the data frame cold.states, and the variable Area comes from the data frame large.states.

Note that this performs the complete merge and fills the columns with NA values where there is no matching data.