How to Use the merge() Function with Data Sets in R
In R you use the merge() function to combine data frames. This powerful function tries to identify columns or rows that are common between the two different data frames.
How to use merge to find the intersection of data
The simplest form of merge() finds the intersection between two different sets of data. In other words, to create a data frame that consists of those states that are cold as well as large, use the default version of merge():
> merge(cold.states, large.states) Name Frost Area 1 Alaska 152 566432 2 Colorado 166 103766 3 Montana 155 145587 4 Nevada 188 109889
If you’re familiar with a database language such as SQL, you may have guessed that merge() is very similar to a database join. This is, indeed, the case and the different arguments to merge() allow you to perform natural joins, as well as left, right, and full outer joins.
The merge() function takes quite a large number of arguments. These arguments can look quite intimidating until you realize that they form a smaller number of related arguments:
x: A data frame.
y: A data frame.
by, by.x, by.y: The names of the columns that are common to both x and y. The default is to use the columns with common names between the two data frames.
all, all.x, all.y: Logical values that specify the type of merge. The default value is all=FALSE (meaning that only the matching rows are returned).
That last group of arguments — all, all.x and all.y — deserves some explanation. These arguments determine the type of merge that will happen.
How to understand the different types of merge
The merge() function allows four ways of combining data:
Natural join: To keep only rows that match from the data frames, specify the argument all=FALSE.
Full outer join: To keep all rows from both data frames, specify all=TRUE.
Left outer join: To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE.
Right outer join: To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE.
How to find the union (full outer join)
Returning to the examples of U.S. states, to perform a complete merge of cold and large states, use merge and specify all=TRUE:
> merge(cold.states, large.states, all=TRUE) Name Frost Area 1 Alaska 152 566432 2 Arizona NA 113417 3 California NA 156361 .... 13 Texas NA 262134 14 Vermont 168 NA 15 Wyoming 173 NA
Both data frames have a variable Name, so R matches the cases based on the names of the states. The variable Frost comes from the data frame cold.states, and the variable Area comes from the data frame large.states.
Note that this performs the complete merge and fills the columns with NA values where there is no matching data.