Merging and Appending Data

Data Science Essentials For Dummies

When your data is in more than one place, you need ways to put it all together. When you join two datasets with different variables, you’re merging data. Merging is a common operation in data mining, used to combine linked data such as

Customer records and marketing campaign data
Before and after test results
Internal and vendor data

To merge datasets, you must have a variable that identifies cases for matching; this is called a key or identifier variable. You may also have to identify one of the datasets as primary; the primary table must have only one case for each value of the key variable.
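If you work in code rather than a menu-driven application, the same idea can be sketched in Python with the pandas library. This is only an illustration under stated assumptions: the table names, column names, and values below are hypothetical, and pandas is not a tool the article itself discusses. The example treats the customer table as the primary dataset and asks pandas to confirm that each key value appears in it only once.

```python
import pandas as pd

# Hypothetical primary table: exactly one case per customer_id (the key variable).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cy"],
})

# Hypothetical secondary table: a customer may appear any number of times.
campaign_responses = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "campaign": ["spring", "fall", "spring"],
})

# Merge on the key. validate="one_to_many" makes pandas raise an error
# if customer_id is repeated in the primary (left) table.
# how="left" keeps every customer, even those with no campaign data.
merged = customers.merge(
    campaign_responses,
    on="customer_id",
    how="left",
    validate="one_to_many",
)

print(merged)
```

Customers with no matching campaign record are kept, with a missing value in the campaign column. Choosing a different `how` setting (for example, `"inner"`) would drop those cases instead.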

Some data-mining applications have more than one tool for merging datasets. The first figure shows the tool for basic merges, and the second figure shows the tool for setting up more complex merge criteria.

If your data sources contain the same variables (more or less; the match does not have to be identical) but different cases, joining them is called appending or concatenation. Like merging, this is a common operation. It’s used whenever you have new cases for something that you’ve already been tracking.

The tricky part of finding the right tool is often figuring out what it’s called. Look in the menus (or search) for append, concatenate, or merge rows.
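In code, the naming puzzle is the same. As a rough sketch, assuming the pandas library in Python (not a tool the article itself covers), appending is done with a function called `concat`. The tables and column names below are hypothetical. Note that the two tables share most variables but not all of them, which matches the "more or less" case described above.

```python
import pandas as pd

# Hypothetical existing cases.
existing = pd.DataFrame({
    "order_id": [101, 102],
    "amount": [25.0, 40.0],
    "region": ["east", "west"],
})

# Hypothetical new cases: mostly the same variables, but not identical.
new_cases = pd.DataFrame({
    "order_id": [103, 104],
    "amount": [15.5, 60.0],
    "channel": ["web", "store"],
})

# Stack the rows. Variables found in only one table are kept,
# with missing values filled in for the cases that lack them.
# ignore_index=True renumbers the rows from 0.
combined = pd.concat([existing, new_cases], ignore_index=True, sort=False)

print(combined)
```

The result has four cases and all four variables; `region` is missing for the new cases and `channel` is missing for the existing ones.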

About This Article

About the book author:

Meta S. Brown helps organizations use practical data analysis to solve everyday business problems. A hands-on data miner who has tackled projects with up to $900 million at stake, she is a recognized expert in cutting-edge business analytics.