How to Deal with Duplicate Values in Your Data

By Alan Anderson, David Semmelroth

Data is stored in different ways in different systems. So it’s no surprise that when collecting and consolidating data from various sources, it’s possible that duplicates pop up. In particular, what makes an individual record unique is different for different systems.

An investment account summary is attached to an account number. A portfolio summary might be stored at an individual or household level. And the trading histories of all those accounts are stored at the individual transaction level.

It’s important to be clear about what is supposed to differentiate unique records in your data file. For example, if it’s a transaction level file, then account numbers and household IDs will be duplicated. As long as you understand this and are doing a transaction level analysis, you will be fine.

But if you are interested in using this data to analyze the number of accounts held by each household, you will run into problems. The households that trade more frequently will have more records than those that don’t trade very much. You need to have a file at the account level.

Removing duplicate records is not particularly difficult. Most statistical packages and database systems have built-in commands that group records together. (In fact, in the database language SQL, this command is called Group By.)