
How to Deal with Duplicate Values in Your Data

By: Alan Anderson and David Semmelroth

Updated: 03-26-2016

From The Book: Statistics for Big Data For Dummies

Different systems store data in different ways, so it's no surprise that duplicates can pop up when you collect and consolidate data from various sources. In particular, what makes an individual record unique differs from system to system.

An investment account summary is attached to an account number. A portfolio summary might be stored at an individual or household level. And the trading histories of all those accounts are stored at the individual transaction level.

It's important to be clear about what is supposed to differentiate unique records in your data file. For example, in a transaction-level file, account numbers and household IDs will be duplicated, because a single account can have many transactions. As long as you understand this and are doing a transaction-level analysis, you will be fine.
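One way to see which field actually differentiates records is to count how often each value appears. The following is a minimal sketch in Python, not a method taken from the book; the records, field layout, and the helper name duplicated_values are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical transaction-level records (made-up data):
# (transaction_id, account_number, household_id)
transactions = [
    ("T1", "A100", "H1"),
    ("T2", "A100", "H1"),
    ("T3", "A200", "H1"),
    ("T4", "A300", "H2"),
]

def duplicated_values(records, field_index):
    """Return each value that appears more than once in a field, with its count."""
    counts = Counter(record[field_index] for record in records)
    return {value: n for value, n in counts.items() if n > 1}

# Transaction IDs are unique, so they differentiate records in this file.
transaction_dupes = duplicated_values(transactions, 0)
# Account numbers and household IDs repeat, as expected at this level.
account_dupes = duplicated_values(transactions, 1)
household_dupes = duplicated_values(transactions, 2)
```

In this made-up file only the transaction ID comes back with no repeats, which is what you would expect from a transaction-level file.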

But if you want to use this data to analyze the number of accounts held by each household, you will run into problems. Households that trade more frequently will have more records than those that trade less, so simply counting records overstates how many accounts they hold. For that analysis, you need a file at the account level, with one record per account.
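The problem can be illustrated with a small sketch in Python. The data below is invented for this example: household H1 holds one account that trades often, while household H2 holds two accounts that each trade once. Counting records and counting distinct accounts then give different answers.

```python
from collections import defaultdict

# Hypothetical transaction-level records (made-up data):
# (transaction_id, account_number, household_id)
transactions = [
    ("T1", "A100", "H1"),
    ("T2", "A100", "H1"),
    ("T3", "A100", "H1"),
    ("T4", "A100", "H1"),
    ("T5", "A200", "H2"),
    ("T6", "A300", "H2"),
]

records_per_household = defaultdict(int)
accounts_per_household = defaultdict(set)
for _transaction_id, account, household in transactions:
    records_per_household[household] += 1           # counts every trade
    accounts_per_household[household].add(account)  # a set keeps each account once

# Misleading: this is the number of transactions, not the number of accounts.
naive_counts = dict(records_per_household)
# Correct for this question: distinct accounts per household.
account_counts = {h: len(accounts) for h, accounts in accounts_per_household.items()}
```

Here the record count makes H1 look like the larger household (4 versus 2), while the distinct-account count shows it holds a single account and H2 holds two.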

Removing duplicate records is not particularly difficult. Most statistical packages and database systems have built-in commands that group records sharing the same key into a single record. (In fact, in the database language SQL, this is done with the GROUP BY clause.)
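As a rough sketch of grouping in SQL, the example below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, and amounts are assumptions made up for this illustration; a real system will have its own schema.

```python
import sqlite3

# Hypothetical transaction-level rows (made-up data):
# (transaction_id, account_number, household_id, amount)
rows = [
    ("T1", "A100", "H1", 250.0),
    ("T2", "A100", "H1", 100.0),
    ("T3", "A200", "H2", 75.0),
    ("T4", "A300", "H2", 40.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "transaction_id TEXT, account_number TEXT, household_id TEXT, amount REAL)"
)
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)

# Collapse the transaction-level rows to one row per account.
account_level = conn.execute(
    """
    SELECT account_number, household_id,
           COUNT(*) AS trade_count, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY account_number, household_id
    ORDER BY account_number
    """
).fetchall()

# Count distinct accounts held by each household.
accounts_per_household = conn.execute(
    """
    SELECT household_id, COUNT(DISTINCT account_number)
    FROM transactions
    GROUP BY household_id
    ORDER BY household_id
    """
).fetchall()
conn.close()
```

The first query produces an account-level file from the transaction-level one. The second answers the accounts-per-household question directly, without being skewed by how often each account trades.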

About This Article

This article is from the book:

Statistics for Big Data For Dummies

About the book authors:

Alan Anderson, PhD, is a professor of economics and finance at Fordham University and New York University. He's a veteran economist, risk manager, and fixed income analyst.

David Semmelroth is an experienced data analyst, trainer, and statistics instructor who consults on customer databases and database marketing.
