How to Remove Duplicate Data in R
A very useful application of subsetting is finding and removing duplicate values. R's duplicated() function returns a logical vector that tells you, for each value, whether it duplicates an earlier value. In other words, duplicated() returns FALSE for the first occurrence of a value and TRUE for every later occurrence, as in the following example:
> duplicated(c(1,2,1,3,1,4))
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
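As a minimal sketch of how this logical vector can be put to work, the snippet below negates it and uses it to subset the same vector, which keeps only the first occurrence of each value. The variable names x and deduped are illustrative and do not come from the original text.

```r
# Example vector with repeated values (same values as above)
x <- c(1, 2, 1, 3, 1, 4)

# duplicated(x) is TRUE for every repeat of an earlier value,
# so !duplicated(x) is TRUE only for first occurrences
deduped <- x[!duplicated(x)]

print(deduped)
```

For a plain vector like this one, unique(x) returns the same values.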
If you try this on a data frame, R automatically checks the observations (meaning, it treats every row as a value). So, for example, with the data frame iris:
> duplicated(iris)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
....
[136] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE
If you look carefully, you notice that row 143 is a duplicate (because the 143rd element of your result has the value TRUE). You can also find this position directly by using the which() function, which returns the indices of the TRUE values:
> which(duplicated(iris))
[1] 143
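To see which earlier row this one duplicates, one possible approach is to run duplicated() a second time with its fromLast argument set to TRUE, which flags the earlier copy instead of the later one. The sketch below assumes the built-in iris data set; the names dup and first are illustrative.

```r
# Position of the row flagged as a duplicate
dup <- which(duplicated(iris))

# With fromLast = TRUE the search runs from the bottom up,
# so the earlier copy of the pair is the one that gets flagged
first <- which(duplicated(iris, fromLast = TRUE))

# Print both rows to compare them column by column
print(iris[c(first, dup), ])
```

In the built-in iris data, this pairs row 143 with row 102.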
Now, to remove the duplicate from iris, you need to exclude this row from your data. Remember that there are two ways to exclude data using subsetting:
Specify a logical vector, where FALSE means that the element will be excluded. The ! (exclamation point) operator is a logical negation. This means that it converts TRUE into FALSE and vice versa. So, to remove the duplicates from iris, you do the following:
> iris[!duplicated(iris), ]
Specify negative values. In other words:
> index <- which(duplicated(iris))
> iris[-index, ]
In both cases, you’ll notice that your instruction has removed row 143.
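The two approaches can be compared side by side, as in the sketch below. One caveat about the negative-index approach: if a data frame has no duplicates, which() returns an empty vector, and subsetting with a negated empty vector selects zero rows instead of all of them. The length check here is one possible guard against that, not part of the original instructions, and the names by_logical and by_negative are illustrative.

```r
# Approach 1: logical index, keeping rows that are not duplicates
by_logical <- iris[!duplicated(iris), ]

# Approach 2: negative index, guarded against the no-duplicates case
index <- which(duplicated(iris))
by_negative <- if (length(index) > 0) iris[-index, ] else iris

# Each result has 149 of the original 150 rows
print(nrow(by_logical))
print(nrow(by_negative))

# Check that the two results agree
print(all.equal(by_logical, by_negative))
```

For data frames, unique(iris) is a shorter way to get the same rows as the logical-index approach.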









