sachinv February 2016

Filter duplicated rows in R data.frame

I have a data.frame as shown below.

> df2 <- data.frame("StudentId" = c(1,1,1,2,2,3,3), "Subject" = c("Maths", "Maths", "English","Maths", "English", "Science", "Science"), "Score" = c(100,90,80,70, 60,20,10))
> df2
  StudentId Subject Score
1         1   Maths   100
2         1   Maths    90
3         1 English    80
4         2   Maths    70
5         2 English    60
6         3 Science    20
7         3 Science    10

A few StudentIds have duplicated values in the Subject column (for example, ID 1 has two entries for "Maths"). I need to keep only the first row of each set of duplicates. The expected data.frame is:

  StudentId Subject Score
1         1   Maths   100
3         1 English    80
4         2   Maths    70
5         2 English    60
6         3 Science    20

I have not been able to do this. Any ideas?

Answers


akrun February 2016

We can use unique from data.table with the by argument, after converting the data.frame to a data.table by reference with setDT(df2):

library(data.table)
unique(setDT(df2), by = c("StudentId", "Subject"))
#   StudentId Subject Score
#1:         1   Maths   100
#2:         1 English    80
#3:         2   Maths    70
#4:         2 English    60
#5:         3 Science    20

Or distinct from dplyr (current dplyr versions need .keep_all = TRUE to retain the Score column):

library(dplyr)
distinct(df2, StudentId, Subject, .keep_all = TRUE)
#     StudentId Subject Score
#       (dbl)  (fctr) (dbl)
#1         1   Maths   100
#2         1 English    80
#3         2   Maths    70
#4         2 English    60
#5         3 Science    20

Or duplicated from base R

df2[!duplicated(df2[1:2]),]
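
To make the base R approach easier to follow, here is a minimal sketch of what duplicated returns for this data. It rebuilds df2 as a plain data.frame (not one already converted by setDT), and the names key_cols, is_dup and first_rows are only illustrative, not part of the original answer.

```r
# Rebuild the example data so the sketch is self-contained.
df2 <- data.frame(
  StudentId = c(1, 1, 1, 2, 2, 3, 3),
  Subject   = c("Maths", "Maths", "English", "Maths", "English", "Science", "Science"),
  Score     = c(100, 90, 80, 70, 60, 20, 10)
)

# Name the key columns instead of relying on their positions (1:2).
key_cols <- c("StudentId", "Subject")

# TRUE marks a row whose StudentId/Subject pair already appeared further up.
is_dup <- duplicated(df2[key_cols])
is_dup
# [1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE

# Negate the flags to keep only the first row of each pair.
first_rows <- df2[!is_dup, ]
```

If the last row of each pair should be kept instead, duplicated also accepts fromLast = TRUE.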

EDIT: Based on suggestions by @David Arenburg.

