Thursday, 12 September 2013

Partial duplicates in r

Partial duplicates in r

I am working with a big dataframe, that looks like this one
> dput(mydata)
structure(list(X = structure(c(3L, 4L, 7L, 8L, 6L, 5L, 1L, 2L
), .Label = c("ABCDE.no.m-1", "ABCDE.no.m-5", "IOJRE.no.m-1",
"IOJRE.no.m-2", "OHIJR.no.m-1", "OHIJR.no.m-3", "OHIJR.no.m-4",
"OHIJR.no.m-6"), class = "factor"), V1 = c(10, 10, 5, 5, 3, 2,
5, 1), V2 = c(2, 7, 5, 6, 3, 3, 6, 4), V3 = c(0, 0, 1, 1, 1,
1, 0, 0)), .Names = c("X", "V1", "V2", "V3"), row.names = c(NA,
8L), class = "data.frame")
The first 5 letters in many items in the first column are similar. I am
trying to subset the data, keeping only one of these similar items, which
has the highest V1 value. (if the V1 value is the same, it doesn't matter
which one to keep for me).
I cannot think of a proper command because this is not a duplicated(). I
thought of using aggregate() then which.max() But I really cannot
construct a proper command for the job. Here's the output I am looking to
have.
> dput(mydata2)
structure(list(X = structure(c(2L, 3L, 1L), .Label = c("ABCDE.no.m-1",
"IOJRE.no.m-1", "OHIJR.no.m-4"), class = "factor"), V1 = c(10L,
5L, 5L), V2 = c(2L, 5L, 6L), V3 = c(0L, 1L, 0L)), .Names = c("X",
"V1", "V2", "V3"), class = "data.frame", row.names = c(NA, -3L
))
Can anyone help me with this?
Many thanks,

No comments:

Post a Comment