Assigning unique id to duplicated rows

Question

If I have a data frame that looks like this:

x y
13 a
14 b
15 c
15 c
14 b

and I wanted each group of equal rows to have a unique id, like this:

x y id
13 a 1
14 b 2
15 c 3
15 c 3
14 b 2

Is there any easy way of doing this?

Thanks

Is your example overly simplistic, or does it contain a typo? Here id is determined entirely by x. — Jouni Helske, Mar 08 '13 at 20:49
A similar question using `data.table`: http://stackoverflow.com/questions/13018696/data-table-key-indices-or-group-counter — flodel, Mar 08 '13 at 20:57
@by0 check my improved solution, which uses the `interaction` function instead of `paste0`. — Jouni Helske, Mar 08 '13 at 21:40

score 4 · Answer 1 · edited May 23 '17 at 12:07

I have a bit of a concern with the paste0 approach. If your columns contained more complex data, you could end up with surprising results, e.g. imagine:

 x  y
ab  c
 a bc

One solution is to replace paste0(...) with paste(..., sep = "@"). Even so, no sep is general enough to work with every kind of data: there is always a non-zero probability that sep itself appears somewhere in the data.
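To make the concern concrete, here is a minimal sketch in base R. The data frames df and df2 are made-up examples (df2 is not from the original discussion); they only illustrate how two different rows can collapse to the same key.

```r
# Two distinct rows whose values concatenate to the same string
df <- data.frame(x = c("ab", "a"), y = c("c", "bc"), stringsAsFactors = FALSE)

key_paste0 <- paste0(df$x, df$y)            # both rows become "abc"
key_sep    <- paste(df$x, df$y, sep = "@")  # "ab@c" and "a@bc": distinct

# A separator only helps while it never occurs in the data itself
df2 <- data.frame(x = c("a@b", "a"), y = c("c", "b@c"), stringsAsFactors = FALSE)
key_clash <- paste(df2$x, df2$y, sep = "@") # both rows become "a@b@c"

print(key_paste0)
print(key_sep)
print(key_clash)
```

With real data such a clash is unlikely, but nothing in the pasted key tells you when it has happened.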

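A more robust approach is split/apply/combine: split the data frame into groups of identical rows, tag each group with a counter, and recombine. You can certainly do it with the base package, but plyr makes it a bit easier:

library(plyr)
.idx <- 0L  # counter, incremented once per group of identical rows
# df is your data frame; grouping by every column puts identical rows in the same group
ddply(df, colnames(df), transform, id = (.idx <<- .idx + 1L))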

If this is too slow, I would recommend a data.table approach, as proposed here: data.table "key indices" or "group counter" (http://stackoverflow.com/questions/13018696/data-table-key-indices-or-group-counter)
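For the base-package route mentioned above, one possible sketch avoids string keys altogether: sort the rows by every column, then bump a counter whenever any column changes from one sorted row to the next. The helper name group_id is hypothetical (it is not part of base R or of this answer), and the sketch assumes the columns contain no NA values.

```r
# Hypothetical helper: group ids without pasting columns into a string key
group_id <- function(df) {
  cols <- unname(as.list(df))
  ord <- do.call(order, cols)        # row order that sorts by every column
  # TRUE wherever a sorted row differs from the previous one in any column
  changed <- Reduce(`|`, lapply(cols, function(col) {
    s <- col[ord]
    c(TRUE, s[-1] != s[-length(s)])
  }))
  id <- integer(nrow(df))
  id[ord] <- cumsum(changed)         # map the ids back to the original row order
  id
}

# The data from the question
df <- data.frame(x = c(13, 14, 15, 15, 14), y = c("a", "b", "c", "c", "b"))
df$id <- group_id(df)
print(df)
```

The ids are numbered in sorted order rather than in order of first appearance; for the data in the question the two coincide.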

answered Mar 08 '13 at 21:27

flodel

Good point about `paste0`. I added a better solution, which is actually neater than the original answer. – Jouni Helske Mar 08 '13 at 21:39
@Hemmo. I think using `interaction` is equivalent to using `paste(..., sep = '.')`; theoretically, it suffers the same (unlikely) problem I was discussing. – flodel Mar 08 '13 at 21:43
Oh yes, you are right that they produce the same thing, but both work correctly in the situation you discussed, as you get `ab.c` and `a.bc`, which are distinct. And I guess that is what is wanted. `paste0` doesn't work properly there, though (it only works if separation is not wanted). – Jouni Helske Mar 08 '13 at 21:48

Jouni Helske · Accepted Answer · 2013-03-08T21:49:49.067

3

This is the first thing I thought of:

Make a new variable which combines the two columns by pasting their values together into strings:

a <- paste0(z$x, z$y)  # z is your data.frame

Then make this a factor and bind it to your data frame:

cbind(z, id = factor(a, labels = 1:length(unique(a))))

EDIT: @flodel was concerned about using paste0, so it's better to use ordinary paste, or interaction:

a <- interaction(z, drop = TRUE)
cbind(z, id = factor(a, labels = 1:length(unique(a))))

This assumes that you want to treat x=ab, y=c and x=a, y=bc as different rows. If not, use paste0.
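As a quick check, the interaction version can be run on the data from the question. This is only a sketch: the variable names z, a and first_seen are illustrative, and taking the factor's integer codes directly is a shortcut that is not in the answer above.

```r
# The data from the question
z <- data.frame(x = c(13, 14, 15, 15, 14), y = c("a", "b", "c", "c", "b"))

# One factor level per distinct row; drop = TRUE discards combinations that never occur
a <- interaction(z, drop = TRUE)

# The integer codes of the factor can serve directly as ids
z$id <- as.integer(a)

# Alternative numbering: ids in order of first appearance rather than level order
first_seen <- match(a, unique(a))

print(z)
print(first_seen)
```

For this data both numberings give 1, 2, 3, 3, 2. On other data they can differ, because the factor codes follow the level order of interaction rather than the row order.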

edited Mar 08 '13 at 21:49

answered Mar 08 '13 at 20:53

Jouni Helske

(+1) I'd change `a` to `do.call(paste0, z)`. And `1:length(unique(a))` to `seq_along(unique(a))` – Arun Mar 08 '13 at 20:59
Good points. I never remember `seq_along`, and `do.call` was new to me. Thanks. – Jouni Helske Mar 08 '13 at 21:06
