This should be simple, but it's got me pulling my hair out!

Here is some data:

Clicks <- c(1,2,3,4,5,6,5,4,3,2)
Cost <- c(10,11,12,13,14,15,14,13,12,11)
Cluster <- c(1,1,1,2,2,1,1,1,1,1)
df <- data.frame(Clicks,Cost,Cluster)

I want to filter my df by cluster, add a new column that assigns each row to a "test" or "control" group at random, and then recombine the result with the original data frame.

Step 1: Filter (by cluster 1)

  Clicks Cost Cluster
1      1   10       1
2      2   11       1
3      3   12       1
4      6   15       1
5      5   14       1
6      4   13       1
7      3   12       1
8      2   11       1
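A minimal sketch of Step 1, assuming plain logical indexing is acceptable here. The name `cluster1` is hypothetical. Note that base R keeps the original row names when subsetting, so they would not be renumbered 1 to 8 as in the table above unless you reset them.

```r
# Rebuild the example data from the question
Clicks  <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2)
Cost    <- c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
Cluster <- c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1)
df <- data.frame(Clicks, Cost, Cluster)

# Keep only the rows that belong to cluster 1 (hypothetical name: cluster1)
cluster1 <- df[df$Cluster == 1, ]

nrow(cluster1)      # 8 rows
rownames(cluster1)  # original positions: "1" "2" "3" "6" "7" "8" "9" "10"
```

Keeping the original row names is handy later, because they record where each row came from in `df`.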

Step 2: Assign test and control group at random

  Clicks Cost Cluster   group
1      1   10       1    Test
2      2   11       1 Control
3      3   12       1 Control
4      6   15       1    Test
5      5   14       1 Control
6      4   13       1 Control
7      3   12       1    Test
8      2   11       1 Control
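One possible way to do Step 2, assuming you want the two groups to be the same size (the example table above is not balanced, so this is an assumption): build a vector with the right number of each label and shuffle it with `sample()`. The seed and the names `cluster1` and `labels` are hypothetical.

```r
set.seed(42)  # arbitrary seed, only so the shuffle is reproducible

# The eight cluster-1 rows from Step 1
cluster1 <- data.frame(
  Clicks  = c(1, 2, 3, 6, 5, 4, 3, 2),
  Cost    = c(10, 11, 12, 15, 14, 13, 12, 11),
  Cluster = 1
)

# Alternate the two labels up to the number of rows, then shuffle the order
n <- nrow(cluster1)
labels <- rep(c("Test", "Control"), length.out = n)
cluster1$group <- sample(labels)

table(cluster1$group)  # 4 Control, 4 Test
```

With an odd number of rows, `rep(..., length.out = n)` gives the first label ("Test") one extra row.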

Step 3: Get back to the original data frame

   Clicks Cost Cluster   group
1       1   10       1    Test
2       2   11       1 Control
3       3   12       1 Control
4       4   13       2    NULL
5       5   14       2    NULL
6       6   15       1    Test
7       5   14       1 Control
8       4   13       1 Control
9       3   12       1    Test
10      2   11       1 Control
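A sketch of Step 3 that avoids a separate merge: create the column on the full data frame first, then write the labels into just the cluster-1 rows through a logical index. This assumes `NA` is an acceptable stand-in for the `NULL` entries shown above (a data frame column cannot hold a true `NULL`); `in_cluster` is a hypothetical name.

```r
set.seed(42)  # arbitrary seed for reproducibility

Clicks  <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2)
Cost    <- c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
Cluster <- c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1)
df <- data.frame(Clicks, Cost, Cluster)

# Start with an empty (NA) group column on the full data frame
df$group <- NA_character_

# Logical index marking the rows of cluster 1
in_cluster <- df$Cluster == 1

# Assign shuffled labels to those rows only; all other rows stay NA
df$group[in_cluster] <- sample(rep(c("Test", "Control"),
                                   length.out = sum(in_cluster)))
df
```

Because the assignment happens in place, the rows never leave `df` and their order is unchanged.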

Step 4: Do the same for cluster 2

Thanks :)

Shinobi_Atobe
  • If the elements of group are assigned at random for both Clusters, why do you need to split them first? – aichao Sep 13 '16 at 19:59
  • aichao is correct - if you are assigning with 50% probability for test or control, it won't matter if you split first: `df$group = ifelse(runif(nrow(df)) < 0.5, 'test', 'control')`. If you want even splits within each group then something like `library(dplyr); group_by(df, Cluster) %>% mutate(draw = runif(n()), group = ifelse(draw < median(draw), 'test', 'control'))` – Gregor Thomas Sep 13 '16 at 20:05
  • @Gregor, you guessed it right, equal numbers in both groups. Your solution works perfectly, could you perhaps explain what it's doing? – Shinobi_Atobe Sep 14 '16 at 07:12
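As for what the `dplyr` one-liner in the comment is doing, my reading is: every row gets a uniform random number, and within each cluster the rows whose number falls below that cluster's median become 'test' while the rest become 'control'. Since half the draws lie below the median, the split is even. Here is a base R restatement of the same idea, as a sketch rather than the commenter's own code; `ave()` stands in for `group_by()` plus `mutate()`, and `cluster_median` is a hypothetical name.

```r
set.seed(1)  # arbitrary seed for reproducibility

Clicks  <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2)
Cost    <- c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
Cluster <- c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1)
df <- data.frame(Clicks, Cost, Cluster)

# One uniform random draw per row
df$draw <- runif(nrow(df))

# Median of the draws within each cluster, repeated for every row of that cluster
cluster_median <- ave(df$draw, df$Cluster, FUN = median)

# Rows below their cluster's median go to 'test', the rest to 'control'
df$group <- ifelse(df$draw < cluster_median, "test", "control")

table(df$Cluster, df$group)
```

With an odd number of rows in a cluster, the median is one of the draws itself, so that row is not strictly below it and 'control' ends up one larger.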

1 Answer

How about this:

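# Placeholder so that rows outside the chosen cluster stay marked 'NULL'
df$Group <- 'NULL'

# Flip a fair coin for each cluster-1 row (group sizes are not guaranteed to be equal)
df1 <- df
df1$Group[df1$Cluster == 1] <- ifelse(runif(sum(df1$Cluster == 1)) > 0.5, 'Control', 'Test')
df1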
   Clicks Cost Cluster   Group
1       1   10       1    Test
2       2   11       1    Test
3       3   12       1    Test
4       4   13       2    NULL
5       5   14       2    NULL
6       6   15       1 Control
7       5   14       1    Test
8       4   13       1    Test
9       3   12       1 Control
10      2   11       1 Control

# Same again for cluster 2, on a fresh copy of df
df2 <- df
df2$Group[df2$Cluster == 2] <- ifelse(runif(sum(df2$Cluster == 2)) > 0.5, 'Control', 'Test')
df2
   Clicks Cost Cluster   Group
1       1   10       1    NULL
2       2   11       1    NULL
3       3   12       1    NULL
4       4   13       2    Test
5       5   14       2 Control
6       6   15       1    NULL
7       5   14       1    NULL
8       4   13       1    NULL
9       3   12       1    NULL
10      2   11       1    NULL
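The two results above live in separate copies (`df1` and `df2`). If you would rather end up with one data frame, a loop over the cluster ids could apply the same coin flip to each cluster in turn. This is only a sketch of that variation; the loop variable `cl` and the `rows` index are hypothetical names, and `NA` is used as the placeholder instead of the string 'NULL'.

```r
set.seed(123)  # arbitrary seed for reproducibility

Clicks  <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2)
Cost    <- c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
Cluster <- c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1)
df <- data.frame(Clicks, Cost, Cluster)

df$Group <- NA_character_

# Visit each cluster once and fill in only its own rows
for (cl in unique(df$Cluster)) {
  rows <- df$Cluster == cl
  # Same fair coin flip as above: sizes within a cluster may be uneven
  df$Group[rows] <- ifelse(runif(sum(rows)) > 0.5, 'Control', 'Test')
}
df
```

Because each row gets an independent coin flip, this does not guarantee equal numbers of 'Test' and 'Control' within a cluster.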
Sandipan Dey