I have to compare two people datasets between each other.
Let's say i have a data frame with few columns a =.
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
1 | Mattias Adams | M | UK
2 | James Alan | M | Canada
3 | Dana Benton | F | USA
4 | Ella Collins | F | USA
And b =
ID | First_Name | Last_name | Third_name | Whole_name | Gender
————————————————————————————————————————————————————————————————————————————
1 | Gary | Cole | Allan | Gary Allan Cole | M
2 | Dana | Benton | NA | Dana Benton | F
3 | Lena | Jamison | Anne | Lena Anne Jamison | F
4 | Matt | King | NA | Matt King | M
Data frame a is the bigger one contains around 100,000 rows while b contains less than 1,000.
The goal is to use the data in b to find matching records in a. So that the whole row in a is returned if there is a match.
I want to try two ways. First to find exact matches from b$"Whole_name" in a$"Name".
Exact matching:
eue_wn <- as.character(b$"Whole_name")
eue_wn_match <- a[which(as.character(a$"Name") %in% eue_wn),]
if (nrow(eue_wn_match) == 0) {
eue_wn_match <- "No matches"
}
Output of eue_wn_matc in this case would be:
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
3 | Dana Benton | F | USA
Pattern matching:
eup_ln <- paste(as.character(b$"Last_name"), collapse = "|")
eup_fn <- paste(as.character(b$"First_Name"), collapse = "|")
eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")
eup_match <- a[which(grepl(eup_ln, as.character(a$"Name"), ignore.case = TRUE)),] #First filter (last name)
if (nrow(eup_match) == 0) {
eup_match <- "No matches"
}
if (nrow(eup_match) > 0) {
eup_match2 <- eup_match[which(grepl(eup_fn, as.character(eup_match$"Name"), ignore.case = TRUE)),] #Second filter (first name)
if (nrow(eup_match2) == 0 ) {
eup_match2 <- "No matches"
}
}
if (nrow(eup_match2) > 0) {
eup_match3 <- eup_match2[which(grepl(eup_tn, as.character(eup_match2$"Name"), ignore.case = TRUE)),] #Third filter (third_name)
if (nrow(eup_match3) == 0 ) {
eup_match3 <- "No matches"
}
}
So in this process the matching takes place in 3 stages.
First eup_match is the result of finding the last name. Than it takes that result and looks for second match which is the first name name results eup_match2 shows record that matches both conditions. Finally the last result is taken and is being matched also with the third name eup_match3
in this case the result of all three of them is the same:
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
3 | Dana Benton | F | USA
And that is incorrect. Only eup_match and eup_match2 should have that output. Since in the first stage we were matching Dana Benton(a) and Dana(b)
In the next stage the match was Dana Benton(a) and Benton (b). And since she does not have a third name it is impossible to match her with third name.
The problem is in:
eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")
The output off this is :
"Allan|NA|Anne|NA"
Because the NA was converted to character the function was able to find pattern in a and b. In this particular case Dana Benson (a) and NA (b)
Any idea on how to correct that ?
Another question is related to the output. Is there any way to output both results from a and b
Example: if we are only matching the a$Name with b$First_Name by patterns the result would be
ID | Name | Gender | Country | Match | Match ID
———————————————————————————————————————————————————————————————————————————
1 | Mattias Adams | M | UK | Matt | 4
3 | Dana Benton | F | USA | Dana | 2
So that the first 4 columns are from the data set a and the last two from b
Columns Match | Match ID would show based on what were the records in b matched.
The desired output for the test example given would be:
ID | Name | Gender | Country
——————————————————————————————————————————————————————————
3 | Dana Benton | F | USA
Sorry for the long post. I tried to make it as clear as possible. If anyone would like to recreate this, xlsx files a and b as well as the r code can be found here: MyDropbox
If anyone has other suggestions on how to approach this topic is welcome to present them. Thank you for the help.