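I have a dataset of people who belong to different groups. The same person can be mentioned several times, under name variants and with different person_id values. My plan is to compare the person_name values within each group and compute a score that tells me how similar they are, so that I can put together all the person_id values that belong to one person.
Here is an example: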
dt <- data.frame(
  group_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  person_id = c(1, 2, 3, 4, 4, 5, 6, 7, 8),
  person_name = c("Smith, J.", "Areta Franklin", "John Smith", "Robert Mitchum", "Robert Mitchum", "Cary Grant", "John Rambo", "Martin Sheen", "Rambo John")
)
dt
  group_id person_id    person_name
1        1         1      Smith, J.
2        1         2 Areta Franklin
3        1         3     John Smith
4        2         4 Robert Mitchum
5        2         4 Robert Mitchum
6        2         5     Cary Grant
7        3         6     John Rambo
8        3         7   Martin Sheen
9        3         8     Rambo John
I tried the following:
library(stringdist)
library(dplyr)
simil_names <- function(name1, name2) {
  name1 <- gsub("[^A-z0-9 ]", "", tolower(name1))
  name2 <- gsub("[^A-z0-9 ]", "", tolower(name2))
  name1.v <- sort(unlist(strsplit(name1, " ")))
  name2.v <- sort(unlist(strsplit(name2, " ")))
  dist <- 0
  if (length(name1.v) == length(name2.v)) {
    # In this case we have the same number of words between the 2 strings
    # Therefore we will check for each word if it's the same or if the first letter is the same
    for (i in c(1:length(name1.v))) {
      if ((nchar(name1.v[i]) == 1 || nchar(name2.v[i]) == 1) && nchar(name1.v[i]) + nchar(name2.v[i]) > 2) {
        # Case of "j" in one string and "jeff" in the other string
        # we compare only the first letter, if it's the same then the distance is 0, otherwise we compute the full Damerau-Levenshtein distance
        if (substr(name1.v[i], 1, 1) != substr(name2.v[i], 1, 1)) {
          dist <- dist + stringdist(name1.v[i], name2.v[i], method = 'dl')
        }
      } else {
        # Case of 2 words of more than one letter
        dist <- dist + stringdist(name1.v[i], name2.v[i], method = 'dl')
      }
    }
  } else {
    # Here we compare both strings in alphabetical order
    dist.a <- stringdist(paste(name1.v, collapse = " "), paste(name2.v, collapse = " "), method = 'dl')
    dist.n <- stringdist(name1, name2, method = 'dl')
    dist <- min(c(dist.a, dist.n))
  }
  return(dist)
}
dt2 <- unique(
  inner_join(dt, dt, by = c("group_id"), multiple = "all") %>%
    filter(person_id.x > person_id.y) %>%
    mutate(score = simil_names(person_name.x, person_name.y))
)
I expected to get this result:
  group_id person_id.x  person_name.x person_id.y  person_name.y score
1        1           2 Areta Franklin           1      Smith, J.    13
2        1           3     John Smith           1      Smith, J.     0
3        1           3     John Smith           2 Areta Franklin    13
4        2           5     Cary Grant           4 Robert Mitchum    11
5        3           7   Martin Sheen           6     John Rambo    10
6        3           8     Rambo John           6     John Rambo     0
7        3           8     Rambo John           7   Martin Sheen    10
However, this is what I get:
  group_id person_id.x  person_name.x person_id.y  person_name.y score
1        1           2 Areta Franklin           1      Smith, J.    64
2        1           3     John Smith           1      Smith, J.    64
3        1           3     John Smith           2 Areta Franklin    64
4        2           5     Cary Grant           4 Robert Mitchum    64
5        3           7   Martin Sheen           6     John Rambo    64
6        3           8     Rambo John           6     John Rambo    64
7        3           8     Rambo John           7   Martin Sheen    64
I don't understand why the result is different when I run the function inside mutate. Thank you very much for your help.
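A small base-R sketch of what is most likely happening: mutate passes whole columns to simil_names, not one name at a time, and unlist(strsplit(...)) then flattens the words of every row into a single vector. The vector x below is made-up illustration data, and the regex is simplified to lowercase letters only, which is an assumption for the sketch.

```r
# A name column as mutate() would hand it over: a whole vector, not one value.
# (x is illustrative data, not taken from the real dataset)
x <- c("John Smith", "Cary Grant", "Rambo John")

# Same cleaning and splitting steps as in simil_names
# (regex simplified here, since tolower() has already been applied).
clean <- gsub("[^a-z0-9 ]", "", tolower(x))
words <- sort(unlist(strsplit(clean, " ")))

# unlist() has pooled the words of all three names into one vector,
# so anything computed from it describes the whole column at once.
print(words)
print(length(words))
```

Since the function then sums distances over that pooled vector and returns a single number, mutate recycles the same value into every row, which is consistent with the column of identical scores above.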
Thanks to @one and @MrFlick for their answers and references.
I ended up using
simil_names.v <- Vectorize(simil_names)
to create a function, and now it works as expected.
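For readers wondering what Vectorize changes, here is a minimal base-R sketch. The helper shared_words is a hypothetical stand-in for simil_names (it only counts common words, so it needs no packages), and the two name vectors are made-up data. Vectorize wraps the function in mapply, so it is called once per pair of elements instead of once on the full vectors.

```r
# Hypothetical scalar helper: written with ONE pair of names in mind.
shared_words <- function(a, b) {
  words_a <- unlist(strsplit(tolower(a), " "))
  words_b <- unlist(strsplit(tolower(b), " "))
  length(intersect(words_a, words_b))
}

# Illustrative data, not from the real dataset.
x <- c("John Smith", "Cary Grant", "Rambo John")
y <- c("Smith John", "Robert Mitchum", "John Rambo")

# Called on whole vectors, the words of all rows are pooled: one number comes back.
pooled <- shared_words(x, y)

# The vectorized wrapper calls the helper once per (x[i], y[i]) pair.
shared_words.v <- Vectorize(shared_words, USE.NAMES = FALSE)
per_row <- shared_words.v(x, y)

# Vectorize is a thin wrapper around mapply, so this gives the same result.
same_as_mapply <- identical(per_row, mapply(shared_words, x, y, USE.NAMES = FALSE))

print(pooled)   # 3
print(per_row)  # 2 0 2
```

In the pipeline from the question, the corresponding change would be to call simil_names.v(person_name.x, person_name.y) inside mutate instead of simil_names.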