I get different results when I run a function with mutate


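I have a dataset of people who belong to different groups. A person can be mentioned several times, under name variants and with different person_id values. My plan is to compare the person_name values within each group and compute a score that tells me how similar they are, so that I can put together all the person_id values that belong to the same person.

Here is an example: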

dt <- data.frame(
  group_id = c(1,1,1,2,2,2,3,3,3),
  person_id = c(1,2,3,4,4,5,6,7,8),
  person_name = c("Smith, J.", "Areta Franklin", "John Smith", "Robert Mitchum", "Robert Mitchum", "Cary Grant", "John Rambo", "Martin Sheen", "Rambo John")
)

dt

  group_id person_id    person_name
1        1         1      Smith, J.
2        1         2 Areta Franklin
3        1         3     John Smith
4        2         4 Robert Mitchum
5        2         4 Robert Mitchum
6        2         5     Cary Grant
7        3         6     John Rambo
8        3         7   Martin Sheen
9        3         8     Rambo John

I tried the following:

library(stringdist)
library(dplyr)

simil_names <- function(name1, name2) {
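  # Lowercase, then drop everything except letters, digits and spaces.
  # ("[A-z]" would also match the characters [ \ ] ^ _ ` that sit between "Z" and "a".)
  name1 <- gsub("[^a-z0-9 ]","",tolower(name1))
  name2 <- gsub("[^a-z0-9 ]","",tolower(name2))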
  name1.v <- sort(unlist(strsplit(name1," ")))
  name2.v <- sort(unlist(strsplit(name2," ")))
  dist <- 0
  if (length(name1.v) == length(name2.v)) {
    # In this case we have the same number of words between the 2 strings
    # Therefore we will check for each word if it's the same or if the first letter is the same
    for (i in seq_along(name1.v)) {
      if ((nchar(name1.v[i]) == 1 || nchar(name2.v[i]) == 1) && nchar(name1.v[i]) + nchar(name2.v[i]) > 2) {
        # Case of "j" in one string and "jeff" in the other string
        # we compare only the first letter, if it's the same then the distance is 0, otherwise we compute the full Damerau-Levenshtein distance
        if (substr(name1.v[i],1,1) != substr(name2.v[i],1,1)) {
          dist <- dist + stringdist(name1.v[i],name2.v[i], method = 'dl')
        }
      } else {
        # Case of 2 words of more than one letter
        dist <- dist + stringdist(name1.v[i],name2.v[i], method = 'dl')
      }
    }
  } else {
    # Here we compare both strings in alphabetical order
    dist.a <- stringdist(paste(name1.v, collapse = " "),paste(name2.v, collapse = " "), method = 'dl')
    dist.n <- stringdist(name1,name2, method = 'dl')
    dist <- min(c(dist.a,dist.n))
  }
  return(dist)
}

dt2 <- inner_join(dt, dt, by = c("group_id"), multiple = "all") %>%
  filter(person_id.x > person_id.y) %>%
  mutate(score = simil_names(person_name.x, person_name.y)) %>%
  unique()

I expected to get this result:

  group_id person_id.x  person_name.x person_id.y  person_name.y score
1        1           2 Areta Franklin           1      Smith, J.    13
2        1           3     John Smith           1      Smith, J.     0
3        1           3     John Smith           2 Areta Franklin    13
4        2           5     Cary Grant           4 Robert Mitchum    11
5        3           7   Martin Sheen           6     John Rambo    10
6        3           8     Rambo John           6     John Rambo     0
7        3           8     Rambo John           7   Martin Sheen    10

However, I get this instead:

  group_id person_id.x  person_name.x person_id.y  person_name.y score
1        1           2 Areta Franklin           1      Smith, J.    64
2        1           3     John Smith           1      Smith, J.    64
3        1           3     John Smith           2 Areta Franklin    64
4        2           5     Cary Grant           4 Robert Mitchum    64
5        3           7   Martin Sheen           6     John Rambo    64
6        3           8     Rambo John           6     John Rambo    64
7        3           8     Rambo John           7   Martin Sheen    64

I don't understand why the result is different when I run the function with mutate. Thank you very much for your help.

r dplyr
1 Answer

Thanks to @one and @MrFlick for their answers and references.
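The reason every row gets the same score is that mutate evaluates the expression once, passing whole columns, while a function built on strsplit/unlist collapses all rows into a single value that is then recycled. The sketch below reproduces that effect in base R; count_words and names_col are hypothetical names chosen for illustration, not part of the original code.

```r
# Hypothetical scalar helper: counts the words in ONE name.
count_words <- function(name) {
  length(unlist(strsplit(name, " ")))
}

names_col <- c("John Smith", "Areta Franklin", "Cary Grant")

# One name in, one count out.
one_name <- count_words(names_col[1])

# Whole column in: unlist() merges the words of all rows,
# so we still get ONE number (the total over all rows).
whole_column <- count_words(names_col)

# Applying the helper element by element gives one count per row.
per_row <- vapply(names_col, count_words, integer(1), USE.NAMES = FALSE)

print(one_name)
print(whole_column)
print(per_row)
```

Inside mutate, a length-one result like whole_column is recycled to every row, which matches the constant score column in the question.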

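simil_names is written for a single pair of names, whereas mutate passes it entire columns, so it returned one value that was recycled to every row. I ended up creating a vectorized wrapper with

simil_names.v <- Vectorize(simil_names)

and calling simil_names.v inside mutate. It now works as expected.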
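For reference, here is a minimal sketch of how a Vectorize wrapper can be used inside mutate. To keep it short, it uses a hypothetical stand-in, sorted_word_dist, instead of the full simil_names, and base R adist() (plain Levenshtein distance) instead of stringdist(method = 'dl'); the pairs data frame is also made up for the example.

```r
library(dplyr)

# Hypothetical stand-in for simil_names(): written for ONE pair of names.
sorted_word_dist <- function(name1, name2) {
  w1 <- sort(unlist(strsplit(tolower(name1), " ")))
  w2 <- sort(unlist(strsplit(tolower(name2), " ")))
  # adist() is base R Levenshtein distance (assumption: used here only to
  # avoid an extra package; the original uses stringdist with method = 'dl').
  as.numeric(adist(paste(w1, collapse = " "), paste(w2, collapse = " ")))
}

pairs <- data.frame(
  person_name.x = c("Rambo John", "Martin Sheen"),
  person_name.y = c("John Rambo", "John Rambo")
)

# Vectorize() wraps the scalar function in mapply(), so it is called
# once per row instead of once for the whole column.
sorted_word_dist.v <- Vectorize(sorted_word_dist, USE.NAMES = FALSE)

res <- pairs %>%
  mutate(
    score_vectorize = sorted_word_dist.v(person_name.x, person_name.y),
    # Equivalent explicit form without the wrapper.
    score_mapply = mapply(sorted_word_dist, person_name.x, person_name.y,
                          USE.NAMES = FALSE)
  )

print(res)
```

Both columns should agree, since Vectorize is a thin wrapper around mapply. Grouping the data with rowwise() before mutate is another option that calls the function one row at a time.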
