Matching rows across data frames in R when they share two or more words

Question (votes: 1, answers: 1)

I have two data frames, DF1 and DF2, like this.

ID = c(1, 2, 3, 4) 
Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
DF1 = data.frame(ID, Issues, Location, Customer)

Root_Cause = c('R1', 'R2', 'R3', 'R4')
List_of_Issues = c('Issue1, Issue3, Issue5', 'Issue2, Issue1, Issue4', 'Issue6, Issue7', 'Issue5, Issue6')  
DF2 = data.frame(Root_Cause, List_of_Issues)

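I want to compare the two data frames on the "Issues" column of DF1 and the "List_of_Issues" column of DF2. If a row's "Issues" value shares two or more words with an entry in DF2's "List_of_Issues" column, I want to fill in the corresponding "Root_Cause". The resulting data frame should look like DF3.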

ID = c(1, 2, 3, 4)
Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
Root_Cause = c('R2', 'R4', NA, 'R1')
DF3 = data.frame(ID, Issues, Location, Customer, Root_Cause)
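
To make the matching rule concrete, here is a minimal sketch of how the overlap could be counted for the first row of DF1 against each entry of DF2. The helper name `split_issues` and the named vector `candidates` are made up for this illustration; the values are copied from DF1 and DF2 above.

```r
# Hypothetical helper: split one comma-separated string into trimmed tokens.
split_issues <- function(x) {
  trimws(strsplit(as.character(x), ",", fixed = TRUE)[[1]])
}

# Issues of DF1 row 1, and the List_of_Issues entries of DF2 keyed by Root_Cause.
row1 <- split_issues("Issue1, Issue4")
candidates <- c(R1 = "Issue1, Issue3, Issue5",
                R2 = "Issue2, Issue1, Issue4",
                R3 = "Issue6, Issue7",
                R4 = "Issue5, Issue6")

# Number of issues that row 1 shares with each candidate list.
overlap <- sapply(candidates, function(s) length(intersect(row1, split_issues(s))))

# Root causes whose list shares at least two issues with row 1.
names(overlap)[overlap >= 2]
```

Only R2 shares two issues (Issue1 and Issue4) with row 1, which is why DF3 has R2 in its first row.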
r dataframe textmatching
1 answer
Votes: 0

Using data.table:

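library(data.table)
setDT(DF1)
setDT(DF2)

# Split each List_of_Issues entry in DF2 into a character vector of issues
# (as.character() guards against factor columns in R versions before 4.0)
match_dir = strsplit(as.character(DF2$List_of_Issues), split = ', ', fixed = TRUE)

DF1[, Root_Cause := sapply(seq_len(nrow(DF1)), function(x) {

  issues = strsplit(as.character(Issues[x]), split = ', ', fixed = TRUE)[[1]] # Split this row's issues
  matchvec = sapply(match_dir, function(z) length(intersect(issues, z))) # Count issues shared with each row of DF2

  # Use the root cause with the most shared issues, but only if at least two match
  # (which.max() picks the first row of DF2 on ties)
  if (max(matchvec) > 1) as.character(DF2$Root_Cause[which.max(matchvec)]) else NA_character_

})]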

Result

> DF1
   ID                 Issues Location Customer Root_Cause
1:  1         Issue1, Issue4        x        a         R2
2:  2 Issue2, Issue5, Issue6        y        b         R4
3:  3         Issue3, Issue4        z        c       <NA>
4:  4         Issue1, Issue5        w        d         R1

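I'm sure there is a more elegant solution, but this is what I came up with off the top of my head. Let me think about it for a while; I'll edit the answer if I come up with a more efficient structure.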
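
For readers who would rather not depend on data.table, the same idea can be sketched in base R. This is only one possible arrangement, not the answer's method: the function name `best_root_cause` and its `min_shared` parameter are invented for this example, and ties are resolved by taking the first row of DF2 with the highest overlap.

```r
DF1 <- data.frame(
  ID = 1:4,
  Issues = c("Issue1, Issue4", "Issue2, Issue5, Issue6",
             "Issue3, Issue4", "Issue1, Issue5"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  Root_Cause = c("R1", "R2", "R3", "R4"),
  List_of_Issues = c("Issue1, Issue3, Issue5", "Issue2, Issue1, Issue4",
                     "Issue6, Issue7", "Issue5, Issue6"),
  stringsAsFactors = FALSE
)

# Split every comma-separated string into a vector of trimmed tokens.
split_issues <- function(x) {
  lapply(strsplit(as.character(x), ",", fixed = TRUE), trimws)
}

# Hypothetical helper: for each element of `issues`, return the root cause
# whose issue list shares the most tokens, or NA if fewer than `min_shared` match.
best_root_cause <- function(issues, issue_lists, root_causes, min_shared = 2L) {
  left <- split_issues(issues)
  right <- split_issues(issue_lists)
  root_causes <- as.character(root_causes)
  vapply(left, function(tokens) {
    shared <- vapply(right, function(cand) length(intersect(tokens, cand)), integer(1))
    if (max(shared) >= min_shared) root_causes[which.max(shared)] else NA_character_
  }, character(1))
}

DF1$Root_Cause <- best_root_cause(DF1$Issues, DF2$List_of_Issues, DF2$Root_Cause)
DF1
```

On the sample data this gives R2, R4, NA, R1 for rows 1 to 4, the same as DF3. Splitting both columns once up front avoids re-splitting DF2 for every row of DF1, though the comparison is still quadratic in the number of rows.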
