I have two data frames, DF1 and DF2, like these.
ID = c(1, 2, 3, 4)
Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
DF1 = data.frame(ID, Issues, Location, Customer)
Root_Cause = c('R1', 'R2', 'R3', 'R4')
List_of_Issues = c('Issue1, Issue3, Issue5', 'Issue2, Issue1, Issue4', 'Issue6, Issue7', 'Issue5, Issue6')
DF2 = data.frame(Root_Cause, List_of_Issues)
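I want to compare the two data frames using the "Issues" column of DF1 and the "List_of_Issues" column of DF2. If a row of DF1 shares two or more issues with an entry in DF2's "List_of_Issues" column, I want to fill in the corresponding "Root_Cause". The resulting data frame should look like DF3.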
ID = c(1, 2, 3, 4)
Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
Root_Cause = c('R2', 'R4', NA, 'R1')
DF3 = data.frame(ID, Issues, Location, Customer, Root_Cause)
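To make the matching rule concrete, here is a minimal sketch in base R that counts the issues shared by one DF1 row and one DF2 entry. The variable names (row_issues, cause_issues, shared) are made up for illustration, and the "at least two shared issues" threshold is inferred from the expected DF3 above.

```r
# Row 1 of DF1 and the R2 entry of DF2, split into individual issues
row_issues   <- strsplit('Issue1, Issue4', ', ', fixed = TRUE)[[1]]
cause_issues <- strsplit('Issue2, Issue1, Issue4', ', ', fixed = TRUE)[[1]]

# Issues that appear in both vectors
shared <- intersect(row_issues, cause_issues)

# Assumed rule: assign the root cause only when at least two issues are shared
is_match <- length(shared) >= 2
print(shared)
print(is_match)
```

Here two issues are shared, so row 1 is assigned R2, which agrees with DF3.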
Using data.table:
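library(data.table)
setDT(DF1)
setDT(DF2)
# Split each List_of_Issues entry into a character vector (as.character guards against factor columns)
match_dir = strsplit(as.character(DF2$List_of_Issues), split = ', ', fixed = TRUE)
DF1[, Root_Cause := sapply(as.character(Issues), function(x) {
  issues = strsplit(x, split = ', ', fixed = TRUE)[[1]] # Split this row's issues
  matchvec = sapply(match_dir, function(z) length(intersect(issues, z))) # Count issues shared with each List_of_Issues entry in DF2
  # Keep the root cause with the most shared issues, but only if at least two are shared
  if (max(matchvec) > 1) as.character(DF2$Root_Cause[which.max(matchvec)]) else NA_character_
}, USE.NAMES = FALSE)]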
Result:
> DF1
   ID                 Issues Location Customer Root_Cause
1:  1         Issue1, Issue4        x        a         R2
2:  2 Issue2, Issue5, Issue6        y        b         R4
3:  3         Issue3, Issue4        z        c       <NA>
4:  4         Issue1, Issue5        w        d         R1
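I'm sure there is a more elegant solution, but this is what I came up with from scratch. Let me think about it for a while; I will edit this answer if I come up with a more efficient structure.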
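One possible alternative, offered only as a sketch and not as the answer's own method, is to build the full matrix of shared-issue counts once in base R and then pick the best column per row. The names split_issues, shared, best and best_count are hypothetical. Ties go to the first matching row of DF2, which mirrors which.max above.

```r
DF1 <- data.frame(
  ID = c(1, 2, 3, 4),
  Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5'),
  Location = c('x', 'y', 'z', 'w'),
  Customer = c('a', 'b', 'c', 'd')
)
DF2 <- data.frame(
  Root_Cause = c('R1', 'R2', 'R3', 'R4'),
  List_of_Issues = c('Issue1, Issue3, Issue5', 'Issue2, Issue1, Issue4', 'Issue6, Issue7', 'Issue5, Issue6')
)

# Hypothetical helper: split comma-separated issue strings into a list of vectors
split_issues <- function(x) strsplit(as.character(x), ', ', fixed = TRUE)
l1 <- split_issues(DF1$Issues)
l2 <- split_issues(DF2$List_of_Issues)

# shared[i, j] = number of issues that row i of DF1 has in common with row j of DF2
shared <- matrix(
  unlist(lapply(l2, function(z) vapply(l1, function(i) length(intersect(i, z)), integer(1)))),
  nrow = length(l1)
)

# Best-matching DF2 row for each DF1 row (first one wins on ties)
best <- max.col(shared, ties.method = 'first')
best_count <- shared[cbind(seq_along(best), best)]

# Assign the root cause only when at least two issues are shared
DF3 <- DF1
DF3$Root_Cause <- ifelse(best_count >= 2, as.character(DF2$Root_Cause)[best], NA_character_)
print(DF3)
```

This keeps the same pairwise comparison cost as the sapply version, so it is a restructuring rather than a guaranteed speed-up. For large inputs, a long-format join on individual issues may scale better.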