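Given the following two vectors, is there a way to produce the desired data frame? This represents a real-world situation in which I have to wrangle two data frames: the first contains a column of database values (keys), and the second contains a column of more than 1,000 rows, each holding a file name (potentials) that needs to be matched against the keys. The problem is that more than one file (potential) can match any given key. I have tried grep, merge, inner join, and so on, but could not combine them into a single solution. Any suggestions are appreciated!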
potentials <- c("tigerINTHENIGHT",
"tigerWALKINGALONE",
"bearOHMY",
"bearWITHME",
"rat",
"imatchnothing")
keys <- c("tiger",
"bear",
"rat")
desired <- data.frame(keys, c("tigerINTHENIGHT, tigerWALKINGALONE", "bearOHMY, bearWITHME", "rat"))
names(desired) <- c("key", "matches")
Pseudocode for what I think the solution looks like:
#new column which is comma separated potentials
# x being the substring length i.e. x = 4 means true if first 4 letters match
function createNewColumn(keys, potentials, x){
str result = na
foreach(key in keys){
if(substring(key, 0, x) == any(substring(potentials, 0 ,x))){ //search entire potential vector
result += potential that matched + ', '
}
}
return new column with result as the value on the current row
}
You can do this with grep:
> Match <- sapply(keys, function(item) {
paste0(grep(item, potentials, value = TRUE), collapse = ", ")
} )
> data.frame(keys, Match, row.names = NULL)
   keys                              Match
1 tiger tigerINTHENIGHT, tigerWALKINGALONE
2  bear               bearOHMY, bearWITHME
3   rat                                rat
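One caveat with the grep approach above, offered as a sketch rather than as part of the original answer: a key that matches nothing comes back as an empty string, because paste0 with collapse on a zero-length vector returns "". The key "wolf" and the names keys_extra and result are assumptions added purely for illustration.

```r
potentials <- c("tigerINTHENIGHT", "tigerWALKINGALONE", "bearOHMY",
                "bearWITHME", "rat", "imatchnothing")
# "wolf" is a hypothetical key that matches no file name
keys_extra <- c("tiger", "bear", "rat", "wolf")

Match <- sapply(keys_extra, function(item) {
  paste0(grep(item, potentials, value = TRUE), collapse = ", ")
})

# Keys without any match come back as ""; recode them as NA if that is preferred
Match[Match == ""] <- NA_character_
result <- data.frame(keys = keys_extra, Match, row.names = NULL)
```

Whether "" or NA is the better marker depends on what the downstream processing expects.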
We can write a small function that extracts the matches by looping over the keys:
return_matches <- function(keys, potentials, fixed = TRUE) {
vapply(keys, function(k) {
paste(grep(k, potentials, value = TRUE, fixed = fixed), collapse = ", ")
}, FUN.VALUE = character(1))
}
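vapply is just a type-safe version of sapply: with FUN.VALUE = character(1), it will never return anything other than a character vector. With fixed = TRUE the function runs faster, but the keys are no longer interpreted as regular expressions. We can then easily build the desired data.frame: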
df <- data.frame(
key = keys,
matches = return_matches(keys, potentials),
stringsAsFactors = FALSE
)
df
#>         key                            matches
#> tiger tiger tigerINTHENIGHT, tigerWALKINGALONE
#> bear   bear               bearOHMY, bearWITHME
#> rat     rat                                rat
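To illustrate what the fixed argument changes, here is a minimal sketch. The file names and the key "v1.2" are hypothetical, chosen only because "." is a regex metacharacter; return_matches is repeated so the snippet runs on its own.

```r
return_matches <- function(keys, potentials, fixed = TRUE) {
  vapply(keys, function(k) {
    paste(grep(k, potentials, value = TRUE, fixed = fixed), collapse = ", ")
  }, FUN.VALUE = character(1))
}

# Hypothetical data: the key contains ".", which matches any character in a regex
files <- c("v1.2_report", "v102_report")

# fixed = TRUE treats the key literally, so only the first file matches
literal <- return_matches("v1.2", files, fixed = TRUE)

# fixed = FALSE treats the key as a regex, so "." also matches the "0"
as_regex <- return_matches("v1.2", files, fixed = FALSE)
```

If the keys come straight from a database and may contain punctuation, literal matching is probably the safer default.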
The only reason to put the loop in a function rather than running it directly is to make the code look cleaner.
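Note that grep finds the key anywhere in the file name, whereas the pseudocode in the question compares only the first x characters. A rough sketch of that prefix comparison in base R might look like this; prefix_matches and prefixed are hypothetical names, and x = 4 follows the example given in the pseudocode comment.

```r
potentials <- c("tigerINTHENIGHT", "tigerWALKINGALONE", "bearOHMY",
                "bearWITHME", "rat", "imatchnothing")
keys <- c("tiger", "bear", "rat")

# x is the number of leading characters compared, as in the question's pseudocode
prefix_matches <- function(keys, potentials, x = 4) {
  vapply(keys, function(k) {
    # keep the potentials whose first x characters equal the key's first x characters
    hits <- potentials[substr(potentials, 1, x) == substr(k, 1, x)]
    paste(hits, collapse = ", ")
  }, FUN.VALUE = character(1))
}

prefixed <- prefix_matches(keys, potentials, x = 4)
```

One quirk of this rule is that a key shorter than x only matches potentials that agree with it over their first x characters, so "rat" would not match a hypothetical "ratRACE" when x = 4. If the whole key should simply be a prefix, startsWith(potentials, k) is a possible alternative.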