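I have a data frame with a column that contains text strings. I would like to extract certain keywords from those strings. Each keyword may appear once, several times, or not at all in a given string. When keywords are found, I want R to return a new column containing them.
Here is my toy example: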
#Opinions on Color
v <- c("red is cool", "I prefer blue", "yellow is better than blue", "orange is controversial", "what are colors", "sometimes I like pink and sometimes it's blue")
#Pull out Color Discussed
text <- paste0(c("red", "blue", "yellow", "orange", "pink"), collapse = '|')
What I expect:
[1] "red" "blue" "yellow,blue" "orange" "NA" "pink,blue"
I have been trying to use grepl. The code below returns "red" where I want it, but I am struggling to get it to return every distinct color.
ifelse((grepl("red",v)), "red", "NA")
[1] "red" "NA" "NA" "NA" "NA" "NA"
I also tried an if()else() statement, but got the error "Error in if: the condition has length > 1".
if(grepl("red",v)){paste("red")
}else if(grepl("blue",v)){paste("blue")
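}else{paste("NA")}
My last idea was to find the position of each keyword within the string and then extract the word at that position, but I have not found an elegant way to do that.
Any suggestions?
A possible solution: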
v <- c("red is cool", "I prefer blue", "yellow is better than blue",
"orange is controversial", "what are colors",
"sometimes I like pink and sometimes it's blue")
matches <- gregexpr("red|blue|yellow|orange|pink", v)
sapply(regmatches(v, matches), \(x) if(length(x)) paste0(x, collapse=", ") else NA)
#> [1] "red" "blue" "yellow, blue" "orange" NA
#> [6] "pink, blue"
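Since the question asks for a new column in a data frame, the same gregexpr/regmatches idea can be wrapped in a small helper. This is only a sketch: the data frame `df`, the column names `opinion` and `colors`, and the helper `extract_keywords` are hypothetical names made up for illustration.

```r
# Hypothetical data frame holding the strings from the question
df <- data.frame(
  opinion = c("red is cool", "I prefer blue", "yellow is better than blue",
              "orange is controversial", "what are colors",
              "sometimes I like pink and sometimes it's blue"),
  stringsAsFactors = FALSE
)

keywords <- c("red", "blue", "yellow", "orange", "pink")
pattern  <- paste(keywords, collapse = "|")

# Hypothetical helper: returns one comma-separated string per input element,
# or NA when no keyword is found
extract_keywords <- function(x, pattern) {
  found <- regmatches(x, gregexpr(pattern, x))
  vapply(found,
         function(m) if (length(m)) paste(m, collapse = ", ") else NA_character_,
         character(1))
}

df$colors <- extract_keywords(df$opinion, pattern)
print(df$colors)
```

vapply is used instead of sapply so that the result is always a character vector, even for empty input.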
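Add word-boundary markers to the regular expression so that, for example, "red" does not match inside "fred". Then use strapplyc to extract the matches and toString to combine them into a comma-separated string. Finally, convert "" to NA.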
library(gsubfn)
pat <- paste0("\\b(", text, ")\\b")
strapplyc(v, pat, engine = "R") |>
sapply(toString) |>
sub("^$", NA, x = _)
## [1] "red" "blue" "yellow, blue" "orange" NA
## [6] "pink, blue"
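The effect of the word boundaries can also be checked in base R without gsubfn. The sketch below uses a made-up test vector `v2` (not part of the question) to compare the plain pattern with the bounded one.

```r
# Made-up strings in which a keyword is embedded inside a longer word
v2   <- c("fred is here", "red is cool", "blueish and pink")
text <- paste0(c("red", "blue", "yellow", "orange", "pink"), collapse = "|")
pat  <- paste0("\\b(", text, ")\\b")

# Without boundaries, "red" is found inside "fred" and "blue" inside "blueish"
plain <- regmatches(v2, gregexpr(text, v2))

# With boundaries, only whole-word keywords are kept
bounded <- regmatches(v2, gregexpr(pat, v2))

print(plain)
print(bounded)
```

Whether the boundaries are wanted depends on the data: they also stop "reds" or "blueish" from counting as a color mention.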
In base R, using regular expressions:
sapply(`is.na<-`(x <- regmatches(v, gregexpr(text, v)), !lengths(x)), toString)
[1] "red" "blue" "yellow, blue" "orange" "NA"
[6] "pink, blue"
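Note that in the output above the fifth element is the string "NA", because toString(NA) returns "NA". If a real missing value is preferred, and repeated keywords should be reported only once, a variant along these lines might work. It is a sketch on a made-up vector `v3`, not the answer's own code.

```r
# Made-up strings: one match, no match, and a repeated keyword
v3   <- c("red is cool", "what are colors", "blue, then blue again")
text <- paste0(c("red", "blue", "yellow", "orange", "pink"), collapse = "|")

x <- regmatches(v3, gregexpr(text, v3))

# Collapse the distinct matches per element; no match gives a true NA
res <- vapply(x,
              function(m) if (length(m)) toString(unique(m)) else NA_character_,
              character(1))
print(res)

# toString(NA) is an ordinary string, so is.na() does not flag it
print(is.na(toString(NA)))
```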
Another option:
pat <- sprintf("(\\b ?(?!%s)\\w+\\W*)+", text)
x <- trimws(gsub(pat, ",", v, perl=TRUE),,',')
is.na(x) <- !nzchar(x)
x
#> [1] "red" "blue" "yellow,blue" "orange" NA
#> [6] "pink,blue"
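The pattern above works the other way round: the negative lookahead matches runs of words that are not keywords and replaces each run with a comma. Splitting the one-liner into steps may make that easier to follow. The names `step1` and `step2` are introduced here only for illustration.

```r
v <- c("red is cool", "I prefer blue", "yellow is better than blue",
       "orange is controversial", "what are colors",
       "sometimes I like pink and sometimes it's blue")
text <- paste0(c("red", "blue", "yellow", "orange", "pink"), collapse = "|")
pat  <- sprintf("(\\b ?(?!%s)\\w+\\W*)+", text)

# Step 1: each run of non-keyword words is replaced by a single comma
step1 <- gsub(pat, ",", v, perl = TRUE)
print(step1)

# Step 2: strip leading and trailing commas
# (the `whitespace` argument of trimws() requires R >= 3.6.0)
step2 <- trimws(step1, which = "both", whitespace = ",")

# Step 3: strings that are now empty contained no keyword, so mark them NA
is.na(step2) <- !nzchar(step2)
print(step2)
```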
Created on 2023-02-22 with reprex v2.0.2
We can extract the desired words, paste them together with toString, and finally replace the empty strings with NA:
library(purrr)
library(stringr)
v %>%
map_vec(str_extract_all, text) %>%
map_vec(toString) %>%
replace(., . == "", NA)
[1] "red" "blue" "yellow, blue"
[4] "orange" NA "pink, blue"
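Because str_extract_all is already vectorised over its input and returns a list with one character vector per string, a variant without purrr is also possible. This is a sketch under the assumption that only stringr is installed; `found` and `res` are illustrative names.

```r
library(stringr)

v <- c("red is cool", "I prefer blue", "yellow is better than blue",
       "orange is controversial", "what are colors",
       "sometimes I like pink and sometimes it's blue")
text <- paste0(c("red", "blue", "yellow", "orange", "pink"), collapse = "|")

# One character vector of matches per input string
found <- str_extract_all(v, text)

# Collapse each vector; toString(character(0)) gives "", which becomes NA
res <- vapply(found, toString, character(1))
res[res == ""] <- NA_character_
print(res)
```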