Find text strings without a pattern and return the text found in R

Question (votes: 0, answers: 4)

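I have a data frame with a column that contains text strings. From these strings I want to extract certain keywords. Each keyword may appear once, several times, or not at all in a given string. When keywords are found, I want R to return a new column containing them.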

Here is my toy example:

#Opinions on Color
v <- c("red is cool", "I prefer blue", "yellow is better than blue", "orange is controversial", "what are colors", "sometimes I like pink and sometimes it's blue")

#Pull out Color Discussed
text <- paste0(c("red", "blue", "yellow", "orange", "pink"), collapse = '|')

What I expect:
[1] "red"    "blue"    "yellow,blue"     "orange"    "NA"      "pink,blue"  

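I have been trying to use grepl. The code below returns "red" where I want it, but I am struggling to get it to return every distinct color found in each string.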

ifelse((grepl("red",v)), "red", "NA")

[1] "red" "NA"  "NA"  "NA"  "NA"  "NA"

I also tried an if()/else() statement, but ran into the error "the condition has length > 1":

if(grepl("red",v)){paste("red")
}else if(grepl("blue",v)){paste("blue")
}else{paste("NA")}
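The error comes from if() expecting a single TRUE or FALSE, while grepl("red", v) returns one logical value per element of v. A minimal vectorized sketch of the same idea, assuming fixed (non-regex) keywords, is shown here; the names colors, hits and found are illustrative and not from the original post.

```r
v <- c("red is cool", "I prefer blue", "yellow is better than blue",
       "orange is controversial", "what are colors",
       "sometimes I like pink and sometimes it's blue")
colors <- c("red", "blue", "yellow", "orange", "pink")

# hits[i, j] is TRUE when colors[j] occurs somewhere in v[i]
hits <- sapply(colors, grepl, x = v, fixed = TRUE)

# For each row, join the keywords that were found, or return NA if none were
found <- apply(hits, 1, function(row) {
  if (any(row)) paste(colors[row], collapse = ",") else NA_character_
})
found
```

Note that this lists keywords in the order of the colors vector rather than in the order they appear in each string, and it does not report a keyword twice if it occurs twice.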

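My last idea was to find a way to locate the position of each keyword in the string and then extract the word at that position, but I have not found an elegant way to do it.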

Any suggestions?
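The "locate the position, then extract" idea from the question can be sketched in base R with gregexpr, which reports where each match starts and how long it is. The helper name extract_at_positions is hypothetical and only serves the illustration; regmatches does the same job more compactly.

```r
# Hypothetical helper: return every match of `pattern` in a single string `s`
extract_at_positions <- function(s, pattern) {
  m <- gregexpr(pattern, s)[[1]]
  starts <- as.integer(m)            # start position of each match, -1 if none
  lens <- attr(m, "match.length")    # length of each match
  if (starts[1] == -1L) return(character(0))
  substring(s, starts, starts + lens - 1L)
}

pattern <- "red|blue|yellow|orange|pink"
extract_at_positions("sometimes I like pink and sometimes it's blue", pattern)
extract_at_positions("what are colors", pattern)
```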

r grepl
4 answers
4
votes

A possible solution:

v <- c("red is cool", "I prefer blue", "yellow is better than blue", 
       "orange is controversial", "what are colors", 
       "sometimes I like pink and sometimes it's blue")

matches <- gregexpr("red|blue|yellow|orange|pink", v)

sapply(regmatches(v, matches), \(x) if(length(x)) paste0(x, collapse=", ") else NA) 
#> [1] "red"          "blue"         "yellow, blue" "orange"       NA            
#> [6] "pink, blue"
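Since the question asks for the result as a new column, the same approach can be applied to a data frame. This is only a sketch: the data frame df and its column names opinion and colors are assumptions, as the original post does not show the real data.

```r
v <- c("red is cool", "I prefer blue", "yellow is better than blue",
       "orange is controversial", "what are colors",
       "sometimes I like pink and sometimes it's blue")

# Hypothetical data frame holding the text in a column called `opinion`
df <- data.frame(opinion = v, stringsAsFactors = FALSE)

# One character vector of matches per row (empty when nothing matches)
found <- regmatches(df$opinion,
                    gregexpr("red|blue|yellow|orange|pink", df$opinion))

# Collapse each set of matches to one string; rows without a match get NA
df$colors <- vapply(
  found,
  function(x) if (length(x)) paste(x, collapse = ", ") else NA_character_,
  character(1)
)
df
```

vapply with character(1) is used here instead of sapply so that the result is guaranteed to be a character vector with one entry per row.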

2
votes

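Add word-boundary markers to the regular expression so that, for example, "red" does not match "fred", then use strapplyc to extract the matches and toString to combine them into a comma-separated string. Finally, convert "" to NA.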

library(gsubfn)

pat <- paste0("\\b(", text, ")\\b")
strapplyc(v, pat, engine = "R") |> 
  sapply(toString) |> 
  sub("^$", NA, x = _)

## [1] "red"          "blue"         "yellow, blue" "orange"       NA            
## [6] "pink, blue"  
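The effect of the word-boundary markers can be checked in base R without gsubfn. The sketch below uses a made-up input vector x and perl = TRUE as one way to get \\b support; it compares the pattern with and without boundaries.

```r
# Made-up examples: "fred" contains "red" but is not the word "red"
x <- c("fred is here", "red is cool", "a red-blue mix")

# Without boundaries, "red" is found inside "fred"
plain <- regmatches(x, gregexpr("red|blue", x, perl = TRUE))

# With \\b on both sides, only whole words match
bounded <- regmatches(x, gregexpr("\\b(red|blue)\\b", x, perl = TRUE))

plain[[1]]    # match found inside "fred"
bounded[[1]]  # no match
bounded[[3]]  # the hyphen counts as a boundary, so both words match
```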

2
votes

In base R, using regular expressions:

sapply(`is.na<-`(x <- regmatches(v, gregexpr(text, v)), !lengths(x)), toString)
[1] "red"          "blue"         "yellow, blue" "orange"       "NA"          
[6] "pink, blue"  
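The fifth element above is the string "NA" rather than a missing value, because toString(NA) returns "NA". If a real NA is preferred, one option is to collapse first and mark the empty results afterwards; the variable name as_text is illustrative.

```r
v <- c("red is cool", "I prefer blue", "yellow is better than blue",
       "orange is controversial", "what are colors",
       "sometimes I like pink and sometimes it's blue")
text <- paste0(c("red", "blue", "yellow", "orange", "pink"), collapse = "|")

x <- regmatches(v, gregexpr(text, v))

# toString() on an empty match gives "", so nothing is turned into "NA" here
as_text <- sapply(x, toString)

# Mark elements that had no match as missing
as_text[!lengths(x)] <- NA
as_text
```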

Another option:

pat <- sprintf("(\\b ?(?!%s)\\w+\\W*)+", text)
x <- trimws(gsub(pat, ",", v, perl=TRUE),,',')
is.na(x) <- !nzchar(x)
x
#> [1] "red"         "blue"        "yellow,blue" "orange"      NA           
#> [6] "pink,blue"

Created on 2023-02-22 with reprex v2.0.2


1
vote

We can extract the desired words, paste them together with toString, and finally replace the empty strings with NA.

library(purrr)
library(stringr)

v %>%
    map_vec(str_extract_all, text) %>%
    map_vec(toString) %>%
    replace(., . == "", NA)

[1] "red"          "blue"         "yellow, blue"
[4] "orange"       NA             "pink, blue" 