Function to find multiple matches with a string pattern


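I have a data frame with about 200,000 observations. I am focusing on one column, which contains abstracts from scientific journals. I am trying to extract plant species names from these abstracts.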

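The genera have already been extracted from the abstracts, so I can use the genus to find the species name, because the species name directly follows the genus in the abstract (e.g. Genus species). The problem is that these articles contain thousands of genera, and I would have to build a typical pattern such as: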

pattern = Malus|Gentiana|Acer|Quercus 

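Writing a lookaround for thousands of genera by hand is not practical. Is there a way (perhaps a function) to keep the lookbehind in the pattern and substitute in the genera (currently stored as a single-column data frame) to pull out the matches?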

What I want: example sentences as they appear in the abstracts

Ex 1.

axillary bud cultures were initiated from 3 types of nodal explants of lagerstroemia parviflora. 

Ex 2.

the influence of headspace ethylene on anthocyanin, anthocyanidin, and carotenoid accumulation was studied in suspension cultures of vaccinium pahalae.

I want a pattern that looks ahead from

lagerstroemia
vaccinium

to get

lagerstroemia parviflora
vaccinium pahalae

and to do this for thousands of other genera, extracting matches in the form "genus species".

r regex stringr stringi
1 Answer

Assuming each sentence comes with one specific genus, as in

ex <- data.frame(sentence = c("axillary bud cultures were initiated from 3 types of nodal explants of lagerstroemia parviflora.", "the influence of headspace ethylene on anthocyanin, anthocyanidin, and carotenoid accumulation was studied in suspension cultures of vaccinium pahalae."), genus = c("Lagerstroemia", "Vaccinium"))
str(ex)
# 'data.frame': 2 obs. of  2 variables:
#  $ sentence: chr  "axillary bud cultures were initiated from 3 types of nodal explants of lagerstroemia parviflora." "the influence of headspace ethylene on anthocyanin, anthocyanidin, and carotenoid accumulation was studied in s"| __truncated__
#  $ genus   : chr  "Lagerstroemia" "Vaccinium"

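You can then find all "genus species" pairs, assuming the two words are always separated by a single space. Map (or purrr::map or similar) would work too, but the call below relies on stringr being vectorized over both the string and the pattern, so each sentence is paired with its own genus.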

stringr::str_extract_all(ex$sentence, stringr::regex(paste0(ex$genus, " \\S+"), ignore_case = TRUE))
# [[1]]
# [1] "lagerstroemia parviflora."
# [[2]]
# [1] "vaccinium pahalae."
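Note that \\S+ also captures trailing punctuation, as the period in the output above shows. A minimal sketch of one way to avoid that, assuming species epithets consist only of letters (hyphenated epithets would need a wider character class). The names species_pat and res are introduced here for illustration, and the second sentence is shortened:

```r
library(stringr)

sentence <- c(
  "axillary bud cultures were initiated from 3 types of nodal explants of lagerstroemia parviflora.",
  "the influence of headspace ethylene was studied in suspension cultures of vaccinium pahalae."
)
genus <- c("Lagerstroemia", "Vaccinium")

# One pattern per sentence: the genus, a single space, then letters only,
# so the match stops before the final period.
species_pat <- regex(paste0(genus, " [a-z]+"), ignore_case = TRUE)

res <- str_extract_all(sentence, species_pat)
print(res)
```

The result is still a list with one character vector per sentence, so it can be used wherever the original call was.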

If you are sure there is never more than one match per sentence, you can do this instead:

stringr::str_extract(ex$sentence, stringr::regex(paste0(ex$genus, " \\S+"), ignore_case = TRUE))
# [1] "lagerstroemia parviflora." "vaccinium pahalae."       

Note the change from str_extract_all to str_extract. This returns NA if no species is found.
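To illustrate the NA case, here is a small sketch using a made-up second sentence in which the genus is not followed by a species. The shortened sentences and the names hits, found and out are assumptions for the example only:

```r
library(stringr)

sentence <- c(
  "nodal explants of lagerstroemia parviflora.",
  "the genus vaccinium"  # hypothetical: nothing follows the genus
)
genus <- c("Lagerstroemia", "Vaccinium")

hits <- str_extract(sentence, regex(paste0(genus, " \\S+"), ignore_case = TRUE))

# Keep only the rows where a "genus species" pair was found
found <- !is.na(hits)
out <- data.frame(genus = genus[found], match = hits[found])
print(out)
```

Filtering on is.na() keeps the result aligned with the original rows, which is harder to do with the list returned by str_extract_all.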

Since stringr::str_extract* is vectorized, this should scale well.
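The question mentions that the genera sit in a single-column data frame rather than next to each abstract. A sketch for that case, assuming a hypothetical data frame genera with a column genus, and assuming the genus names contain no regex metacharacters: collapse the names into one alternation and let the regex engine try them all. The second abstract, including "malus domestica", is made up for illustration, and this has not been benchmarked with thousands of genera:

```r
library(stringr)

genera <- data.frame(genus = c("Malus", "Lagerstroemia", "Vaccinium"))
abstracts <- c(
  "nodal explants of lagerstroemia parviflora.",
  "suspension cultures of vaccinium pahalae and malus domestica were compared."
)

# Builds: \b(?:Malus|Lagerstroemia|Vaccinium) [a-z]+
# \b keeps a genus from matching in the middle of a longer word.
alt <- paste0("\\b(?:", paste(genera$genus, collapse = "|"), ") [a-z]+")

matches <- str_extract_all(abstracts, regex(alt, ignore_case = TRUE))
print(matches)
```

Unlike the per-row approach, this finds any listed genus in any abstract, so an abstract mentioning several species yields several matches.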
