Function for a string pattern to find multiple matches

Question

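I have a data frame with approximately 200,000 observations. I am specifically interested in a column that contains abstracts from scientific journals. I am trying to extract plant species names from these abstracts.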

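The genus has already been extracted from each abstract, so I can use the genus to find the species name, because the species name directly follows the genus in the abstract (e.g. Genus species). The problem is that there are thousands of genera across these articles, and creating a typical pattern such as...

pattern = Malus|Gentiana|Acer|Quercus

with a lookaround for thousands of genera is not reasonable. Is there a way (perhaps a function) to keep the lookbehind in the pattern and substitute in the genera (they are currently stored in a single-column data frame) to pull out the matches?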

What I want... Example sentences as they appear in the abstracts

Ex 1.

axillary bud cultures were initiated from 3 types of nodal explants of lagerstroemia parviflora.

Ex 2.

the influence of headspace ethylene on anthocyanin, anthocyanidin, and carotenoid accumulation was studied in suspension cultures of vaccinium pahalae.

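I want a pattern that looks ahead from lagerstroemia and vaccinium to get lagerstroemia parviflora and vaccinium pahalae, and that does the same for 1000s of other genera, extracting matches in "Genus species" format.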

Answer 1

Assuming each sentence has a specific genus associated with it, such as

ex <- data.frame(sentence = c("axillary bud cultures were initiated from 3 types of nodal explants of lagerstroemia parviflora.", "the influence of headspace ethylene on anthocyanin, anthocyanidin, and carotenoid accumulation was studied in suspension cultures of vaccinium pahalae."), genus = c("Lagerstroemia", "Vaccinium"))
str(ex)
# 'data.frame': 2 obs. of  2 variables:
#  $ sentence: chr  "axillary bud cultures were initiated from 3 types of nodal explants of lagerstroemia parviflora." "the influence of headspace ethylene on anthocyanin, anthocyanidin, and carotenoid accumulation was studied in s"| __truncated__
#  $ genus   : chr  "Lagerstroemia" "Vaccinium"

You can then use Map (or purrr::map or similar) to find all "Genus species" pairs, assuming the two words are always separated by a single space.

# regex() is namespaced so the call also works when stringr is not attached
stringr::str_extract_all(ex$sentence, stringr::regex(paste0(ex$genus, " \\S+"), ignore_case = TRUE))
# [[1]]
# [1] "lagerstroemia parviflora."
# [[2]]
# [1] "vaccinium pahalae."

If you are sure there is never more than one match per sentence, you can do

stringr::str_extract(ex$sentence, stringr::regex(paste0(ex$genus, " \\S+"), ignore_case = TRUE))
# [1] "lagerstroemia parviflora." "vaccinium pahalae."

Note the change from str_extract_all to str_extract. This returns NA if no species is found.
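The matches above keep the trailing period because \S+ matches any non-space character. A minimal sketch of one way to avoid that, assuming species epithets consist of letters only (an assumption not stated in the question), is to match letters instead and to anchor the genus at a word boundary:

```r
library(stringr)

ex <- data.frame(
  sentence = c(
    "axillary bud cultures were initiated from 3 types of nodal explants of lagerstroemia parviflora.",
    "the influence of headspace ethylene on anthocyanin, anthocyanidin, and carotenoid accumulation was studied in suspension cultures of vaccinium pahalae."
  ),
  genus = c("Lagerstroemia", "Vaccinium")
)

# One pattern per row: word boundary, the genus, a single space, then letters only.
# Assumption: the epithet contains no hyphens, digits, or other non-letter characters.
species_pattern <- regex(paste0("\\b", ex$genus, " [[:alpha:]]+"), ignore_case = TRUE)

# Vectorised over both sentences and patterns; rows without a match yield NA
ex$species <- str_extract(ex$sentence, species_pattern)
ex$species
```

Hyphenated epithets would be cut at the hyphen under this assumption, so the character class may need widening for real data.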

Since the stringr::str_extract* functions are vectorized, this should scale well.
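If the genera are available only as a separate single-column data frame, as described in the question, rather than alongside each sentence, the alternation can be built programmatically instead of typed by hand. The following is only a sketch: the genera data frame and its genus column are hypothetical names, the letters-only epithet is an assumption, and how well a single pattern with thousands of alternatives performs has not been tested here.

```r
library(stringr)

# Hypothetical single-column data frame of genera
genera <- data.frame(
  genus = c("Malus", "Gentiana", "Acer", "Quercus", "Lagerstroemia", "Vaccinium")
)

abstracts <- c(
  "axillary bud cultures were initiated from 3 types of nodal explants of lagerstroemia parviflora.",
  "the influence of headspace ethylene on anthocyanin, anthocyanidin, and carotenoid accumulation was studied in suspension cultures of vaccinium pahalae."
)

# Collapse every genus into one alternation: Malus|Gentiana|...
alternation <- paste(genera$genus, collapse = "|")

# Word boundary, any one genus, a single space, then a letters-only epithet
species_pattern <- regex(
  paste0("\\b(?:", alternation, ") [[:alpha:]]+"),
  ignore_case = TRUE
)

# Returns a list with one character vector of matches per abstract
matches <- str_extract_all(abstracts, species_pattern)
matches
```

Unlike the per-row approach, this matches any listed genus in any abstract, so it can also return several species per abstract.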
