Extract sentences containing a keyword using a regex pattern

Question    Votes: 0  Answers: 2

I have a function that finds matches in a data frame (ignore the t2 line; it is turned off):

library(stringr)

find.all.matches <- function(search.col,pattern){
  captured <- str_match_all(search.col,pattern = pattern)
  t <- lapply(captured, str_trim)
  #t2 <- lapply(t, function(x) gsub("[^a-z]","",x)) ##turned off
  t3 <- sapply(t, unique)
  t4 <- lapply(t3, toString)
  found.col <- unlist(t4)
  return(found.col)
}

I am running this on a specific column of a large data set of about 20,000 rows. The column holds abstracts from scientific journals.

I use the following code to add the words extracted by pattern as a new column in the data frame:

testing2 <- find.all.matches(search.col = all_data$abstract_l, 
                             pat = pattern)

all_data$testing_mu_m <- testing2

This is the current pattern:

pattern = '\\d+(?:[.,]\\d+)*\\s*mu m\\b|ba\\b'

This picks out every number that comes before mu m, as well as ba, in the following example abstract:

a protocol for in vitro propagation of adult lavandula dentata plants has been achieved. cultures were established by placing nodal segments on murashige and skoog medium containing ba, kin, and naa. highest shoot multiplication rates were obtained when explants grown in the presence of 5.0 mu m ba or 20 mu m kin were transferred to medium with 8.8 mu m ba and 15% coconut milk. multiplication efficiency through subcultures was significantly affected by the cytokinin concentration in the initial culture medium. subculture reduced drastically the final number of shoots produced on nodal segments isolated from shoots grown in the presence of 2.0 mu m ba or 40.0 mu m kin. shoots were easily rooted on murashige and skoog hormone-free medium with macronutrients at half-strength. plants were successfully transplanted into soil. 
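For illustration, here is a minimal sketch of what the current pattern extracts. The snippet string is a shortened excerpt of the abstract above, made up for this example. Note that ba\b has no leading word boundary, so it would also match the tail of any longer word ending in ba.

```r
library(stringr)

# The pattern from the question, unchanged
pattern <- "\\d+(?:[.,]\\d+)*\\s*mu m\\b|ba\\b"

# Shortened excerpt of the example abstract (illustration only)
snippet <- "explants grown in the presence of 5.0 mu m ba or 20 mu m kin"

# str_extract_all returns a list with one character vector per input string
found <- str_extract_all(snippet, pattern)[[1]]
print(found)
# → "5.0 mu m" "ba" "20 mu m"
```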

I'm wondering: is there a way to pull out the whole sentence that contains ba? I would like a pattern that I can plug into the find.all.matches function.

Desired output:
cultures were established by placing nodal segments on murashige and skoog medium containing ba, kin, and naa
AND
highest shoot multiplication rates were obtained when explants grown in the presence of 5.0 mu m ba or 20 mu m kin were transferred to medium with 8.8 mu m ba and 15% coconut milk
AND
subculture reduced drastically the final number of shoots produced on nodal segments isolated from shoots grown in the presence of 2.0 mu m ba or 40.0 mu m kin.

r regex stringr
2 Answers
0
votes

You can use this regex to match whole sentences that contain ba:

(?<=^|\. )(?:(?!\.(?: |$)).)*?\bba\b.*?\.(?= |$)

It matches:

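  • (?<=^|\. )
    : the start of a sentence (the position is preceded by either the start of the string or a . and a space)
  • (?:(?!\.(?: |$)).)*?
    : a minimal number of characters, none of which is a . followed by a space or the end of the string (a tempered greedy token)
  • \bba\b
    : the word ba
  • .*?\.(?= |$)
    : a minimal number of characters, followed by a . and then a space or the end of the string.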

Regex demo on regex101
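As a rough sketch of how this could be used from R with stringr: the backslashes have to be doubled inside an R string. The names sentence_pattern, abstracts and ba_sentences, the two short abstracts, and the " AND " separator are assumptions for illustration, not part of the question's data.

```r
library(stringr)

# Same regex as above, with backslashes escaped for an R string literal
sentence_pattern <- "(?<=^|\\. )(?:(?!\\.(?: |$)).)*?\\bba\\b.*?\\.(?= |$)"

# Made-up stand-in for the abstract column
abstracts <- c(
  "cultures were grown on medium containing ba, kin, and naa. shoots were rooted easily. growth improved with 8.8 mu m ba and 15% coconut milk.",
  "plants were successfully transplanted into soil."
)

# One character vector of matching sentences per abstract
matches <- str_extract_all(abstracts, sentence_pattern)

# Collapse each abstract's sentences into a single string; empty if none matched
ba_sentences <- vapply(
  matches,
  function(m) paste(unique(m), collapse = " AND "),
  character(1)
)
print(ba_sentences)
```

Decimal points such as the one in 8.8 do not end a sentence here, because the pattern only treats a . as a sentence end when a space or the end of the string follows it.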


0
votes
Here is another approach:

library(stringr)

find.all.matches <- function(search.col, pattern){
  # split the text into sentences at ., ! or ? followed by a space or the end of the string
  sentences <- str_split(search.col, "[.!?](?: |$)", simplify = TRUE)
  # keep only the sentences that match the pattern
  captured <- str_subset(sentences, pattern)
  t <- lapply(captured, str_trim)
  #t2 <- lapply(t, function(x) gsub("[^a-z]","",x)) ##turned off
  t3 <- sapply(t, unique)
  t4 <- lapply(t3, toString)
  found.col <- unlist(t4)
  return(found.col)
}

example <- "a protocol for in vitro propagation of adult lavandula dentata plants has been achieved. cultures were established by placing nodal segments on murashige and skoog medium containing ba, kin, and naa. highest shoot multiplication rates were obtained when explants grown in the presence of 5.0 mu m ba or 20 mu m kin were transferred to medium with 8.8 mu m ba and 15% coconut milk. multiplication efficiency through subcultures was significantly affected by the cytokinin concentration in the initial culture medium. subculture reduced drastically the final number of shoots produced on nodal segments isolated from shoots grown in the presence of 2.0 mu m ba or 40.0 mu m kin. shoots were easily rooted on murashige and skoog hormone-free medium with macronutrients at half-strength. plants were successfully transplanted into soil."

find.all.matches(example, "\\bba\\b")
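The function above works on one abstract at a time. A possible variant that runs over a whole column might look like the sketch below. The helper name find.sentences, the " AND " separator and the small abstracts vector are made up for illustration. It splits on whitespace that follows ., ! or ?, so each sentence keeps its final punctuation.

```r
library(stringr)

# Hypothetical helper: returns one string per abstract holding the matching sentences
find.sentences <- function(search.col, pattern, sep = " AND ") {
  # Split on whitespace preceded by ., ! or ?; the lookbehind keeps the punctuation
  sentences <- str_split(search.col, "(?<=[.!?])\\s+")
  vapply(sentences, function(s) {
    hits <- unique(str_subset(str_trim(s), pattern))
    paste(hits, collapse = sep)  # "" when no sentence matches
  }, character(1))
}

# Made-up stand-in for the abstract column
abstracts <- c(
  "medium contained ba, kin, and naa. shoots rooted easily. rates were highest with 5.0 mu m ba.",
  "plants were successfully transplanted into soil."
)

result <- find.sentences(abstracts, "\\bba\\b")
print(result)

# With the question's data frame this would be, for example:
# all_data$ba_sentences <- find.sentences(all_data$abstract_l, "\\bba\\b")
```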
    