How to extract matching values from a column in a data frame when semicolons are present in R?


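I have a large data frame of published articles, and I would like to extract all articles associated with a small number of authors, who are specified in a separate list. The authors in the data frame are grouped in a single column, separated by ;. Not all authors need to match; I want to extract any article where at least one author matches the list. Here is an example.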

Title<-c("A", "B", "C")

AU<-c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke")

df<-data.frame(Title, AU)

authors<-as.character(c("Mark", "John", "Luke"))

df[sapply(strsplit((as.character(df$AU)), "; "), function(x) any(authors %in% x)),]

I would like this to return:

Title   AU
  A      Mark; John                          
  C      Sarah; Luke

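However, on my large data frame this command does not return all matching AU entries: it only returns rows that contain a single AU, not rows with multiple AUs.

Here is the dput of 5 records from my large data frame: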

structure(list(AU = c("FOOKES PG;DEARMAN WR;FRANKLIN JA", "SIMS DG;DOWNHAM MAPS;MCQUILLIN J;GARDNER PS", 
"TURNER BR", "BUTLER J;MARSH H;GOODARZI F", "OVERTON M"), TI = c("SOME ENGINEERING ASPECTS OF ROCK WEATHERING WITH FIELD EXAMPLES FROM DARTMOOR AND ELSEWHERE", 
"RESPIRATORY SYNCYTIAL VIRUS INFECTION IN NORTH-EAST ENGLAND", 
"TECTONIC AND CLIMATIC CONTROLS ON CONTINENTAL DEPOSITIONAL FACIES IN THE KAROO BASIN OF NORTHERN NATAL, SOUTH AFRICA", 
"WORLD COALS: GENESIS OF THE WORLD'S MAJOR COALFIELDS IN RELATION TO PLATE TECTONICS", 
"WEATHER AND AGRICULTURAL CHANGE IN ENGLAND, 1660-1739"), SO = c("QUARTERLY JOURNAL OF ENGINEERING GEOLOGY", 
"BRITISH MEDICAL JOURNAL", "SEDIMENTARY GEOLOGY", "FUEL", "AGRICULTURAL HISTORY"
), JI = c("Q. J. ENG. GEOL.", "BRIT. MED. J.", "SEDIMENT. GEOL.", 
"FUEL", "AGRICULTURAL HISTORY")
r subset text-mining
Answers

An option with str_extract_all from stringr:

library(dplyr)
library(stringr)
df %>%
   mutate(Names = str_extract_all(Names, str_c(authors, collapse="|"))) %>% 
   filter(lengths(Names) > 0)
#  Title      Names
#1     A Mark, John
#2     C       Luke

Data

# The answers refer to the question's AU column as Names.
df <- data.frame(Title, Names = AU)


In base R, you can do it like this:

df[sapply(strsplit(as.character(df$Names), "; "), function(x) any(authors %in% x)), ]

    Title            Names
1     A Mark; John; Paul
3     C      Sarah; Luke
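
A possible explanation for the behaviour on the large data frame, judging from the dput in the question: there the authors are separated by ";" with no following space, so splitting on "; " leaves each multi-author row as one long string that never equals a single author name. The sketch below is one way to handle both separators, assuming exact name matches are wanted. The helper name match_any_author and the author list wanted are hypothetical and chosen only for illustration.

```r
# AU values copied from the dput in the question.
big <- data.frame(
  AU = c("FOOKES PG;DEARMAN WR;FRANKLIN JA",
         "SIMS DG;DOWNHAM MAPS;MCQUILLIN J;GARDNER PS",
         "TURNER BR",
         "BUTLER J;MARSH H;GOODARZI F",
         "OVERTON M"),
  stringsAsFactors = FALSE
)

# Hypothetical author list, for illustration only.
wanted <- c("DEARMAN WR", "TURNER BR")

# Hypothetical helper: TRUE for each row in which at least one author matches.
match_any_author <- function(au, authors) {
  # ";\\s*" splits on a semicolon plus any whitespace after it,
  # so both "A; B" and "A;B" are handled.
  parts <- strsplit(as.character(au), ";\\s*")
  vapply(parts, function(x) any(trimws(x) %in% authors), logical(1))
}

hits <- match_any_author(big$AU, wanted)
big[hits, , drop = FALSE]
```

With this author list, the first row (a multi-author row) and the third row (a single-author row) are kept.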


This can be achieved by subsetting on those Names that match the pattern specified in the first argument to grepl:

df[grepl(paste0(authors, collapse = "|"), df[,2]),]
     Title Names             
[1,] "A"   "Mark; John; Paul"
[2,] "C"   "Sarah; Luke"
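
One caveat with a plain alternation pattern is that it also matches substrings, so "Luke" would match a longer name such as "Lukes". The following is a minimal sketch, assuming whole-name matches are wanted, that wraps the alternation in word boundaries. The fourth row ("D") and the name df2 are made up for illustration and are not part of the question's data.

```r
Title <- c("A", "B", "C", "D")
Names <- c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke", "Lukes; Joanna")
df2 <- data.frame(Title, Names, stringsAsFactors = FALSE)  # hypothetical example data
authors <- c("Mark", "John", "Luke")

# Plain alternation: "Mark|John|Luke"
loose <- paste0(authors, collapse = "|")
# Same alternation, but each name must be a whole word.
strict <- paste0("\\b(", loose, ")\\b")

loose_hits  <- grepl(loose, df2$Names)
strict_hits <- grepl(strict, df2$Names, perl = TRUE)

df2[strict_hits, ]
```

Here the plain pattern also flags row D because of "Lukes", while the word-boundary version keeps only rows A and C. Author names that contain regex metacharacters (for example a period in initials) would need escaping before being pasted into the pattern.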