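I have a large data frame of published articles, and I would like to extract every article associated with a small set of authors that I specify in a separate list. The authors in the data frame are grouped in a single column, separated by ";". Not all authors need to match: I want to extract any article in which at least one author matches the list. Here is an example: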
Title<-c("A", "B", "C")
AU<-c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke")
df<-data.frame(Title, AU)
authors<-as.character(c("Mark", "John", "Luke"))
df[sapply(strsplit((as.character(df$AU)), "; "), function(x) any(authors %in% x)),]
I would hope to get back:
Title AU
A Mark; John; Paul
C Sarah; Luke
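However, on my large data frame this command does not return all matching rows: it only returns rows that have a single author in AU, not rows with several authors.
Here is the dput() of 5 records from my large data frame: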
structure(list(AU = c("FOOKES PG;DEARMAN WR;FRANKLIN JA", "SIMS DG;DOWNHAM MAPS;MCQUILLIN J;GARDNER PS",
"TURNER BR", "BUTLER J;MARSH H;GOODARZI F", "OVERTON M"), TI = c("SOME ENGINEERING ASPECTS OF ROCK WEATHERING WITH FIELD EXAMPLES FROM DARTMOOR AND ELSEWHERE",
"RESPIRATORY SYNCYTIAL VIRUS INFECTION IN NORTH-EAST ENGLAND",
"TECTONIC AND CLIMATIC CONTROLS ON CONTINENTAL DEPOSITIONAL FACIES IN THE KAROO BASIN OF NORTHERN NATAL, SOUTH AFRICA",
"WORLD COALS: GENESIS OF THE WORLD'S MAJOR COALFIELDS IN RELATION TO PLATE TECTONICS",
"WEATHER AND AGRICULTURAL CHANGE IN ENGLAND, 1660-1739"), SO = c("QUARTERLY JOURNAL OF ENGINEERING GEOLOGY",
"BRITISH MEDICAL JOURNAL", "SEDIMENTARY GEOLOGY", "FUEL", "AGRICULTURAL HISTORY"
), JI = c("Q. J. ENG. GEOL.", "BRIT. MED. J.", "SEDIMENT. GEOL.",
"FUEL", "AGRICULTURAL HISTORY")
One option with str_extract_all:
library(dplyr)
library(stringr)
df %>%
mutate(Names = str_extract_all(Names, str_c(authors, collapse="|"))) %>%
filter(lengths(Names) > 0)
# Title Names
#1 A Mark, John
#2 C Luke
Data (the author column AU from the question is named Names here):
df <- data.frame(Title, Names = AU)
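If you would rather not load dplyr and stringr, roughly the same result can be sketched in base R with gregexpr() and regmatches(). This is only a minimal sketch on the toy data from the question; the extra column name Matched is my own choice, not something from the original code.

```r
# Toy data from the question (author column named Names, as in this answer)
df <- data.frame(
  Title = c("A", "B", "C"),
  Names = c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke"),
  stringsAsFactors = FALSE
)
authors <- c("Mark", "John", "Luke")

# Build one alternation pattern: "Mark|John|Luke"
pattern <- paste(authors, collapse = "|")

# gregexpr() locates every match, regmatches() extracts the matched text
hits <- regmatches(df$Names, gregexpr(pattern, df$Names))

# Keep only rows with at least one matched author
keep <- lengths(hits) > 0
result <- df[keep, ]

# Matched is a hypothetical helper column listing which authors were found
result$Matched <- vapply(hits[keep], paste, character(1), collapse = ", ")
result
```

Unlike the mutate() call above, this keeps the original Names column intact and puts the matched authors in a separate column.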
In base R, you can do it like this:
df[sapply(strsplit(as.character(df$Names), "; "), function(x) any(authors %in% x)), ]
Title Names
1 A Mark; John; Paul
3 C Sarah; Luke
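A likely reason the same command only returns single-author rows on the larger data frame: in the dput() output the authors are separated by ";" with no space after it, so splitting on "; " leaves every multi-author string as one unsplit element. Here is a minimal sketch that splits on ";" followed by optional whitespace instead. The vector wanted is a hypothetical list of authors I made up from names that appear in the sample; substitute your own.

```r
# AU column copied from the dput() in the question
AU <- c("FOOKES PG;DEARMAN WR;FRANKLIN JA",
        "SIMS DG;DOWNHAM MAPS;MCQUILLIN J;GARDNER PS",
        "TURNER BR", "BUTLER J;MARSH H;GOODARZI F", "OVERTON M")
big <- data.frame(AU = AU, stringsAsFactors = FALSE)

# Hypothetical authors of interest, spelled exactly as they appear in AU
wanted <- c("MARSH H", "TURNER BR")

# Splitting on "; " never splits these strings, so only single-author rows can match
old <- vapply(strsplit(big$AU, "; "),
              function(x) any(x %in% wanted), logical(1))

# Split on ";" plus optional whitespace, then trim each name
parts <- lapply(strsplit(big$AU, ";\\s*"), trimws)
keep  <- vapply(parts, function(x) any(x %in% wanted), logical(1))

which(old)   # rows found with the "; " separator
which(keep)  # rows found with the ";\\s*" separator
matched <- big[keep, , drop = FALSE]
```

Note that %in% needs an exact match of the whole name, so the entries in your author list have to follow the same "SURNAME INITIALS" format as the AU column.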
This can also be done by subsetting the rows whose Names value matches the pattern given as the first argument of grepl:
df[grepl(paste0(authors, collapse = "|"), df[,2]),]
Title Names
[1,] "A" "Mark; John; Paul"
[2,] "C" "Sarah; Luke"
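One caveat with the grepl approach: the pattern matches substrings, so an author such as "Mark" would also hit "Markus". A minimal sketch of one way to tighten this is to wrap the alternation in word boundaries. The Names vector below is hypothetical data I made up to show the difference.

```r
authors <- c("Mark", "John", "Luke")

# Hypothetical author strings chosen to show the substring problem
Names <- c("Mark; John; Paul", "Markus; Lily", "Sarah; Luke")

# Plain alternation: matches anywhere inside a longer name
loose <- paste0(authors, collapse = "|")

# Same alternation wrapped in word boundaries: whole names only
strict <- paste0("\\b(", loose, ")\\b")

loose_hits  <- grepl(loose, Names)                # "Markus" matches too
strict_hits <- grepl(strict, Names, perl = TRUE)  # "Markus" is rejected

loose_hits
strict_hits
```

If the author names can contain regex metacharacters such as ".", they would need escaping before being pasted into the pattern; the strsplit plus %in% approach avoids that issue altogether.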