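I am trying to process the results of a Google Form in R and am having trouble with the string data.
The question can be seen here:
Google returns the results in a single column, separating each response with a comma.
They end up looking like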
ID | Type of Research
=====================
1 | Policy analysis, Review of other research
2 | Bla
3 | Review of other research, Original empirical research
4 | Policy analysis, Theoretical
5 | Review of other research
I used grepl to create logical columns for the three preselected responses, and then a data.frame.
Private$ResearchTypeOriginal <- grepl("Original", Private$ResearchType)
Private$ResearchTypeReview <- grepl("Review", Private$ResearchType)
Private$ResearchTypePolicy <- grepl("Policy", Private$ResearchType)
ResearchTypeGrid <- data.frame(Private$ResearchTypeOriginal, Private$ResearchTypeReview, Private$ResearchTypePolicy)
This works well. However, I also need to pull out the "other" responses. I am using
ResearchTypeOther <- subset(Private, !grepl("Original", Private$ResearchType) & !grepl("Review", Private$ResearchType) & !grepl("Policy", Private$ResearchType), select=c(ID, ResearchType, PubLang, Reviewer))
ResearchTypeOther <- na.omit(ResearchTypeOther)
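But I just realized that a response containing both a preselected answer and an open-ended one gets lost with this approach. It gives me the "Bla" response just fine, but only those responses that are entirely "other".
In other words, this produces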
ID | Type of Research
=======================
2 | Bla
But what I want is
ID | Type of Research
======================
2 | Bla
4 | Policy analysis, Theoretical
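This is my first post on SO, and I am obviously new to R, so please forgive any mistakes in how I have asked the question. I apologize if I have not expressed this well. I have about 20 other questions with the same issue, so I need a flexible solution.
Thanks for your help.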
You could solve this with regular expressions, along the lines of:
doc <- readLines(n = 5)
1 | Policy analysis, Review of other research
2 | Bla
3 | Review of research, Original empirical research
4 | Policy analysis, Theoretical
5 | Review of other research
items <- c("Review of other research",
"Original empirical research",
"Policy analysis")
(others <- gsub(sprintf("(,\\s)?(%s)(,\\s)?", paste(items, collapse = "|")), "",
sub(".*\\|\\s(.*)", "\\1", doc)))
# [1] "" "Bla" "Review of research"
# [4] "Theoretical " ""
sub(sprintf("(,\\s)?(%s)(,\\s)?", paste(others[others != ""], collapse = "|")), "", doc)
# [1] "1 | Policy analysis, Review of other research"
# [2] "2 | "
# [3] "3 | Original empirical research"
# [4] "4 | Policy analysis"
# [5] "5 | Review of other research"
Thanks, Luke, got it. Not elegant at all, but it works:
items <- c("Review of other research",
"Original empirical research",
"Policy analysis")
ResearchTypeOther <- data.frame((others <- gsub(sprintf("(,\\s)?(%s)(,\\s)?", paste(items, collapse = "|")), "",
sub(".*\\|\\s(.*)", "\\1", Private$ResearchType))))
ResearchTypeOther[ResearchTypeOther==""] <- NA
ResearchTypeOther <- na.omit(ResearchTypeOther)
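Since the question mentions about 20 other columns with the same layout, the steps above could be wrapped in a small reusable function. The following is only a sketch under stated assumptions: extract_other is a hypothetical helper name, the sample Private data frame is rebuilt from the table in the question, and responses are split on commas instead of being stripped with a regular expression, so free text that itself contains a comma is split and rejoined.

```r
# Preselected responses, as in the answer above.
items <- c("Review of other research",
           "Original empirical research",
           "Policy analysis")

# Hypothetical helper: returns the free-text part of each response,
# or NA when the response consists only of preselected items.
extract_other <- function(responses, items) {
  parts <- strsplit(as.character(responses), ",\\s*")
  vapply(parts, function(p) {
    p <- trimws(p)
    other <- p[!is.na(p) & nzchar(p) & !(p %in% items)]
    if (length(other) == 0) NA_character_ else paste(other, collapse = ", ")
  }, character(1))
}

# Sample data rebuilt from the question (assumption: the real data frame
# has more columns, such as PubLang and Reviewer).
Private <- data.frame(
  ID = 1:5,
  ResearchType = c("Policy analysis, Review of other research",
                   "Bla",
                   "Review of other research, Original empirical research",
                   "Policy analysis, Theoretical",
                   "Review of other research"),
  stringsAsFactors = FALSE
)

Private$ResearchTypeOtherText <- extract_other(Private$ResearchType, items)

# Keep the full rows that contain any "other" response.
ResearchTypeOther <- Private[!is.na(Private$ResearchTypeOtherText),
                             c("ID", "ResearchType")]
print(ResearchTypeOther)
```

For another question, only the column and the items vector would need to change, for example extract_other(Private$SomeOtherColumn, other_items), where both names are placeholders.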
You could try this (using doc and items from @lukeA):
library(stringr)
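doc[sapply(strsplit(doc, "\\d +\\||,"), function(x) {
  x1 <- str_trim(x)        # strip whitespace around each piece
  x2 <- x1[x1 != '']       # drop the empty piece left by the ID prefix
  indx <- x2 %in% items    # TRUE for preselected responses
  # keep the row unless its last response is a preselected one
  !(any(indx) & tail(indx, 1))
})]
#[1] "2 | Bla" "4 | Policy analysis, Theoretical"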
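Note that the check above keeps a row unless its last response is preselected, so it assumes the free-text answer always comes last. With the doc from @lukeA, row 3 ("Review of research, Original empirical research") has its non-preselected part first and is therefore dropped. A position-independent variant in base R might look like the sketch below. It is an illustration, not part of the original answer; doc is written out as a character vector so the snippet runs non-interactively, and has_other and rows_with_other are names chosen for this example.

```r
# doc and items as in the answer by @lukeA, written out explicitly.
doc <- c("1 | Policy analysis, Review of other research",
         "2 | Bla",
         "3 | Review of research, Original empirical research",
         "4 | Policy analysis, Theoretical",
         "5 | Review of other research")
items <- c("Review of other research",
           "Original empirical research",
           "Policy analysis")

# Remove the "ID | " prefix, then split each row into single responses.
responses <- sub("^\\s*\\d+\\s*\\|\\s*", "", doc)

# A row qualifies if any of its responses is not a preselected item,
# regardless of where that response appears in the row.
has_other <- vapply(strsplit(responses, ","), function(x) {
  x <- trimws(x)
  any(nzchar(x) & !(x %in% items))
}, logical(1))

rows_with_other <- doc[has_other]
print(rows_with_other)
# [1] "2 | Bla"
# [2] "3 | Review of research, Original empirical research"
# [3] "4 | Policy analysis, Theoretical"
```

With the data exactly as posted in the question, where row 3 reads "Review of other research", only rows 2 and 4 would be returned.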