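I am trying to use str_extract to pull words such as "Present", "Retained", or "Absent" out of text strings, while allowing for variable spacing and punctuation. Where is my logic going wrong?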
test<-c("as follows: ABC Staining Present in Tissue","ABC: Retained in the tumor cells ","as follows: ABC Staining is Absent ABC","as follows: ABC Staining is missing in Tissue","as follows: ABC: StainingAbsent in Tissue","as follows: ABC: Staining Present in Tissue","as follows ABC Staining Present ABC")
pattern<-"ABC[:\\s]*[STAINING\\s]*(.*?)(?=\\s*\\bIN|ABC\\b)"
str_match(toupper(test), pattern)[,2]
You can use stringr::str_match:
test<-c("as follows: ABC Staining Absent in Tissue","as follows: ABC: StainingPresent in Tissue","as follows: ABC: Staining Present in Tissue","as follows ABC Staining Present in Tissue extra words here in Present")
library(stringr)
pattern<-"ABC[:\\s]*Staining[:\\s]*(.*?)(?=\\s*\\bin\\b)"
unique(str_match(test, pattern)[,2])
## => [1] "Absent" "Present"
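As a small sketch of why the second column is selected, the snippet below prints the full matrix that str_match returns for a shortened copy of the test vector. It assumes stringr is installed; the variable names m, full_match and status are only illustrative.

```r
library(stringr)

test <- c("as follows: ABC Staining Absent in Tissue",
          "as follows: ABC: StainingPresent in Tissue")
pattern <- "ABC[:\\s]*Staining[:\\s]*(.*?)(?=\\s*\\bin\\b)"

# str_match returns a character matrix with one row per input string.
m <- str_match(test, pattern)

# Column 1 holds the whole match, column 2 holds capture group 1.
full_match <- m[, 1]
status <- m[, 2]

print(m)
```

The lookahead is not part of the match, so column 1 ends right after the captured word and column 2 holds only the word itself.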
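Details
- ABC - the literal string ABC
- [:\s]* - zero or more colons or whitespace characters
- Staining - the literal string Staining
- [:\s]* - zero or more colons or whitespace characters
- (.*?) - Group 1: any zero or more characters other than line break characters, as few as possible
- (?=\s*\bin\b) - a positive lookahead that requires zero or more whitespace characters followed by the whole word in immediately to the right of the current position.

This seems to work: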
str_extract(test, "(?<=Staining\\s{0,5})\\w+")
[1] "Absent" "Present" "Present" "Present"
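As for what is wrong with your logic: the logic itself seems to be correct, although in my view you are cramming too much non-essential information into the pattern (for example, there seems to be no need to include ABC or the positive lookahead (?=in) in the pattern). The main problem is syntactic. Look at the error you get: Look-Behind pattern matches must have a bounded maximum length. (U_REGEX_LOOK_BEHIND_LIMIT)
This tells you that the quantifier * cannot be used inside a lookbehind, because it matches zero or more times, that is, it is not bounded in length. Quantification with {0,5} (or any other number in place of 5) has a bounded maximum length and is therefore accepted.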
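The difference can be seen side by side in the following sketch. It assumes stringr is installed and that the underlying ICU engine rejects the unbounded lookbehind as described above; the vector x and the names unbounded and bounded are made up for illustration.

```r
library(stringr)

x <- c("ABC Staining Absent in Tissue", "ABC: StainingPresent in Tissue")

# Unbounded lookbehind: \\s* has no maximum length, so the pattern is
# expected to be rejected. tryCatch keeps the error message instead of stopping.
unbounded <- tryCatch(
  str_extract(x, "(?<=Staining\\s*)\\w+"),
  error = function(e) conditionMessage(e)
)

# Bounded lookbehind: \\s{0,5} matches at most 5 whitespace characters,
# so the pattern compiles and the word after "Staining" is extracted.
bounded <- str_extract(x, "(?<=Staining\\s{0,5})\\w+")

print(unbounded)
print(bounded)
```

The upper bound of 5 is arbitrary; any finite bound that covers the spacing seen in the data should do.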