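My ID strings are separated by |. I want to extract the ID that starts with ENSG. If a string contains several IDs matching this pattern, I only want the first match.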
teststring <- c("ENSG00032423", "ENSG00032411234|ENSMUS000124124", "ENSMUS00068967832|ENSG00045345112", "ENSG00032411234|ENSG000865852297", "ENSMUS00068967832|ENSG00045345112|ENSR00072699324", "ENSMUS00068967832|ENSR000124124124|ENSG00045345112")
gsub(pattern = ".*(ENSG[0-9]+[^|]).*",
replacement = "\\1",
x = teststring)
[1] "ENSG00032423" "ENSG00032411234" "ENSG00045345112" "ENSG000865852297" "ENSG00045345112" "ENSG00045345112"
I tried gsub, but the regular expression I used returns only the last match when a string contains several matches.
Try:
teststring <- c(
"ENSG00032423",
"ENSG00032411234|ENSMUS000124124",
"ENSMUS00068967832|ENSG00045345112",
"ENSG00032411234|ENSG000865852297",
"ENSMUS00068967832|ENSG00045345112|ENSR00072699324",
"ENSMUS00068967832|ENSR000124124124|ENSG00045345112"
)
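# Option 1: tempered greedy token; the prefix can never run past the first "ENSG"
gsub(pattern = "(?:(?!ENSG).)*(ENSG[0-9]+[^|]).*",
     replacement = "\\1",
     perl = TRUE,
     x = teststring)
# Option 2: lazy quantifier; the prefix stops at the first position where the rest can match
gsub(pattern = ".*?(ENSG[0-9]+[^|]).*",
     replacement = "\\1",
     perl = TRUE,
     x = teststring)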
See: regex101
Explanation
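(?:(?!ENSG).)* : matches as many characters as possible, checking at each position that an "ENSG" ID does not start there (https://www.rexegg.com/regex-quantifiers.html#tempered_greed), or
.*? : lazily matches as few characters as possible (https://www.rexegg.com/regex-quantifiers.html#lazy_solution)
(ENSG[0-9]+[^|]) : matches and captures the ID
.* : matches the rest of the line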
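As a possible alternative to rewriting the whole string, base R's regexpr() reports only the first match in each element, and regmatches() extracts it. The sketch below is not part of the answer above: the helper name first_ensg is hypothetical, and it assumes an ID is "ENSG" followed by digits only.

```r
teststring <- c(
  "ENSG00032423",
  "ENSG00032411234|ENSMUS000124124",
  "ENSMUS00068967832|ENSG00045345112",
  "ENSG00032411234|ENSG000865852297",
  "ENSMUS00068967832|ENSG00045345112|ENSR00072699324",
  "ENSMUS00068967832|ENSR000124124124|ENSG00045345112"
)

# Hypothetical helper (not from the original answer).
# Assumes an ID is "ENSG" followed by one or more digits.
first_ensg <- function(x) {
  # regexpr() gives the position of the first match per element, or -1 if none
  hits <- regexpr("ENSG[0-9]+", x)
  # Start with NA everywhere so elements without an ENSG ID stay NA
  out <- rep(NA_character_, length(x))
  # regmatches() returns matches only for the elements that matched
  out[hits != -1] <- regmatches(x, hits)
  out
}

first_ensg(teststring)
# [1] "ENSG00032423"    "ENSG00032411234" "ENSG00045345112" "ENSG00032411234"
# [5] "ENSG00045345112" "ENSG00045345112"

first_ensg("ENSMUS000124124")
# [1] NA
```

One difference worth noting: when an element contains no ENSG ID, the gsub() calls leave it unchanged, whereas this sketch returns NA for it.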