Extracting a pattern from a delimited string in R

Problem description (votes: 0, answers: 2)

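My ID strings are delimited by |. I want to extract the ID that starts with ENSG. If a string contains several IDs matching this pattern, I only want the first match.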

teststring <- c("ENSG00032423", "ENSG00032411234|ENSMUS000124124", "ENSMUS00068967832|ENSG00045345112", "ENSG00032411234|ENSG000865852297",  "ENSMUS00068967832|ENSG00045345112|ENSR00072699324", "ENSMUS00068967832|ENSR000124124124|ENSG00045345112")

gsub(pattern = ".*(ENSG[0-9]+[^|]).*",
     replacement = "\\1",
     x = teststring)

[1] "ENSG00032423"     "ENSG00032411234"  "ENSG00045345112"  "ENSG000865852297" "ENSG00045345112"  "ENSG00045345112"

I tried gsub (above), but the regular expression I used returns only the last match when a string contains more than one.

r regex string pattern-matching
2 answers
1
vote

Try:

teststring <- c(
  "ENSG00032423",
  "ENSG00032411234|ENSMUS000124124",
  "ENSMUS00068967832|ENSG00045345112",
  "ENSG00032411234|ENSG000865852297", 
  "ENSMUS00068967832|ENSG00045345112|ENSR00072699324", 
  "ENSMUS00068967832|ENSR000124124124|ENSG00045345112"
  )

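# Option 1: a tempered pattern that cannot run past the first "ENSG"
gsub(pattern = "(?:(?!ENSG).)*(ENSG[0-9]+[^|]).*",
     replacement = "\\1",
     perl = TRUE,
     x = teststring)

# Option 2: a lazy .*? that stops at the first "ENSG"
gsub(pattern = ".*?(ENSG[0-9]+[^|]).*",
     replacement = "\\1",
     perl = TRUE,
     x = teststring)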

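See: regex101

Explanation

In the pattern from the question, the leading .* is greedy: it consumes as much of the string as it can while still letting the rest of the pattern match, so the capture group lands on the last ENSG ID. The first pattern replaces it with (?:(?!ENSG).)*, which advances one character at a time only while the text ahead does not start with ENSG, so it stops right before the first ENSG ID. The second pattern uses the lazy quantifier .*?, which consumes as little as possible before the group can match. In both cases the trailing .* consumes the rest of the string, so the replacement leaves only the captured ID. The lookahead in the first pattern requires perl = TRUE.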
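
As a possible alternative to substitution, base R's regexpr() reports only the first match in each string, and regmatches() extracts it. The sketch below assumes every ID of interest is ENSG followed by digits; the names first_hit and first_ensg are illustrative only and are not part of the answer above.

```r
teststring <- c(
  "ENSG00032423",
  "ENSG00032411234|ENSMUS000124124",
  "ENSMUS00068967832|ENSG00045345112",
  "ENSG00032411234|ENSG000865852297",
  "ENSMUS00068967832|ENSG00045345112|ENSR00072699324",
  "ENSMUS00068967832|ENSR000124124124|ENSG00045345112"
)

# regexpr() returns the position of the first match per element (-1 if none)
first_hit <- regexpr("ENSG[0-9]+", teststring)

# regmatches() returns only the matched substrings, so fill a vector of NAs
# at the positions that actually had a match
first_ensg <- rep(NA_character_, length(teststring))
first_ensg[first_hit != -1] <- regmatches(teststring, first_hit)

first_ensg
# [1] "ENSG00032423"    "ENSG00032411234" "ENSG00045345112" "ENSG00032411234"
# [5] "ENSG00045345112" "ENSG00045345112"
```

Strings without any ENSG ID come back as NA here, whereas the gsub approach returns such strings unchanged.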


0
votes

You can use

gsub(
   pattern = "(?:(?<![^|])|\\|(?=[^|]*$))(?!ENSG)[^|]*\\|?",
   replacement = "",
   x = teststring,
   perl=TRUE
)

See the R online demo and the regex demo.
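
The pattern above works by deleting the entries that do not start with ENSG, so a string holding two ENSG IDs appears to keep both of them. Because the delimiter is a fixed character, another option is to split each string and keep the first piece with the wanted prefix, which avoids a regular expression altogether. This is a minimal sketch and not the method of this answer; first_with_prefix is a hypothetical helper name, and startsWith() assumes base R 3.3.0 or later.

```r
teststring <- c(
  "ENSG00032423",
  "ENSG00032411234|ENSMUS000124124",
  "ENSMUS00068967832|ENSG00045345112",
  "ENSG00032411234|ENSG000865852297",
  "ENSMUS00068967832|ENSG00045345112|ENSR00072699324",
  "ENSMUS00068967832|ENSR000124124124|ENSG00045345112"
)

# Hypothetical helper: first delimited ID that starts with `prefix`, else NA
first_with_prefix <- function(x, prefix = "ENSG", sep = "|") {
  # fixed = TRUE treats "|" as a literal character, not regex alternation
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(ids) {
    hits <- ids[startsWith(ids, prefix)]
    if (length(hits) > 0) hits[1] else NA_character_
  }, character(1))
}

first_with_prefix(teststring)
# [1] "ENSG00032423"    "ENSG00032411234" "ENSG00045345112" "ENSG00032411234"
# [5] "ENSG00045345112" "ENSG00045345112"
```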
