vi regex:查找包含KSP蛋白片段的特定染色体及其片段

问题描述 投票:-1回答:1

有一个蛋白质序列的fasta文件,我想找到包含3个KSP氨基酸的2号或5号染色体。如何编写模式字符串。

这里是fasta文件的简要概述:

>AT1G05230.1
MFEPNMLLAAMNNADSNNHNYNHEDNNNEGFLRDDEFDSPNTKSGSENQEGGSGNDQDPLHPNKKKRYHRHTQLQIQEME
. . . .
DFLRDENSRNEWDILSNGGVVQEMAHIANGRDTGNCVSLLRVNSANSSQSNMLILQESCTDPTASFVIYAPVDIVAMNIV
LNGGDPDYVALLPSGFAILPDGNANSGAPGGDGGSLLTVAFQILVDSVPTAKLSLGSVATVNNLIACTVERIKASMSCET
A*

>AT1G05230.2
MFEPNMLLAAMNNADSNNHNYNHEDNNNEGFLRDDEFDSPNTKSGSENQEGGSGNDQDPLHPNKKKRYHRHTQLQIQEME
. . . . . .
DFLRDENSRNEWDILSNGGVVQEMAHIANGRDTGNCVSLLRVNSANSSQSNMLILQESCTDPTASFVIYAPVDIVAMNIV
LNGGDPDYVALLPSGFAILPDGNANSGAPGGDGGSLLTVAFQILVDSVPTAKLSLGSVATVNNLIACTVERIKASMSCET
A*

>AT2G35940.1
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
..........
....KSP......TNYHMNPNHNGDLEGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
..........
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*

>AT2G35940.2
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
................................................................................
RAWLFEHFLHPYPKDSDKHMLAKQTGLTRSQVSNWFINARVRLWKPMVEEMYMEEMKEQAKNMGSMEKTPLDQSNEDSAS
.....KSP..................EGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
................................................................................
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*

>AT3G03660.1
MDQEQTPHSPTRHSRSPPSSASGSTSAEPVRSRWSPKPEQILILESIFHSGMVNPPKEETVRIRKMLEKFGAVGDANVFY
................................................................................
VPLPTDEFGFLMHSLQHGEAYFLVPRQT*

>AT3G11260.1
MSFSVKGRSLRGNNNGGTGTKCGRWNPTVEQLKILTDLFRAGLRTPTTDQIQKISTELSFYGKIESKNVFYWFQNHKARE
................................................................................
PYSSCGAEMEHPPPLDLRLSFL*

>AT3G61890.1
MEEGDFFNCCFSEISSGMTMNKKKMKKSNNQKRFSEEQIKSLELIFESETRLEPRKKVQVARELGLQPRQVAIWFQNKRA
...KSP..........................................................................
RLDQGSVLCNDGDYNNNIKTEYFGFEEETDHELMNIVEKADDSCLTSSENWGGFNSDSLLDQSSSNYPNWWEFWS*

................................................................................
(lots of sequences)
................................................................................

>AT5G11060.1
MAFHNNHFNHFTDQQQHQPPPPPQQQQQQHFQESAPPNWLLRSDNNFLNLHTAASAAATSSDSPSSAAANQWLSRSSSFL
................................................................................
SVLKSWWQSHSKWPYPTEEDKARLVQETGLQLKQINNWFINQRKRNWHSNPSSSTVSKNKRRSNAGENSGRDR*

>AT5G15150.1
MYMYEEERNNINNNQEGLRLEMAFPQHGFMFQQLHEDNAHHLPSPTSLPSCPPHLFYGGGGNYMMNRSMSFTGVSDHHHL
..KSP...........TTTNNMNDQDQVGEEDNLSDDGSHMMLGEKKKRLNLEQVRALEKSFELGNKLEPERKMQLAKAL
QNRRARWKTKQLERDYDSLKKQFDVLKSDNDSLLAHNKKLHAELVALKKHDRKESAKIKREFAEASWSNNGSTENNHNNN
SSDANHVSMIKDLFPSSIRSATATTTSTHIDHQIVQDQDQGFCNMFNGIDETTSASYWAWPDQQQQHHNHHQFN*

[首先,我可以写与2或5号染色体匹配的模式字符串,例如>AT[25]G。当我这样写模式字符串(>AT[25]G.*KSP.*)以匹配满足条件的序列时,我失败了。

顺便说一下,所有序列都以大于符号>开头并以星号*结尾,并且所有氨基酸均大写。

例如,预期结果将是2号和5号染色体上KSP的所有三个氨基酸的序列

>AT2G35940.1
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
..........
....KSP......TNYHMNPNHNGDLEGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
..........
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*

>AT2G35940.2
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
................................................................................
RAWLFEHFLHPYPKDSDKHMLAKQTGLTRSQVSNWFINARVRLWKPMVEEMYMEEMKEQAKNMGSMEKTPLDQSNEDSAS
.....KSP..................EGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
................................................................................
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*

>AT5G15150.1
MYMYEEERNNINNNQEGLRLEMAFPQHGFMFQQLHEDNAHHLPSPTSLPSCPPHLFYGGGGNYMMNRSMSFTGVSDHHHL
..KSP...........TTTNNMNDQDQVGEEDNLSDDGSHMMLGEKKKRLNLEQVRALEKSFELGNKLEPERKMQLAKAL
QNRRARWKTKQLERDYDSLKKQFDVLKSDNDSLLAHNKKLHAELVALKKHDRKESAKIKREFAEASWSNNGSTENNHNNN
SSDANHVSMIKDLFPSSIRSATATTTSTHIDHQIVQDQDQGFCNMFNGIDETTSASYWAWPDQQQQHHNHHQFN*

我如何在vim中编写正则表达式以使其匹配,希望您能对我有所帮助,非常感谢您阅读我的问题。

regex vi
1个回答
0
投票

这是多行搜索。请尝试以下类似方法,并根据需要进行修改。我在匹配的字符类中包括了换行符,制表符和字母数字。

^>AT[25]G[\t\n[:alnum:].]*KSP[\t\n[:alnum:].]*\*$

© www.soinside.com 2019 - 2024. All rights reserved.