有一个蛋白质序列的fasta文件,我想找到包含3个KSP氨基酸的2号或5号染色体。如何编写模式字符串。
这里是fasta文件的简要概述:
>AT1G05230.1
MFEPNMLLAAMNNADSNNHNYNHEDNNNEGFLRDDEFDSPNTKSGSENQEGGSGNDQDPLHPNKKKRYHRHTQLQIQEME
. . . .
DFLRDENSRNEWDILSNGGVVQEMAHIANGRDTGNCVSLLRVNSANSSQSNMLILQESCTDPTASFVIYAPVDIVAMNIV
LNGGDPDYVALLPSGFAILPDGNANSGAPGGDGGSLLTVAFQILVDSVPTAKLSLGSVATVNNLIACTVERIKASMSCET
A*
>AT1G05230.2
MFEPNMLLAAMNNADSNNHNYNHEDNNNEGFLRDDEFDSPNTKSGSENQEGGSGNDQDPLHPNKKKRYHRHTQLQIQEME
. . . . . .
DFLRDENSRNEWDILSNGGVVQEMAHIANGRDTGNCVSLLRVNSANSSQSNMLILQESCTDPTASFVIYAPVDIVAMNIV
LNGGDPDYVALLPSGFAILPDGNANSGAPGGDGGSLLTVAFQILVDSVPTAKLSLGSVATVNNLIACTVERIKASMSCET
A*
>AT2G35940.1
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
..........
....KSP......TNYHMNPNHNGDLEGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
..........
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*
>AT2G35940.2
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
................................................................................
RAWLFEHFLHPYPKDSDKHMLAKQTGLTRSQVSNWFINARVRLWKPMVEEMYMEEMKEQAKNMGSMEKTPLDQSNEDSAS
.....KSP..................EGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
................................................................................
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*
>AT3G03660.1
MDQEQTPHSPTRHSRSPPSSASGSTSAEPVRSRWSPKPEQILILESIFHSGMVNPPKEETVRIRKMLEKFGAVGDANVFY
................................................................................
VPLPTDEFGFLMHSLQHGEAYFLVPRQT*
>AT3G11260.1
MSFSVKGRSLRGNNNGGTGTKCGRWNPTVEQLKILTDLFRAGLRTPTTDQIQKISTELSFYGKIESKNVFYWFQNHKARE
................................................................................
PYSSCGAEMEHPPPLDLRLSFL*
>AT3G61890.1
MEEGDFFNCCFSEISSGMTMNKKKMKKSNNQKRFSEEQIKSLELIFESETRLEPRKKVQVARELGLQPRQVAIWFQNKRA
...KSP..........................................................................
RLDQGSVLCNDGDYNNNIKTEYFGFEEETDHELMNIVEKADDSCLTSSENWGGFNSDSLLDQSSSNYPNWWEFWS*
................................................................................
(lots of sequences)
................................................................................
>AT5G11060.1
MAFHNNHFNHFTDQQQHQPPPPPQQQQQQHFQESAPPNWLLRSDNNFLNLHTAASAAATSSDSPSSAAANQWLSRSSSFL
................................................................................
SVLKSWWQSHSKWPYPTEEDKARLVQETGLQLKQINNWFINQRKRNWHSNPSSSTVSKNKRRSNAGENSGRDR*
>AT5G15150.1
MYMYEEERNNINNNQEGLRLEMAFPQHGFMFQQLHEDNAHHLPSPTSLPSCPPHLFYGGGGNYMMNRSMSFTGVSDHHHL
..KSP...........TTTNNMNDQDQVGEEDNLSDDGSHMMLGEKKKRLNLEQVRALEKSFELGNKLEPERKMQLAKAL
QNRRARWKTKQLERDYDSLKKQFDVLKSDNDSLLAHNKKLHAELVALKKHDRKESAKIKREFAEASWSNNGSTENNHNNN
SSDANHVSMIKDLFPSSIRSATATTTSTHIDHQIVQDQDQGFCNMFNGIDETTSASYWAWPDQQQQHHNHHQFN*
[首先,我可以写与2或5号染色体匹配的模式字符串,例如>AT[25]G
。当我这样写模式字符串(>AT[25]G.*KSP.*
)以匹配满足条件的序列时,我失败了。
顺便说一下,所有序列都以大于符号>
开头并以星号*
结尾,并且所有氨基酸均大写。
例如,预期结果将是2号和5号染色体上KSP的所有三个氨基酸的序列
>AT2G35940.1
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
..........
....KSP......TNYHMNPNHNGDLEGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
..........
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*
>AT2G35940.2
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
................................................................................
RAWLFEHFLHPYPKDSDKHMLAKQTGLTRSQVSNWFINARVRLWKPMVEEMYMEEMKEQAKNMGSMEKTPLDQSNEDSAS
.....KSP..................EGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
................................................................................
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*
>AT5G15150.1
MYMYEEERNNINNNQEGLRLEMAFPQHGFMFQQLHEDNAHHLPSPTSLPSCPPHLFYGGGGNYMMNRSMSFTGVSDHHHL
..KSP...........TTTNNMNDQDQVGEEDNLSDDGSHMMLGEKKKRLNLEQVRALEKSFELGNKLEPERKMQLAKAL
QNRRARWKTKQLERDYDSLKKQFDVLKSDNDSLLAHNKKLHAELVALKKHDRKESAKIKREFAEASWSNNGSTENNHNNN
SSDANHVSMIKDLFPSSIRSATATTTSTHIDHQIVQDQDQGFCNMFNGIDETTSASYWAWPDQQQQHHNHHQFN*
我如何在vim
中编写正则表达式以使其匹配,希望您能对我有所帮助,非常感谢您阅读我的问题。
这是多行搜索。请尝试以下类似方法,并根据需要进行修改。我在匹配的字符类中包括了换行符,制表符和字母数字。
^>AT[25]G[\t\n[:alnum:].]*KSP[\t\n[:alnum:].]*\*$