如何使用 bash 替换多行 FASTA 文件中的特定字符串模式?

问题描述 投票:0回答:1

我有一个大型多行 FASTA 文件,如下所示:

>NWQ47741.1 CLTR1 protein, partial [Melospiza_melodia] Vertebrate
CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE
>XP_002763076.2 cysteinyl leukotriene receptor 1 [Callithrix_jacchus] Vertebrate
MDGTGNLTVSSATCHDTIDEFRNQVYSTLYSMISVVGFFGNGFVLYVLIKTYHEKSAFQI
YMINLAIADLLCVCTLPLRVVYYVHKGIWFFGDFLCRLSTYALYVNLYCSIFFMTAMSFF
RCIAIVFPVQNINLVTQKKARFVCVGIWIFVILASSPFLITKSYKDEKNNTKCFEPPQDN
QTKNHVLILHYVSLFLGFIIPFVIIIVCYTMIILTLLKKSMKKNLSSHKKAIRMIMVVTA
AFLVSFMPYHIQRTIHLHFLHNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFFSGG
NFRRRLSTFRKHSLSSMTYVPRKKASLPEKGEEICKV
>XP_036988076.1 cysteinyl leukotriene receptor 1 [Artibeus_jamaicensis] Vertebrate
MDGTGNLTASSASNNMCNSSIDDFRNQVYSTMYSMISIVGFFGNGFVLYVLIRTYHEKSA
FQIYMINLAVSDLLCVCTLPLRVVYYVHKGMWFFGDILCRLSTYALYVNLYCSIFFMTAM
SFFRCIAIVFPVKNINLVTEKKARFVCASIWVFVILTSSPFLMSKSYKDEKNNTKCFEPP
QDNETKNHIFILHYVSLLVGFLIPFIIIIVCYTMIIFTLLKNSMQKNVPSRKKAVGMIII
VTAAFLISFMPYHIQRTIHLHFLYNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFF
SGGNFRRRLSTFRKHSLSSMTYVPKKKVSLPEKEDEVCK

我需要替换文件中每个序列中的特定子字符串。要替换的子字符串是入藏 ID 之后、第一个“[”字符之前的部分。例如,在第一个序列中,我想将“CLTR1 Protein,partial”替换为“new_string”,结果是:

>NWQ47741.1 new_string [Melospiza_melodia] Vertebrate
CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE

我正在寻找一种使用 AWK 或 Sed 来实现此目的的方法,因为文件很大,并且我想要一个有效的解决方案。有人可以提供此任务的示例脚本或命令吗?

谢谢您的帮助!

我尝试了以下脚本:

awk '/^>/ {gsub(/\[[^[]*/, "new_string"); print; next} 1' your_file.fasta > modified_file.fasta

但得到以下输出:

>NWQ47741.1 CLTR1 protein, partial new_string
CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE
>XP_002763076.2 cysteinyl leukotriene receptor 1 new_string
MDGTGNLTVSSATCHDTIDEFRNQVYSTLYSMISVVGFFGNGFVLYVLIKTYHEKSAFQI
YMINLAIADLLCVCTLPLRVVYYVHKGIWFFGDFLCRLSTYALYVNLYCSIFFMTAMSFF
RCIAIVFPVQNINLVTQKKARFVCVGIWIFVILASSPFLITKSYKDEKNNTKCFEPPQDN
QTKNHVLILHYVSLFLGFIIPFVIIIVCYTMIILTLLKKSMKKNLSSHKKAIRMIMVVTA
AFLVSFMPYHIQRTIHLHFLHNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFFSGG
NFRRRLSTFRKHSLSSMTYVPRKKASLPEKGEEICKV
>XP_036988076.1 cysteinyl leukotriene receptor 1 new_string
MDGTGNLTASSASNNMCNSSIDDFRNQVYSTMYSMISIVGFFGNGFVLYVLIRTYHEKSA
FQIYMINLAVSDLLCVCTLPLRVVYYVHKGMWFFGDILCRLSTYALYVNLYCSIFFMTAM
SFFRCIAIVFPVKNINLVTEKKARFVCASIWVFVILTSSPFLMSKSYKDEKNNTKCFEPP
QDNETKNHIFILHYVSLLVGFLIPFIIIIVCYTMIIFTLLKNSMQKNVPSRKKAVGMIII
VTAAFLISFMPYHIQRTIHLHFLYNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFF
SGGNFRRRLSTFRKHSLSSMTYVPKKKVSLPEKEDEVCK
bash awk sed bioinformatics fasta
1个回答
0
投票

使用

sed

$ sed -E '/^>/s/ [^[]*/ new_string /' input_file
>NWQ47741.1 new_string [Melospiza_melodia] Vertebrate
CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE
>XP_002763076.2 new_string [Callithrix_jacchus] Vertebrate
MDGTGNLTVSSATCHDTIDEFRNQVYSTLYSMISVVGFFGNGFVLYVLIKTYHEKSAFQI
YMINLAIADLLCVCTLPLRVVYYVHKGIWFFGDFLCRLSTYALYVNLYCSIFFMTAMSFF
RCIAIVFPVQNINLVTQKKARFVCVGIWIFVILASSPFLITKSYKDEKNNTKCFEPPQDN
QTKNHVLILHYVSLFLGFIIPFVIIIVCYTMIILTLLKKSMKKNLSSHKKAIRMIMVVTA
AFLVSFMPYHIQRTIHLHFLHNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFFSGG
NFRRRLSTFRKHSLSSMTYVPRKKASLPEKGEEICKV
>XP_036988076.1 new_string [Artibeus_jamaicensis] Vertebrate
MDGTGNLTASSASNNMCNSSIDDFRNQVYSTMYSMISIVGFFGNGFVLYVLIRTYHEKSA
FQIYMINLAVSDLLCVCTLPLRVVYYVHKGMWFFGDILCRLSTYALYVNLYCSIFFMTAM
SFFRCIAIVFPVKNINLVTEKKARFVCASIWVFVILTSSPFLMSKSYKDEKNNTKCFEPP
QDNETKNHIFILHYVSLLVGFLIPFIIIIVCYTMIIFTLLKNSMQKNVPSRKKAVGMIII
VTAAFLISFMPYHIQRTIHLHFLYNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFF
SGGNFRRRLSTFRKHSLSSMTYVPKKKVSLPEKEDEVCK
© www.soinside.com 2019 - 2024. All rights reserved.