I have a FASTA file that looks like this:
>NZ_UARI01000011.1 Cronobacter sakazakii strain NCTC11467, whole genome shotgun sequence
GCGCATTTCTTATTACGGAGAAATACAGCAGCGTGTCTGTTTCAATTTTCAGCTTGTTCCGGATTGTTAAAGAGCAAATACTT ...
>NZ_UARI01000001.1 Cronobacter sakazakii strain NCTC11467, whole genome shotgun sequence
CAATTTTACTTGTTGATATAACAATCACGCTAACTATTCAGCCAATAGCTCCCGCATTAAAACCAGCTACTTCAGCCAAA...
and I want to change the headers to this:
'>Cronobacter sakazakii strain NCTC11467_1
GCGCATTTCTTATTACGGAGAAATACAGCAGCGTGTCTGTTTCAATTTTCAGCTTGTTCCGGATTGTTAAAGAGCAAATACTT ...
'>Cronobacter sakazakii strain NCTC11467_2
CAATTTTACTTGTTGATATAACAATCACGCTAACTATTCAGCCAATAGCTCCCGCATTAAAACCAGCTACTTCAGCCAAA...
(and so on) (ignore the ' at the start of each header)
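Then I want to save the file under the header name. Ideally I don't want to create a new FASTA, just replace the existing file with the corrected one: Cronobacter_sakazakii_strain NCTC11467.fasta
This is easy enough to do for a single file, but I have more than 600 files, so doing each one by hand is not the route I want to take.
I have written the script below, in which I use a regular expression to isolate the part of the header I want and store it in a list called new_new. I then want to replace every line that begins with '>' with the matching value, followed by _1/_2/_3/... (as shown above). Can you help me accomplish this? If the script I provided is not worth continuing and you have a better solution, please let me know.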
#usr/bin/python

import sys
import os
import re
import csv

#sys.argv[1] =fasta
#sys.argv[2] = list of header names (mass)

#Gather existing headers to list (new_new)
with open(sys.argv[1], "r+") as text_file:
    lines = text_file.readlines()[1:]
    mylist = []
    new_new = []
    for i in lines:
        if '.' in i:
            mylist.append(i)
    pattern = r">*Cronobacter +\w* +\w* +.*[,]"
    regex = re.compile(pattern, re.IGNORECASE)
    for j in mylist:
        for match in regex.finditer(j):
            value = match.group(0)
            new_new.append(value)
    for k in lines:
        if '>' in k:
            k= k.replace('.*',new_new[value])
text_file.close()
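import os
import re
import sys

from Bio.SeqIO.FastaIO import SimpleFastaParser

# sys.argv[1] = path to the FASTA file
filename = sys.argv[1]

# Read every record as a (title, sequence) tuple
with open(filename, "r") as text_file:
    fastas = list(SimpleFastaParser(text_file))

# Keep only the "Cronobacter <species> strain <name>" part before the comma
for idx, (title, seq) in enumerate(fastas):
    s = re.search(r"Cronobacter +\w* +\w* +.*(?=,)", title, re.IGNORECASE)
    if s is None:
        sys.exit(f"Unexpected header in {filename}: {title}")
    fastas[idx] = s.group(), seq

# Name the file after the first shortened header
newfilename = fastas[0][0] + '.fasta'

# Overwrite the original file, numbering the headers _1, _2, ...
with open(filename, 'w') as text_file:
    for idx, (title, seq) in enumerate(fastas):
        text_file.write(f'>{title}_{idx + 1}\n{seq}\n')

# newfilename is a bare name, so the renamed file lands in the current directory
os.rename(filename, newfilename)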
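
Since the question mentions more than 600 files, here is a rough sketch of how the same idea could be applied to a whole directory in one go. It is not the script above: it uses only the standard library instead of Biopython, keeps the sequence lines exactly as they are wrapped in the input, and places the renamed file next to the original. The names `rename_headers` and `rename_all`, and the `*.fasta` glob pattern, are my own assumptions for illustration, and it assumes every header contains the "Cronobacter ... ," text the regex expects.

```python
import re
from pathlib import Path

# Same pattern as in the script above: the text from "Cronobacter" up to
# (but not including) the last comma in the header line.
HEADER_RE = re.compile(r"Cronobacter +\w* +\w* +.*(?=,)", re.IGNORECASE)


def rename_headers(path):
    """Rewrite the headers of one FASTA file in place, then rename the file.

    Hypothetical helper. Returns the new Path; raises ValueError if a header
    does not match HEADER_RE or if the file contains no headers at all.
    """
    path = Path(path)
    out_lines = []
    count = 0
    first_name = None
    for line in path.read_text().splitlines():
        if line.startswith(">"):
            match = HEADER_RE.search(line)
            if match is None:
                raise ValueError(f"unexpected header in {path}: {line!r}")
            count += 1
            name = match.group()
            if first_name is None:
                first_name = name
            # Number the headers _1, _2, ... in order of appearance.
            out_lines.append(f">{name}_{count}")
        else:
            # Sequence lines are copied through unchanged.
            out_lines.append(line)
    if first_name is None:
        raise ValueError(f"no FASTA headers found in {path}")
    path.write_text("\n".join(out_lines) + "\n")
    # Keep the renamed file in the same directory as the original.
    new_path = path.with_name(first_name + ".fasta")
    path.rename(new_path)
    return new_path


def rename_all(directory, pattern="*.fasta"):
    """Apply rename_headers to every file in `directory` matching `pattern`."""
    # Build the full listing first so renamed files are not picked up again.
    files = sorted(Path(directory).glob(pattern))
    return [rename_headers(p) for p in files]
```

You would call it as `rename_all("path/to/your/fastas")`, ideally on a copy of the data first. One thing to watch for: if two input files yield the same shortened header, the second rename will collide with the first, so you may want to check for an existing target before renaming.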