我有一个 for 循环可以满足我的需要,但是,我想知道该循环是否可以进一步优化。
我有一本字典,其中每个键都有一个与其关联的值列表。值列表基本上是入藏号。对于每个键,id 喜欢循环遍历我的 fasta 文件,并创建一个新的 fasta 文件,其中仅包含每个给定键的值列表中的入藏号序列。
到目前为止我的循环是休闲的......
seqs=open('file_name.fasta', 'r')
seq_lines=seqs.readlines()
for i in d.keys():
fasts=open(i,'w')
for index, line in enumerate(seq_lines):
for j in d[i]: #loops thru the accession number (ie lis of values for each key)
if (j in line): #checks to see if acc number is in fasta file
header=seq_lines[index] #the header from fasta file
nuc=seq_lines[index+1] # The sequence after each header
fasts.write(header.rstrip()+'\n'+nuc)
d 是我的字典,seq_lines 是 fasta 文件。这个循环工作得很好,但是当任何给定键有大量序列时(fasta 文件总共有大约 50000 seq),它会很慢。感谢任何意见。
数据示例
这是带有值列表的字典 d={'范围1': ['MZ193866', 'OK234148'], '范围2': ['MZ371332', 'MZ315819']}
Fasta 文件,标题中包含值(它是一个文本文件,不知道如何直接上传)
seq='>MZ193866 啊啊啊啊啊啊啊啊啊 >MZ371332 GGGGCTTCTTTTCCA > OK234148 啊啊啊啊啊啊 >MZ315819 ATTTTCCCCGGGGAAAAA'
最后我希望在键后面有文件名,仅包含其值列表中的序列...
范围1 = '>MZ193866 啊啊啊啊啊啊啊啊啊 > OK234148 啊啊啊啊啊啊啊
范围2 = '>MZ371332 GGGGCTTCTTTTCCA >MZ315819 ATTTTCCCCGGGGAAAAA'
您可以按每个标题进行分组,提取入藏号并将其用作 fasta 记录的索引。现在,您可以在索引中查找,而不是扫描所有记录来查找每个感兴趣的数字。
在这里,我使用了正则表达式来解析标头 - 它非常特定于这种 fasta 格式。具有多个值的 fasta 会更复杂。
from itertools import groupby
import re
# single column fasta header with optional whitespace before >
hdr_match = re.compile(r"\s*>").match
hdr_extract = re.compile(r"\s*>\s*([^\s]+)").match
# test data
d={'range1': ['MZ193866', 'OK234148'], 'range2': ['MZ371332', 'MZ315819']}
open("file_name_test.fasta", "w").write(""">MZ193866
ATGTTTTATTATTT
>MZ371332
GGGGCTTCTATTTCCA
> OK234148
AGGGCTTTTAAAAA
>MZ315819
ATTTTCCCCGGGGAAAAA
""")
# fasta header is a single (whatever its called)
# e.g., " >MZ193866 \n" (note the non-conforming space starting some headers)
fasta_index = {}
for is_hdr, lines in groupby(open('file_name_test.fasta'), hdr_match):
if is_hdr:
hdr = hdr_extract(next(lines)).group(1)
else:
fasta_index[hdr] = "".join(line.strip() for line in lines)
for name, values in d.items():
with open(name, "w") as fasts:
for value in values:
if value in fasta_index:
fasts.write(f">{value}\n{fasta_index[value]}\n")
print("------- output -------")
for name in d:
print(name)
print(open(name).read())
print("--------")
这里有一些学习点/促进剂。几个关键概念:
限制循环。现在,您正在循环遍历
d
中每个键的所有序列数据。因此,如果您有 50k 个序列和 100 个键,则您将执行 5M 次循环。您应该(并且可以)将所有内容循环一次,总共执行 50,100 次
在字符串中搜索字符串是昂贵的。现在,您正在搜索所有内容(包括序列)来查找键值,这可能容易出错并且成本肯定很高。与起始角色
>
一起识别感兴趣的线条,并仔细处理它们。
(稍微回到#1)循环遍历匹配列表是昂贵的。在您当前的代码中,您实际上是 3 嵌套在 for 循环中,即(继续前面的示例)50k x 100(感兴趣的键)x 10k(访问号)~ 50B 执行。尽可能使用集合来匹配事物组,这几乎是瞬时的(更长的讨论)。
只有当您已经知道有东西要写时,您才应该打开文件。而且,正如评论中提到的,您需要关闭它们。
with
上下文管理器会为您处理该部分。
因此,下面的代码将所有序列数据加载到内存中,这对于 50K 序列应该很容易实现,除非序列非常大。
它还有一个用于标头的解析器,用于查找管道符号之间的登录号,但我刚刚根据上面的示例数据将其设置为“可选”,因此它将查找它们,但如果不存在,则假设标头是入藏号。注意:如果您无法从标头中的其他内容中清晰地解析出数字,您可能需要恢复到字符串搜索,但我猜标头结构良好。
> sp|4456
ABCDEFGHIJK
> sp| 5688 | stuff
DEFGHAEAGHHGAHA
KRKABC
>sp|XR650L | other stuff
IIJJHYUXY
HHGGF
> pir | JH880
BBBBDDDEEEFFFF
>pir|777|nonsense
EEEFFFHHHJJINIIIJIJIJI
> MN5502
UUUMNNMNNNNDDD
SSSTTYUAB
# capture all the data into a dictionary...
sequences = {} # accession : (header, sequence)
with open('dummy.fasta', 'r') as src:
# set up a loop
accession_num = None
curr_sequence = None
line = src.readline()
while line:
# search for ">"
if line[0] == '>':
# if we were building a sequence, it is now done, add it...
if curr_sequence:
sequences[accession_num] = (curr_header, curr_sequence)
accession_num = None
curr_sequence = None
# found a new header, now get the accession number
curr_header = line.strip()
tokens = line.split('|') # tokenize the line based on pipe symbol
if len(tokens) > 1: # found pipe symbols
tokens = [t.strip() for t in tokens]
accession_num = tokens[1] # ASSUMES that it is in the 2nd slot
# might implement some check here on validity of the number captured...
else:
# assume the whole thing is the Accession Number (per example)
accession_num = curr_header.replace('>','').strip() # remove the > and any preceeding whitespace
elif accession_num: # only need to go here if we already have a number
# keep building the sequence
if curr_sequence is None:
# start a new one
curr_sequence = ''
if len(line) > 0:
curr_sequence += line.strip() # the strip() removes whitespace and newline chars at ends
line = src.readline() # <- this will be None when empty, killing the loop
# need to add the last one
if curr_header and curr_sequence:
sequences[accession_num] = (curr_header, curr_sequence)
for k in sequences: # for testing...
print(k, sequences[k])
stuff_of_interest = {'k1': {'JH880', '5688'},
'k2': {'4456', '99999999999'},
'k3': {'4456', '5688', 'XR650L', 'JH880', 'MN5502'}}
# match stuff_of_interest to sequences and write results
for k in stuff_of_interest.keys():
matched_keys = stuff_of_interest[k] & sequences.keys()
# check for things that matched nothing: a problem?
non_matches = stuff_of_interest[k] - sequences.keys()
if non_matches:
print(f'failed to match these accession numbers for key {k}: {non_matches}')
if matched_keys:
# open a file and write it...
f_name = '.'.join([k, 'fasta'])
with open(f_name, 'w') as f_out:
for k in matched_keys:
f_out.write(sequences[k][0])
f_out.write('\n')
# and the sequence
f_out.write(sequences[k][1])
f_out.write('\n')
> sp|4456
ABCDEFGHIJK
>sp|XR650L | other stuff
IIJJHYUXYHHGGF
> MN5502
UUUMNNMNNNNDDDSSSTTYUAB
> sp| 5688 | stuff
DEFGHAEAGHHGAHAKRKABC
> pir | JH880
BBBBDDDEEEFFFF