优化 for 循环以使用字典值格式化新的 fasta 文件

问题描述 投票:0回答:2

我有一个 for 循环可以满足我的需要,但是,我想知道该循环是否可以进一步优化。

我有一本字典,其中每个键都有一个与其关联的值列表。值列表基本上是入藏号。对于每个键,id 喜欢循环遍历我的 fasta 文件,并创建一个新的 fasta 文件,其中仅包含每个给定键的值列表中的入藏号序列。

到目前为止我的循环是休闲的......

seqs=open('file_name.fasta', 'r')
seq_lines=seqs.readlines()

for i in d.keys():
    fasts=open(i,'w')
    for index, line in enumerate(seq_lines):
        for j in d[i]: #loops thru the accession number (ie lis of values for each key) 
            if (j in line): #checks to see if acc number is in fasta file
                header=seq_lines[index] #the header from fasta file
                nuc=seq_lines[index+1] # The sequence after each header
                fasts.write(header.rstrip()+'\n'+nuc)

d 是我的字典,seq_lines 是 fasta 文件。这个循环工作得很好,但是当任何给定键有大量序列时(fasta 文件总共有大约 50000 seq),它会很慢。感谢任何意见。

数据示例

这是带有值列表的字典 d={'范围1': ['MZ193866', 'OK234148'], '范围2': ['MZ371332', 'MZ315819']}

Fasta 文件,标题中包含值(它是一个文本文件,不知道如何直接上传)

seq='>MZ193866 啊啊啊啊啊啊啊啊啊 >MZ371332 GGGGCTTCTTTTCCA > OK234148 啊啊啊啊啊啊 >MZ315819 ATTTTCCCCGGGGAAAAA'

最后我希望在键后面有文件名,仅包含其值列表中的序列...

范围1 = '>MZ193866 啊啊啊啊啊啊啊啊啊 > OK234148 啊啊啊啊啊啊啊

范围2 = '>MZ371332 GGGGCTTCTTTTCCA >MZ315819 ATTTTCCCCGGGGAAAAA'

python loops optimization fasta
2个回答
1
投票

您可以按每个标题进行分组,提取入藏号并将其用作 fasta 记录的索引。现在,您可以在索引中查找,而不是扫描所有记录来查找每个感兴趣的数字。

在这里,我使用了正则表达式来解析标头 - 它非常特定于这种 fasta 格式。具有多个值的 fasta 会更复杂。

from itertools import groupby
import re

# single column fasta header with optional whitespace before >
hdr_match = re.compile(r"\s*>").match
hdr_extract = re.compile(r"\s*>\s*([^\s]+)").match

# test data
d={'range1': ['MZ193866', 'OK234148'], 'range2': ['MZ371332', 'MZ315819']}

open("file_name_test.fasta", "w").write(""">MZ193866 
 ATGTTTTATTATTT 
 >MZ371332 
 GGGGCTTCTATTTCCA 
 > OK234148 
 AGGGCTTTTAAAAA 
 >MZ315819 
 ATTTTCCCCGGGGAAAAA
 """)

# fasta header is a single (whatever its called)
# e.g., " >MZ193866 \n" (note the non-conforming space starting some headers)

fasta_index = {}

for is_hdr, lines in groupby(open('file_name_test.fasta'), hdr_match):
    if is_hdr:
        hdr = hdr_extract(next(lines)).group(1)
    else:
        fasta_index[hdr] = "".join(line.strip() for line in lines)

for name, values in d.items():
    with open(name, "w") as fasts:
        for value in values:
            if value in fasta_index:
                fasts.write(f">{value}\n{fasta_index[value]}\n")

print("------- output -------")
for name in d:
    print(name)
    print(open(name).read())
    print("--------")

1
投票

这里有一些学习点/促进剂。几个关键概念:

  1. 限制循环。现在,您正在循环遍历

    d
    中每个键的所有序列数据。因此,如果您有 50k 个序列和 100 个键,则您将执行 5M 次循环。您应该(并且可以)将所有内容循环一次,总共执行 50,100 次

  2. 在字符串中搜索字符串是昂贵的。现在,您正在搜索所有内容(包括序列)来查找键值,这可能容易出错并且成本肯定很高。与起始角色

    >
    一起识别感兴趣的线条,并仔细处理它们。

  3. (稍微回到#1)循环遍历匹配列表是昂贵的。在您当前的代码中,您实际上是 3 嵌套在 for 循环中,即(继续前面的示例)50k x 100(感兴趣的键)x 10k(访问号)~ 50B 执行。尽可能使用集合来匹配事物组,这几乎是瞬时的(更长的讨论)。

  4. 只有当您已经知道有东西要写时,您才应该打开文件。而且,正如评论中提到的,您需要关闭它们。

    with
    上下文管理器会为您处理该部分。

因此,下面的代码将所有序列数据加载到内存中,这对于 50K 序列应该很容易实现,除非序列非常大。

它还有一个用于标头的解析器,用于查找管道符号之间的登录号,但我刚刚根据上面的示例数据将其设置为“可选”,因此它将查找它们,但如果不存在,则假设标头入藏号。注意:如果您无法从标头中的其他内容中清晰地解析出数字,您可能需要恢复到字符串搜索,但我猜标头结构良好。

数据文件(dummy.fasta)(奇数间距只是为了测试

> sp|4456
ABCDEFGHIJK
> sp| 5688 | stuff
DEFGHAEAGHHGAHA
KRKABC
>sp|XR650L   | other stuff
IIJJHYUXY
HHGGF


>   pir | JH880
BBBBDDDEEEFFFF

>pir|777|nonsense
EEEFFFHHHJJINIIIJIJIJI

> MN5502
UUUMNNMNNNNDDD
SSSTTYUAB

代码:

# capture all the data into a dictionary...
sequences = {}  # accession : (header, sequence)
with open('dummy.fasta', 'r') as src:
    # set up a loop
    accession_num = None
    curr_sequence = None
    line = src.readline()
    while line:
        # search for ">"
        if line[0] == '>':
            # if we were building a sequence, it is now done, add it...
            if curr_sequence:
                sequences[accession_num] = (curr_header, curr_sequence)
                accession_num = None
                curr_sequence = None

            # found a new header, now get the accession number
            curr_header = line.strip()
            tokens = line.split('|')  # tokenize the line based on pipe symbol
            if len(tokens) > 1:  # found pipe symbols
                tokens = [t.strip() for t in tokens]
                accession_num = tokens[1]  # ASSUMES that it is in the 2nd slot
            # might implement some check here on validity of the number captured...
            else:
                # assume the whole thing is the Accession Number (per example)
                accession_num = curr_header.replace('>','').strip()  # remove the > and any preceeding whitespace

        elif accession_num:  # only need to go here if we already have a number
            # keep building the sequence
            if curr_sequence is None:
                # start a new one
                curr_sequence = ''
            if len(line) > 0:
                curr_sequence += line.strip()  # the strip() removes whitespace and newline chars at ends
        line = src.readline()  # <- this will be None when empty, killing the loop

    # need to add the last one
    if curr_header and curr_sequence:
        sequences[accession_num] = (curr_header, curr_sequence)
for k in sequences: # for testing...
    print(k, sequences[k])
stuff_of_interest = {'k1': {'JH880', '5688'},
                     'k2': {'4456', '99999999999'},
                     'k3': {'4456', '5688', 'XR650L', 'JH880', 'MN5502'}}

# match stuff_of_interest to sequences and write results
for k in stuff_of_interest.keys():
    matched_keys = stuff_of_interest[k] & sequences.keys()
    # check for things that matched nothing:  a problem?
    non_matches = stuff_of_interest[k] - sequences.keys()
    if non_matches:
        print(f'failed to match these accession numbers for key {k}: {non_matches}')
    if matched_keys:
        # open a file and write it...
        f_name = '.'.join([k, 'fasta'])
        with open(f_name, 'w') as f_out:
            for k in matched_keys:
                f_out.write(sequences[k][0])
                f_out.write('\n')

                # and the sequence
                f_out.write(sequences[k][1])
                f_out.write('\n')

k3.fasta(输出文件之一)

> sp|4456
ABCDEFGHIJK
>sp|XR650L   | other stuff
IIJJHYUXYHHGGF
> MN5502
UUUMNNMNNNNDDDSSSTTYUAB
> sp| 5688 | stuff
DEFGHAEAGHHGAHAKRKABC
>   pir | JH880
BBBBDDDEEEFFFF
© www.soinside.com 2019 - 2024. All rights reserved.