Finding the most repeated substring in a dictionary of DNA sequences

Question

The substring must be 6 characters long. The number I'm getting is smaller than it should be.

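First, I wrote code to read the sequences from a file and put them into a dictionary. Then I wrote 3 nested for loops. The first iterates over the dictionary and gets one sequence per iteration. The second takes each sequence and extracts a 6-character substring from it; on each iteration it increases the start index within the string (the long sequence) by 1. The third takes each substring from the second loop and counts how many times it appears in each string (long sequence).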

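I've tried rewriting the code several times. I think I'm close. I checked that the loops really do iterate, and they do. I even manually checked that the count of a substring in a random sequence matches what the program gives, and it does. Any ideas? Maybe a different approach?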

I've added a file with 3 shortened sequences for testing. Maybe try a smaller substring, say 3 characters instead of 6: rep_len = 3

Code

matches = []
count = 0
final_count = 0
rep_len = 6
repeat = ''
pos  = 0
seq_count = 0
seqs = {}
f = open(r"file.fasta")
# insert each sequence from the file into a dictionary
for line in f:
    line = line.rstrip()
    if line[0] == '>':
        seq_count += 1
        name = seq_count
        seqs[name] = ''
    else:
        seqs[name] += line
for key, seq in seqs.items():  # getting one sequence in each iteration
    for pos in range(len(seq)):  # setting an index and increasing it by 1 in each iteration
        if pos <= len(seq) - rep_len:  # skip positions too close to the end to hold a full substring
            repeat = seq[pos:pos + rep_len] # setting a substring
            if repeat not in matches: # checking if the substring was already scanned
                matches.append(repeat) # adding the substring to previously checked substrings' list
                for key1, seq2 in seqs.items(): # iterating over each sequence
                    count += seq2.count(repeat) # counting the substring's repetitions
                if count > final_count:  # if the count is greater than the previously saved maximum
                    final_count = count # the new value is saved
                count = 0    
print('repetitions: ', final_count) # printing

sequences.fasta

python biopython
1 Answer

The code isn't very clear, so it's a bit hard to debug. I'd suggest rewriting it.
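One possible shape for such a rewrite is sketched below. It is only an illustration, not the asker's code: the function name `most_repeated` and the toy sequences are made up, and it assumes the sequences are already in a dictionary like `seqs`. It tallies every window with `collections.Counter` in a single pass. Note that this counts overlapping occurrences, whereas `str.count` counts only non-overlapping ones, so the totals can differ from those of the original loop.

```python
from collections import Counter


def most_repeated(seqs, rep_len=6):
    """Return (substring, count) for the most frequent substring of length
    rep_len across all sequences, counting overlapping occurrences."""
    counts = Counter()
    for seq in seqs.values():
        # range(len(seq) - rep_len + 1) includes the final full window
        # and is empty when the sequence is shorter than rep_len
        for pos in range(len(seq) - rep_len + 1):
            counts[seq[pos:pos + rep_len]] += 1
    if not counts:
        return None, 0
    return counts.most_common(1)[0]


# Hypothetical toy data, not taken from sequences.fasta
toy_seqs = {1: "ATGATGATG", 2: "GATGA"}
print(most_repeated(toy_seqs, rep_len=3))  # ('ATG', 4)
```

Keeping the counting in one function makes it easy to test on short strings before running it on the real file.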

Anyway, so far I've only noticed one small bug:

        if pos < len(seq) - rep_len:

should be

        if pos <= len(seq) - rep_len:

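As it stands, the last character of each sequence is ignored, because the final window (the one starting at index len(seq) - rep_len) is never selected.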
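The difference between the two conditions is easy to see on a short string. The snippet below is a small sketch using a made-up 8-character sequence; it lists the windows each condition selects.

```python
seq = "ACGTACGT"  # hypothetical 8-character sequence for illustration
rep_len = 6

# Windows selected with the strict comparison (<)
strict = [seq[p:p + rep_len] for p in range(len(seq))
          if p < len(seq) - rep_len]

# Windows selected with the inclusive comparison (<=)
inclusive = [seq[p:p + rep_len] for p in range(len(seq))
             if p <= len(seq) - rep_len]

print(strict)     # ['ACGTAC', 'CGTACG']
print(inclusive)  # ['ACGTAC', 'CGTACG', 'GTACGT']
```

A sequence of length n has n - rep_len + 1 full windows, which is what the inclusive comparison yields.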
