CS50 Week 6 DNA: program misidentifies DNA sequences


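My code mostly works, but it is selective about which inputs it gets right. It gives the correct name for certain sequences and the wrong result for others.

For example, it correctly identifies one strand as belonging to Bob, but it matches a strand that should be "No match" to "Charlie", who does not even appear in the list CS50 gave us.

This is really strange. I checked my code against other people's and it looks mostly the same. I don't know why this is happening and would appreciate any help.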

import csv
import sys

def main():

    # TODO: Check for command-line usage
    if len(sys.argv) != 3:
        sys.exit("Usage: python dna.py data.csv sequence.txt")

    # TODO: Read database file into a variable
    database = []

    with open(sys.argv[1], 'r') as file:
        reader = csv.DictReader(file)

        for row in reader:
            database.append(row)
 
    # TODO: Read DNA sequence file into a variable
    with open(sys.argv[2], 'r') as file:
        dna_sequence = file.read()

    # TODO: Find longest match of each STR in DNA sequence
    subsequences = list(database[0].keys())[1:]

    results = {}
    for subsequence in subsequences:
        match = 0
        results[subsequence] = longest_match(dna_sequence, subsequence)
        match += 1

    # TODO: Check database for matching profiles
    for person in database:
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1
        
            if match == len(subsequence):
                print(person["name"])
                return 

    print("No match")
    return


def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1
        
            # If there is no match in the substring
            else:
                break
    
        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run

main()
python cs50 dna-sequence
1 Answer

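Are you still working on this? If so, there are 2 databases and 20 sequences to test. (The correct answers are listed at the end of the DNA PSET.) Which one produces the error described above? I suspect it is the third test. It says to run your program as

python dna.py databases/small.csv sequences/3.txt
and that your program should output
No match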

When I do that, your program outputs

Charlie
instead of
No match

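The subsequences you need to check are:
['AGATC', 'AATG', 'TATC']

Your subsequence counts are:
{'AGATC': 3, 'AATG': 3, 'TATC': 5}

That does not match anyone in the small.csv file.
Charlie is close, but his DNA subsequence counts are:
('AGATC', '3'), ('AATG', '2'), ('TATC', '5')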
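
To see how a near miss can still print Charlie, here is a minimal sketch that isolates the comparison logic from the question into a function. The function name buggy_match and the variable names small_db, strs and counts are hypothetical. Charlie's row and the counts are quoted above; the rows for Alice and Bob are what I believe small.csv contains and should be treated as an assumption.

```python
def buggy_match(database, results, subsequences):
    """Mirror the question's comparison logic (hypothetical helper name)."""
    # In the question, match is reset and then incremented on every pass of
    # the first loop, so it is left at 1 when that loop finishes.
    for subsequence in subsequences:
        match = 0
        match += 1

    for person in database:
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1
            # Tested inside the inner loop, against the length of the STR
            # string itself (5 for 'AGATC', 4 for 'AATG' and 'TATC').
            if match == len(subsequence):
                return person["name"]
    return "No match"


strs = ["AGATC", "AATG", "TATC"]
counts = {"AGATC": 3, "AATG": 3, "TATC": 5}

# Alice and Bob rows are assumed; only Charlie's row is quoted in this answer.
small_db = [
    {"name": "Alice", "AGATC": "2", "AATG": "8", "TATC": "3"},
    {"name": "Bob", "AGATC": "4", "AATG": "1", "TATC": "5"},
    {"name": "Charlie", "AGATC": "3", "AATG": "2", "TATC": "5"},
]

print(buggy_match(small_db, counts, strs))
```

Under those assumed rows, match carries over from person to person: it is 1 after the first loop, Bob's TATC raises it to 2, and Charlie's AGATC and TATC raise it to 4, which happens to equal len('TATC').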

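The error occurs when you compare each person against the subsequence counts. There are 3 things to fix:

  1. The value of match is set in the previous loop (for subsequence in subsequences:). It needs to be set inside the for person in database: loop.
  2. The indentation of the test on match needs to change. (It is currently inside the second for subsequence in subsequences: loop.)
  3. You are testing match against len(subsequence). Think about that....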
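
The three fixes above can be sketched as follows. This is one possible reading of them, not necessarily the exact code I ran. The function name find_match and the sample variables are hypothetical, and the sample database holds only Charlie's row, which is the one quoted in this answer.

```python
def find_match(database, results, subsequences):
    """Return the name of the first person whose STR counts all match."""
    for person in database:
        # Fix 1: reset the counter for every person.
        match = 0
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1
        # Fix 2: test only after all STRs for this person were compared.
        # Fix 3: compare against the number of STRs, not the length of one.
        if match == len(subsequences):
            return person["name"]
    return "No match"


strs = ["AGATC", "AATG", "TATC"]
charlie_only = [{"name": "Charlie", "AGATC": "3", "AATG": "2", "TATC": "5"}]

# Counts reported for sequences/3.txt in this answer: AATG differs from Charlie.
print(find_match(charlie_only, {"AGATC": 3, "AATG": 3, "TATC": 5}, strs))
# Counts identical to Charlie's row.
print(find_match(charlie_only, {"AGATC": 3, "AATG": 2, "TATC": 5}, strs))
```

In main(), the same loop would print the name and return, then fall through to print("No match") after the outer loop, as in the original.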

I made these changes, and it works for all 4 small.csv tests and for the 3 large.csv tests I tried.
