CS50 Week 6 DNA program misidentifies DNA sequences

Question

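My code mostly works, but it is selective about which inputs it gets right. It prints the correct name for some sequences and the wrong one for others.

For example, it correctly identifies a strand as belonging to Bob, but it matches a strand that is supposed to be "No match" to "Charlie", who isn't even in the list CS50 gave us.

This is really strange. I checked my code against other people's and they seem mostly similar. I don't know why this is happening; any help would be appreciated.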

import csv
import sys

def main():

    # TODO: Check for command-line usage
    if len(sys.argv) != 3:
        sys.exit("Usage: python dna.py data.csv sequence.txt")

    # TODO: Read database file into a variable
    database = []

    with open(sys.argv[1], 'r') as file:
        reader = csv.DictReader(file)

        for row in reader:
            database.append(row)
 
    # TODO: Read DNA sequence file into a variable
    with open(sys.argv[2], 'r') as file:
        dna_sequence = file.read()

    # TODO: Find longest match of each STR in DNA sequence
    subsequences = list(database[0].keys())[1:]

    results = {}
    for subsequence in subsequences:
        match = 0
        results[subsequence] = longest_match(dna_sequence, subsequence)
        match += 1

    # TODO: Check database for matching profiles
    for person in database:
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1
        
            if match == len(subsequence):
                print(person["name"])
                return 

    print("No match")
    return


def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within
        # sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1
        
            # If there is no match in the substring
            else:
                break
    
        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run

main()

Answer 1

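Are you still working on this? If so, there are 2 databases and 20 sequences to test with. (They are listed, with the correct answers, at the end of the DNA pset.) Which one gives the error described above? I suspect it is the third test. It says to run your program as

python dna.py databases/small.csv sequences/3.txt

and that your program should output

No match

When I do that, your program outputs

Charlie

instead of

No match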
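The subsequences you need to check are:

['AGATC', 'AATG', 'TATC']

Your subsequence counts are:

{'AGATC': 3, 'AATG': 3, 'TATC': 5}

This doesn't match anyone in the small.csv file.
Charlie is close, but his DNA subsequence counts are:

('AGATC', '3'), ('AATG', '2'), ('TATC', '5')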
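
To see exactly where the two profiles diverge, the counts quoted above can be compared key by key. This is only an illustrative sketch: the dictionaries are typed in by hand from the values shown in this answer, and the names `results`, `charlie` and `mismatches` are placeholders rather than anything taken from the asker's program.

```python
# Counts computed from sequences/3.txt, as quoted above
results = {"AGATC": 3, "AATG": 3, "TATC": 5}

# Charlie's row as csv.DictReader yields it: every value is a string
charlie = {"name": "Charlie", "AGATC": "3", "AATG": "2", "TATC": "5"}

# Collect the STRs whose counts differ (convert the CSV strings to int first)
mismatches = [s for s in results if int(charlie[s]) != results[s]]
print(mismatches)  # ['AATG']
```

Two of the three counts agree, which is why Charlie is the near miss here.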

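The error occurs when you compare each person against the subsequence counts. There are 3 things to fix:

1. The value of `match` is carried over from the previous loop (`for subsequence in subsequences:`). It needs to be set inside the `for person in database:` loop.
2. The indentation of the test on `match` needs to change. (Right now it is inside the second `for subsequence in subsequences:` loop.)
3. You are testing `match` against `len(subsequence)`. Think about it....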
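
For the third point, a quick check in the interpreter may help. It prints the length of each STR string next to the number of STRs in the list. The list is the one quoted earlier for small.csv, and `lengths` is just a scratch name for illustration.

```python
subsequences = ["AGATC", "AATG", "TATC"]

# len() of one STR counts its characters, not how many STRs there are
lengths = {s: len(s) for s in subsequences}
print(lengths)            # {'AGATC': 5, 'AATG': 4, 'TATC': 4}

# The number of STRs a person has to match is the length of the list
print(len(subsequences))  # 3
```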

I made these changes and it works for all 4 small.csv tests and the 3 large.csv tests I tried.
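
One way the three changes might look once applied is sketched below. `find_match` is a hypothetical helper name (the original program does this work inline in `main`), and the one-row database is typed in from Charlie's counts quoted above, so treat it as an illustration under those assumptions rather than a drop-in replacement.

```python
def find_match(database, subsequences, results):
    """Return the name of the first person whose STR counts all match, else None."""
    for person in database:
        # 1. Reset the counter for every person
        match = 0
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1

        # 2. Test only after every STR has been compared
        # 3. Compare against the number of STRs, not the length of one STR string
        if match == len(subsequences):
            return person["name"]
    return None


subsequences = ["AGATC", "AATG", "TATC"]
database = [{"name": "Charlie", "AGATC": "3", "AATG": "2", "TATC": "5"}]

# Counts quoted above for sequences/3.txt: AATG differs, so nobody matches
print(find_match(database, subsequences, {"AGATC": 3, "AATG": 3, "TATC": 5}))  # None

# Counts equal to Charlie's row
print(find_match(database, subsequences, {"AGATC": 3, "AATG": 2, "TATC": 5}))  # Charlie
```

In `main`, the caller would print the returned name, or `No match` when the result is `None`.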
