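My code mostly works, but it is selective about what it works on. It gives the correct name for certain sequences and gets others wrong.
For example, it correctly identifies one sequence as belonging to Bob, but it matches a supposed "No match" sequence to "Charlie", who isn't even in the list CS50 gave us.
This is really strange. I've checked my code against other people's and they look mostly the same. I don't know why this happens, and I'm hoping someone can help.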
import csv
import sys


def main():

    # TODO: Check for command-line usage
    if len(sys.argv) != 3:
        sys.exit("Usage: python dna.py data.csv sequence.txt")

    # TODO: Read database file into a variable
    database = []
    with open(sys.argv[1], 'r') as file:
        reader = csv.DictReader(file)
        for row in reader:
            database.append(row)

    # TODO: Read DNA sequence file into a variable
    with open(sys.argv[2], 'r') as file:
        dna_sequence = file.read()

    # TODO: Find longest match of each STR in DNA sequence
    subsequences = list(database[0].keys())[1:]
    results = {}
    for subsequence in subsequences:
        match = 0
        results[subsequence] = longest_match(dna_sequence, subsequence)
        match += 1

    # TODO: Check database for matching profiles
    for person in database:
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1
            if match == len(subsequence):
                print(person["name"])
                return
    print("No match")
    return


def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1

            # If there is no match in the substring
            else:
                break

        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run


main()
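Are you still working on this? If so, there are 2 databases and 20 sequences to test against. (The correct answers are listed at the end of the DNA PSET.) Which one gives the error above? I suspect it is the third test. It says to run your program as
python dna.py databases/small.csv sequences/3.txt
and that your program should output `No match`.
When I do that, your program outputs `Charlie` instead of `No match`. For that test, the STR list, the counts your program computes, and Charlie's database values are:
['AGATC', 'AATG', 'TATC']
{'AGATC': 3, 'AATG': 3, 'TATC': 5}
('AGATC', '3'), ('AATG', '2'), ('TATC', '5')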
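Comparing those counts with Charlie's values field by field shows why this should not be a match. The snippet below is only a small sketch using the numbers printed above; the variable names `results`, `charlie`, and `mismatches` are chosen here for illustration.

```python
# Values copied from the output above.
results = {"AGATC": 3, "AATG": 3, "TATC": 5}        # counts found in sequences/3.txt
charlie = {"AGATC": "3", "AATG": "2", "TATC": "5"}  # Charlie's row (csv values are strings)

# Collect every STR where the database value differs from the computed count.
mismatches = [s for s in results if int(charlie[s]) != results[s]]
print(mismatches)  # ['AATG']
```

Only two of the three STRs agree, so a correct comparison has to reject Charlie.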
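The error occurs when you compare each person against the STR counts. There are 3 things to fix:
1. The value of `match` is set in the previous loop (`for subsequence in subsequences:`). It needs to be inside the `for person in database:` loop.
2. The `match` code inside the second `for subsequence in subsequences:` loop.
3. `match` is tested against `len(subsequence)`. Think about it....
I made these changes and it works for all 4 small.csv tests and the 3 large.csv tests I tried.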
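One possible reading of those three fixes is sketched below. It is not necessarily the exact code the answer used. The helper name `find_match` is hypothetical, and the Alice and Bob rows are illustrative values assumed to mirror small.csv; only Charlie's row and the counts for sequence 3 appear in the output above. The counter is reset for each person, the test runs once per person after the inner loop, and it compares against the number of STRs (`len(subsequences)`) rather than the length of one STR string.

```python
def find_match(database, results, subsequences):
    """Return the first name whose STR counts all match, or None (hypothetical helper)."""
    for person in database:
        match = 0  # reset for every person, not carried over from an earlier loop
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1
        # Test once per person, against the number of STRs checked.
        if match == len(subsequences):
            return person["name"]
    return None


# Illustrative in-memory rows shaped like csv.DictReader output (values are strings).
database = [
    {"name": "Alice", "AGATC": "2", "AATG": "8", "TATC": "3"},    # assumed row
    {"name": "Bob", "AGATC": "4", "AATG": "1", "TATC": "5"},      # assumed row
    {"name": "Charlie", "AGATC": "3", "AATG": "2", "TATC": "5"},  # row from the output above
]
subsequences = ["AGATC", "AATG", "TATC"]

# Counts reported above for sequences/3.txt: no row matches on all three STRs.
name = find_match(database, {"AGATC": 3, "AATG": 3, "TATC": 5}, subsequences)
print(name if name else "No match")
```

In `main()`, the same logic would print the name and return, falling through to `print("No match")` when no person matches.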