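I essentially have to write my own Biopython: something that can read DNA, transcribe it, and translate it. I've got it to the point where it can do all of that, but I can't figure out how to make the program read codons in groups of 3 after 'atg' until it reaches a stop codon. Right now it just finds a start codon and then finds the nearest stop codon, without counting by 3. Can anyone help me with this? Sorry if this doesn't make sense.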
#locate start codons
startcodon=0
n=0
while(n < 1):
    startcodon=dataset.find("atg", startcodon, len(dataset)-startcodon)
    #locate stop codons
    taacodon=dataset.find("taa", startcodon+3, len(dataset)-startcodon)
    tagcodon=dataset.find("tag", startcodon+3, len(dataset)-startcodon)
    tgacodon=dataset.find("tga", startcodon+3, len(dataset)-startcodon)
    if(taacodon<tagcodon):
        if(taacodon<tgacodon):
            stopcodon=taacodon
            #print("taacodon", startcodon)
        else:
            stopcodon=tgacodon
            #print("tGacodon", startcodon)
    elif(tgacodon>tagcodon):
        stopcodon=tagcodon
        #print("taGcodon", startcodon)
    else:
        stopcodon=tgacodon
        #print("tGacodon", startcodon)
    #to add sequences to an array
    codon.append(dataset[startcodon:stopcodon+3])
    if(startcodon > len(dataset) or startcodon < 0):
        n = 2
    startcodon=stopcodon
#reverse the string and swap the letters
n=0
while(n < len(codon)):
    rcodon.append(codon[n][len(codon[n])::-1])
    #replace a with u
    rcodon[n] = re.sub('a', "u", rcodon[n])
    #replace t with a
    rcodon[n] = re.sub('t', "a", rcodon[n])
    #replace c with x
    rcodon[n] = re.sub('c', "x", rcodon[n])
    #replace g with c
    rcodon[n] = re.sub('g', "c", rcodon[n])
    #replace x with g
    rcodon[n] = re.sub('x', "g", rcodon[n])
    print("DNA sequence: ", codon[n] ,'\n', "RNA sequence:", rcodon[n])
    n=n+1
answer = 0
print("Total Sequences: ", len(codon)-3)
while (int(answer) >=0):
    #str = "Please enter an integer from 0 to " + str(len(dataset)) + " or -1 to quit: "
    answer = int(input("Please enter a sequence you would like to see or -1 to quit: "))
    if(int(answer) >= 0):
        print("DNA sequence: ", codon[int(answer)] ,'\n', "RNA sequence:", rcodon[int(answer)])
        dna = codon[int(answer)]
        #dna codon table
        protein = {"ttt" : "Phe-", "ctt" : "Leu-", "att" : "Ile-", "gtt" : "Val-",
                   "ttc" : "Phe-", "ctc" : "Leu-", "atc" : "Ile-", "gtc" : "Val-",
                   "tta" : "Leu-", "cta" : "Leu-", "ata" : "Ile-", "gta" : "Val-",
                   "ttg" : "Leu-", "ctg" : "Leu-", "atg" : "Met-", "gtg" : "Val-",
                   "tct" : "Ser-", "cct" : "Pro-", "act" : "Thr-", "gct" : "Ala-",
                   "tcc" : "Ser-", "ccc" : "Pro-", "acc" : "Thr-", "gcc" : "Ala-",
                   "tca" : "Ser-", "cca" : "Pro-", "aca" : "Thr-", "gca" : "Ala-",
                   "tcg" : "Ser-", "ccg" : "Pro-", "acg" : "Thr-", "gcg" : "Ala-",
                   "tat" : "Tyr-", "cat" : "His-", "aat" : "Asn-", "gat" : "Asp-",
                   "tac" : "Tyr-", "cac" : "His-", "aac" : "Asn-", "gac" : "Asp-",
                   "taa" : "STOP", "caa" : "Gln-", "aaa" : "Lys-", "gaa" : "Glu-",
                   "tag" : "STOP", "cag" : "Gln-", "aag" : "Lys-", "gag" : "Glu-",
                   "tgt" : "Cys-", "cgt" : "Arg-", "agt" : "Ser-", "ggt" : "Gly-",
                   "tgc" : "Cys-", "cgc" : "Arg-", "agc" : "Ser-", "ggc" : "Gly-",
                   "tga" : "STOP", "cga" : "Arg-", "aga" : "Arg-", "gga" : "Gly-",
                   "tgg" : "Trp-", "cgg" : "Arg-", "agg" : "Arg-", "ggg" : "Gly-"
                   }
        protein_sequence = ""
        # Generate protein sequence
        for i in range(0, len(dna)-(3+len(dna)%3), 3):
            protein_sequence += protein[dna[i:i+3]]
        # Print the protein sequence
        print ("Protein Sequence: ", protein_sequence)
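The DNA sequence I've been using starts with "gtcagaaaagccctctccatgtctactcacgatacatccctgaaaaccactgaggaagtggcttttcagatcatcttgctttgccagtttggggttgggactttgccaatgtatttc", so it doesn't begin with atg and the program has to search for it. Thanks in advance for any advice.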
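"Right now it just finds a start codon and then finds the nearest stop codon, without counting by 3."
If you want to search for a substring that is aligned to a particular frame (i.e. one that starts at an index divisible by 3), you can first split the string into equal-sized chunks and then search the resulting list for a matching chunk.
For example: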
dataset_codons = [dataset[i:i+3] for i in range(0, len(dataset), 3)]
# "ggtcagaaaaagccctctcca" becomes ["ggt", "cag", "aaa", "aag", "ccc", "tct", "cca"]
# The lines below belong inside the existing while loop, hence the break.
try:
    # No end bound: search from startcodon to the end of the list.
    startcodon = dataset_codons.index('atg', startcodon)
except ValueError:
    break  # no more start codons found
(Note that startcodon will now be the index of the chunk that matches atg, which is exactly one third of the corresponding string index.)
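To make the relation between the chunk index and the string index concrete, here is a small sketch. The sequence is a made-up toy input (not the asker's data), chosen so that atg happens to sit on a frame-0 boundary; the names chunk_index and string_index are only for illustration.

```python
# Hypothetical toy sequence; "atg" starts at string index 6 (a multiple of 3).
dataset = "ggtcagatgaaataacca"

# Split into frame-0 chunks of three bases.
dataset_codons = [dataset[i:i + 3] for i in range(0, len(dataset), 3)]

chunk_index = dataset_codons.index("atg")  # position in the list of chunks
string_index = 3 * chunk_index             # position in the original string

print(dataset_codons)
print(chunk_index, string_index)
```

Multiplying by 3 is all it takes to get back to a position you can slice the original string with.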
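Edit: if the stop codon only has to be in the same frame as its start codon, but the start codon can be anywhere, it gets a bit trickier. In that case you can keep searching for the stop codon until you find one at a suitable index.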
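def find_codon(codon, string, start):
    i = start + 3
    while i < len(string):
        i = string.find(codon, i)  # find the next occurrence
        if i == -1:
            break  # no further occurrence in the string
        if (i - start) % 3 == 0:  # check that it's a multiple of 3 after start
            return i
        i += 1  # out of frame: resume just past this match
    return None

startcodon = dataset.find("atg", startcodon)
# locate stop codons
taacodon = find_codon("taa", dataset, startcodon)
tagcodon = find_codon("tag", dataset, startcodon)
tgacodon = find_codon("tga", dataset, startcodon)
# Ignore stop codons that were not found (None) before taking the nearest one.
found = [c for c in (taacodon, tagcodon, tgacodon) if c is not None]
stopcodon = min(found) if found else None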
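For a fuller picture, the sketch below collects every start-to-stop stretch in one pass. It is only an illustration under assumptions: the names STOP_CODONS, first_in_frame_stop and find_orfs are made up here, the toy sequence is not the asker's data, and it steps through the string three bases at a time rather than calling str.find repeatedly, which is a variant of the same idea rather than the code above.

```python
STOP_CODONS = ("taa", "tag", "tga")

def first_in_frame_stop(seq, start):
    """Return the index of the first stop codon in frame with `start`, or None."""
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

def find_orfs(seq):
    """Collect start-to-stop substrings (stop included), scanning left to right."""
    orfs = []
    start = seq.find("atg")
    while start != -1:
        stop = first_in_frame_stop(seq, start)
        if stop is None:
            # No in-frame stop for this start; try the next "atg".
            start = seq.find("atg", start + 1)
            continue
        orfs.append(seq[start:stop + 3])
        # Resume the search after the stop codon (one possible policy).
        start = seq.find("atg", stop + 3)
    return orfs

# Hypothetical toy sequence; the "tga" at index 3 is out of frame and is skipped.
print(find_orfs("ccatgaaatgaggatgccctaatt"))
```

Resuming after the stop codon mirrors what the question's loop does with startcodon=stopcodon; if overlapping reading frames matter to you, resume at start + 1 instead.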
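By the way, I'm not sure I understand the purpose of the argument len(dataset)-startcodon correctly. The third argument to str.find() specifies the end of the search range within the string, which means that as startcodon increases, the search will stop further and further short of the actual end of the dataset.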
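A quick way to see the effect of that third argument, using a made-up 24-base sequence (not the asker's data):

```python
dataset = "ccatgaaatgaggatgccctaatt"  # hypothetical 24-base toy sequence

start = 13
# The end bound shrinks as start grows: here only dataset[13:11] is searched,
# which is empty, so the "atg" at index 13 is missed.
bounded = dataset.find("atg", start, len(dataset) - start)
# Without the third argument the search runs to the end of the string.
unbounded = dataset.find("atg", start)

print(bounded, unbounded)
```

Once start passes the midpoint of the string, the bounded search range is empty and find() can only return -1.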
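If you want to find the stop codon by counting in threes after the start codon, you can split the DNA string that follows the start codon into codons and then check whether the resulting list contains a stop codon.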
sequence = "ggtcagaaaaagccctctccatgtctactcacgatacatccctgaaaaccactgaggaagtggcttttcagatcatcttgctttgccagtttggggttgggacttttgccaatgtatttc"
startcodon = 0
length = len(sequence)
startcodon = sequence.find("atg", startcodon)
# List comprehension; codon_list could equally be built in a for loop.
# length-2 keeps the last complete in-frame codon.
codon_list = [sequence[i:i+3] for i in range(startcodon+3, length-2, 3)]
# Use list.index() to find out whether there is a stop codon in codon_list;
# list.index() raises ValueError if the value is not in the list.
try:
    # codon_list[0] starts at startcodon+3, so add that offset back.
    taacodon = startcodon + 3 + 3 * codon_list.index('taa')
    print('taa codon is at {}'.format(taacodon))
except ValueError:
    print('taa codon is not in the list')
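I think splitting the DNA sequence is better than using str.find, because you need to search for the stop codon by counting in threes; with str.find you still have to check whether the stop codon it found lies at a distance from the start that is a multiple of 3.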
Edit: at the moment I don't know whether creating a new list of strings is more expensive than searching in the original string.
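One way to find out for your own data is to time both approaches with timeit. The sketch below is only a rough harness: the input is a synthetic sequence invented for the test, stop_by_split and stop_by_find are hypothetical helper names, and the timings will vary by machine and by how many out-of-frame matches the real sequence contains.

```python
import timeit

# Hypothetical 6006-base input: a start codon, 2000 "gca" codons, then "taa".
sequence = "atg" + "gca" * 2000 + "taa"

def stop_by_split(seq, start):
    """Split into in-frame codons, then use list.index()."""
    codons = [seq[i:i + 3] for i in range(start + 3, len(seq) - 2, 3)]
    try:
        return start + 3 + 3 * codons.index("taa")
    except ValueError:
        return None

def stop_by_find(seq, start):
    """Search the original string and skip out-of-frame matches."""
    i = seq.find("taa", start + 3)
    while i != -1:
        if (i - start) % 3 == 0:
            return i
        i = seq.find("taa", i + 1)
    return None

# Both approaches should agree before their speed is compared.
assert stop_by_split(sequence, 0) == stop_by_find(sequence, 0)
print("split:", timeit.timeit(lambda: stop_by_split(sequence, 0), number=200))
print("find: ", timeit.timeit(lambda: stop_by_find(sequence, 0), number=200))
```

For sequences of a few thousand bases either approach is likely fast enough, so readability may be the better deciding factor.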