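I am working on this Python problem:
Given a sequence of DNA bases {A, C, G, T} stored as a string, return a conditional probability table in a data structure such that one base (b1) can be looked up, and then a second base (b2), to get the probability p(b2 | b1) of the second base occurring immediately after the first. (Assume the length of seq is >= 3, and that the probability of any b1 and b2 which have never been seen together is 0. Ignore the probability that b1 will be followed by the end of the string.)
You may use the collections module, but no other libraries.
However, I have hit a roadblock: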
word = 'ATCGATTGAGCTCTAGCG'

def dna_prob2(seq):
    tbl = dict()
    levels = set(word)
    freq = dict.fromkeys(levels, 0)
    for i in seq:
        freq[i] += 1
    for i in levels:
        tbl[i] = {x:0 for x in levels}
    lastlevel = ''
    for i in tbl:
        if lastlevel != '':
            tbl[lastlevel][i] += 1
        lastlevel = i
    for i in tbl:
        print(i,tbl[i][i] / freq[i])
    return tbl

tbl['T']['T'] / freq[i]
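Basically, the end result is supposed to be the final line that you see above. However, when I try to do that in print(i, tbl[i][i] / freq[i]) and run dna_prob2(word), I get 0.0 for everything.
Wondering if anyone here can help out.
Thanks!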
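The zeros appear to come from the loop `for i in tbl`: it iterates over the keys of the dict (the four distinct bases), not over the sequence, so no base is ever paired with itself and every `tbl[i][i]` stays 0. A minimal sketch of the difference, where the helper name `count_pairs` is hypothetical and not part of the original post:

```python
def count_pairs(items, levels):
    """Count how often each element of `items` is immediately followed by another."""
    tbl = {a: {b: 0 for b in levels} for a in levels}
    last = None
    for cur in items:
        if last is not None:
            tbl[last][cur] += 1
        last = cur
    return tbl

word = 'ATCGATTGAGCTCTAGCG'
levels = set(word)

# Iterating over dict keys visits each distinct base once,
# so a base can never follow itself and the diagonal stays 0.
over_keys = count_pairs(dict.fromkeys(levels), levels)

# Iterating over the sequence counts the real adjacent pairs.
over_seq = count_pairs(word, levels)

print(over_keys['T']['T'])
print(over_seq['T']['T'])
```

Looping over `seq` instead of `tbl` in the counting step is the change that makes the counts reflect the actual string.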
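def makeprobs(word):
    singles = {}   # how often each base appears as b1 (last base excluded)
    probs = {}     # (b1, b2) -> p(b2 | b1)
    thedict = {}   # (b1, b2) -> number of times b2 directly follows b1
    ll = len(word)
    for i in range(ll-1):
        x1 = word[i]
        x2 = word[i+1]
        singles[x1] = singles.get(x1, 0)+1.0
        thedict[(x1, x2)] = thedict.get((x1, x2), 0)+1.0
    for i in thedict:
        probs[i] = thedict[i]/singles[i[0]]
    return probs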
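If the table has to be queried as `tbl[b1][b2]`, as the problem statement asks, the same counting can be arranged as a nested dict. The following is only a sketch using `collections`; the function name `cond_prob_table` and the `alphabet` parameter are assumptions for illustration, and unseen pairs are filled with 0 over the fixed alphabet ACGT.

```python
from collections import Counter

def cond_prob_table(seq, alphabet='ACGT'):
    # Count adjacent pairs (b1, b2); zip stops before the last base,
    # so a trailing b1 with no successor is ignored.
    pair_counts = Counter(zip(seq, seq[1:]))
    # Count how often each base appears as the first element of a pair.
    first_counts = Counter(seq[:-1])
    table = {}
    for b1 in alphabet:
        total = first_counts[b1]
        # Counter returns 0 for pairs never seen, which gives p = 0.
        table[b1] = {
            b2: (pair_counts[(b1, b2)] / total if total else 0.0)
            for b2 in alphabet
        }
    return table

word = 'ATCGATTGAGCTCTAGCG'
tbl = cond_prob_table(word)
print(tbl['T']['T'])  # p(T | T)
print(tbl['A']['T'])  # p(T | A)
```

Each row `tbl[b1]` sums to 1 whenever b1 occurs somewhere before the last position, which is a quick sanity check on the result.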
word = 'ATCGATTGAGCTCTAGCG'

def dna_prob2(seq):
    tbl = dict()
    levels = set(seq)
    freq = dict.fromkeys(levels, 0)
    # Count each base as a possible b1; the last base has no successor, so skip it.
    for i in seq[:-1]:
        freq[i] += 1
    for i in levels:
        tbl[i] = {x: 0 for x in levels}
    lastlevel = ''
    # Walk the sequence itself (not the keys of tbl) to count adjacent pairs.
    for i in seq:
        if lastlevel != '':
            tbl[lastlevel][i] += 1
        lastlevel = i
    return tbl, freq

condfreq, freq = dna_prob2(word)
# p(b2 | b1) = condfreq[b1][b2] / freq[b1]
print(condfreq['T']['T'] / freq['T'])
print(condfreq['G']['A'] / freq['G'])
print(condfreq['C']['G'] / freq['C'])
Hope this helps.