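If n is 3, the sequence "ATATATATAG" contains 4x "ATA", 3x "TAT" and 1x "TAG". The ratio is therefore 4/8 = 0.5: the count of the most frequent 3-mer divided by the 10 - 3 + 1 = 8 overlapping 3-mers in the sequence. The higher this number, the more repetitive the sequence.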
Write a function simple(s, n), where s is the sequence and n is the length of the k-mers to consider. The function should return the ratio described above.
Can someone help me solve this?
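from nltk import ngrams
from collections import Counter
def simple(seq, n):
    # Count every overlapping n-gram, then divide the count of the most
    # common one by the total number of n-grams, len(seq) - n + 1.
    return Counter(ngrams(seq, n)).most_common(1)[0][1] / float(len(seq) - n + 1)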
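If NLTK is not installed, the same idea can be sketched with the standard library alone by slicing the string directly. The name simple_stdlib is made up for this illustration, and returning 0.0 when the sequence is shorter than n is an assumption, since the question does not say what should happen in that case (the one-liner above would raise an IndexError there).

```python
from collections import Counter


def simple_stdlib(seq, n):
    """Share of the most frequent n-mer among all overlapping n-mers in seq."""
    total = len(seq) - n + 1  # number of overlapping windows of length n
    if n <= 0 or total <= 0:
        # Assumption: report 0.0 when no n-mer fits; the question leaves this open.
        return 0.0
    counts = Counter(seq[i:i + n] for i in range(total))
    return max(counts.values()) / total


print(simple_stdlib("ATATATATAG", 3))  # 0.5
```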
This looks like homework, but at least it is the brain-teaser kind.
Hint: itertools, generators, and collections are very handy for solving this kind of problem.
import itertools
import collections
ACIDS = ('A', 'C', 'T', 'G')
def walk_seq(s, chunk_size):
    # Generator: yield every overlapping window of length chunk_size.
    assert len(s) >= chunk_size
    for i in range(0, len(s) - chunk_size + 1):
        yield s[i:i+chunk_size]
def simple(s, n):
    snip_counts = collections.defaultdict(int)
    for chunk in walk_seq(s, n):
        # Compare the chunk against every possible n-mer over ACIDS.
        for snip_tuple in itertools.product(ACIDS, repeat=n):
            snip = ''.join(snip_tuple)
            if chunk == snip:
                snip_counts[snip] += 1
    total_matches = sum(snip_counts.values())
    maxi = max(snip_counts.values())
    return float(maxi) / total_matches
print(simple('ATATATATAG', 3))
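One thing to be aware of: the itertools.product loop compares each chunk against all 4^n possible n-mers, which gets slow quickly as n grows. Since a chunk can only match if every character is one of A, C, T, G, a set check should give the same counts. The sketch below is one possible rewrite under that reading, and simple_counter is just a name picked for this example.

```python
from collections import Counter

ACIDS = frozenset("ACTG")


def walk_seq(s, chunk_size):
    """Yield every overlapping window of length chunk_size."""
    for i in range(len(s) - chunk_size + 1):
        yield s[i:i + chunk_size]


def simple_counter(s, n):
    # Keep only chunks made entirely of A, C, T, G; these are exactly the
    # chunks that would equal some element of product(ACIDS, repeat=n).
    counts = Counter(c for c in walk_seq(s, n) if set(c) <= ACIDS)
    if not counts:
        raise ValueError("no valid chunk of length n found in s")
    return max(counts.values()) / sum(counts.values())


print(simple_counter("ATATATATAG", 3))  # 0.5
```

Chunks containing any other character (for example an N) are skipped, so they count neither toward the maximum nor toward the total, just as in the version above.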
This is a nice algorithm exercise that you could also try on your own, but here is a straightforward solution.
s = "ATATATATAG"
n = 3
def simple(s, n):
    dictionary = {}
    total = 0
    for i in range(len(s) - (n - 1)):  # (n-1) to get last element
        k = i + n
        if s[i:k] in dictionary:
            dictionary[s[i:k]] += 1
        else:
            dictionary.update({s[i:k]: 1})
        total += 1  # doing it here to avoid sum(dictionary.values())
    for key, value in dictionary.items():
        dictionary[key] = value / total
        # As a challenge, edit the line above to lambda function
    print(dictionary)
simple(s, n)
# sample output
#{'TAT': 0.375, 'ATA': 0.5, 'TAG': 0.125}
It is as simple as this, using defaultdict from collections and a for loop:
>>> from collections import defaultdict
>>>
>>> def kmer_counter(seq, k):
...     kmers = defaultdict(int)
...     for i in range(len(seq) - k + 1):
...         kmer = seq[i:i+k]
...         kmers[kmer] += 1
...     return kmers
...
>>> s = "ATATATATAG"
>>> n = 3
>>> kmer_counter(s, n)
defaultdict(<class 'int'>, {'ATA': 4, 'TAT': 3, 'TAG': 1})
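kmer_counter returns the raw counts, so the ratio from the question still has to be derived from them. A minimal sketch of that last step follows; top_kmer is a hypothetical helper name, and returning (None, 0.0) when no k-mer fits is an assumption rather than something the question specifies.

```python
from collections import defaultdict


def kmer_counter(seq, k):
    """Count every overlapping k-mer in seq."""
    kmers = defaultdict(int)
    for i in range(len(seq) - k + 1):
        kmers[seq[i:i + k]] += 1
    return kmers


def top_kmer(seq, k):
    """Return (most frequent k-mer, its share of all k-mers)."""
    counts = kmer_counter(seq, k)
    if not counts:
        return None, 0.0  # assumption: no k-mer fits in seq
    best = max(counts, key=counts.get)  # ties go to the k-mer seen first
    return best, counts[best] / sum(counts.values())


print(top_kmer("ATATATATAG", 3))  # ('ATA', 0.5)
```

Returning the k-mer alongside the ratio makes it easy to see which repeat drives the score.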