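I am currently working with a fasta file (a text file) that contains a list of extracted DNA sequences (contigs). Each sequence has a header followed by several lines of nucleotides, which together make up the nucleotide length of that contig. There are 120 contigs, and each entry is marked by a line starting with ">" that gives the sequence information. The nucleotides of that sequence follow this line.
Example: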
>gi|571136972|ref|XM_006625214.1| Plasmodium chabaudi chabaudi small subunit ribosomal protein 5 (Rps5) (rps5) mRNA, complete cds
ATGAGAAATATTTTATTAAAGAAAAAATTATATAATAGTAAAAATATTTATATTTTATATTATATTTTAATAATATTTAAAAGTATTTTTATTATTTTATTTAATAGTAAATATAATGTGAATTATTATTTATATAATAAAATTTATAATTTATTTATTATATATATAAAATTATATTATATTATAAATAATATATATTATAATAATAATTATTATTATATATATAATATGAATTATATA
TATTTTTATATTTATAAATATAATAGTTTAAATAATA
>gi|571136996|ref|XM_006625226.1| Plasmodium chabaudi chabaudi small subunit ribosomal protein 2 (Rps2) (rps2) mRNA, complete cds
ATGTTTATTACATTTAAAGATTTATTAAAATCTAAAATATATATAGGAAATAATTATAAAAATATTTATATTAATAATTATAAATTTATATATAAAATAAAATATAATTATTGTATTTTAAATTTTACATTAATTATATTATATTTATATAAATTATATTTATATATTTATAATATATCTATATTTAATAATAAAATTTTATTTATTATTAATAATAATTTAATTACAAATTTAATTATT
AATATATGTAATTTAACTAATAATTTTTATATTATTA
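What I want to do is make a list of the length of each contig. My problem is that I do not know the syntax needed to tell Python how to do this.
I want a list of integers so that I can calculate the mean contig length, the standard deviation, cool gene equations, and so on.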
Don't reinvent the wheel; use biopython, as Martin suggested. Here is a start that prints the sequence IDs and lengths to the terminal. You can install biopython with pip, i.e.
pip install biopython
from Bio import SeqIO
import sys

FileIn = sys.argv[1]  # path to the fasta file, given on the command line
with open(FileIn, 'r') as handle:
    SeqRecords = SeqIO.parse(handle, 'fasta')
    for record in SeqRecords:  # loop through each fasta entry
        length = len(record.seq)  # get sequence length
        print("%s: %i bp" % (record.id, length))  # print sequence ID: seq length
Or you can store the results in a dictionary:
sequence_lengths = {}
with open(FileIn, 'r') as handle:
    SeqRecords = SeqIO.parse(handle, 'fasta')
    for record in SeqRecords:  # loop through each fasta entry
        length = len(record.seq)  # get sequence length
        sequence_lengths[record.id] = length

# access dictionary outside of loop
print(sequence_lengths)
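As a follow-on sketch for the statistics mentioned in the question: once the lengths are in a dictionary like sequence_lengths, the standard library's statistics module can compute the mean and the standard deviation. The helper name summarize_lengths and the sample IDs and lengths below are made up for illustration; they do not come from the question's file.

```python
import statistics


def summarize_lengths(sequence_lengths):
    """Return (mean, sample standard deviation) of the lengths in a dict.

    sequence_lengths maps sequence ID -> length, as built in the loop above.
    summarize_lengths is a hypothetical helper name, not part of Biopython.
    """
    lengths = list(sequence_lengths.values())
    mean = statistics.mean(lengths)
    # stdev() needs at least two values; report 0.0 for a single contig
    stdev = statistics.stdev(lengths) if len(lengths) > 1 else 0.0
    return mean, stdev


# Made-up example data (hypothetical IDs and lengths)
example = {"contig_1": 100, "contig_2": 200, "contig_3": 300}
mean, stdev = summarize_lengths(example)
print("mean: %.1f bp, stdev: %.1f bp" % (mean, stdev))
```

Note that statistics.stdev is the sample standard deviation; statistics.pstdev would give the population version if that is what you need.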
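This might work for you: it prints the number of ACGT characters in the lines that follow each line containing >:
import re

with open("input.txt") as input_file:
    data = input_file.read()

# split the text on the header lines; the first piece (before the first header) is dropped
data = re.split(r">.*", data)[1:]
# count only uppercase A, C, G and T, which also skips the newlines
data = [sum(1 for ch in datum if ch in 'ACGT') for datum in data]
print(data)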
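For comparison, the same counting can be sketched without a regular expression by walking the input line by line. This is an illustrative variant under a few assumptions: the helper name contig_lengths is made up, every character on a sequence line is counted after stripping surrounding whitespace (so lowercase bases and N are included, unlike the 'ACGT' filter above), and the sample records are invented.

```python
def contig_lengths(lines):
    """Return a list with one sequence length per FASTA record.

    lines is any iterable of text lines, e.g. an open file.
    contig_lengths is a hypothetical helper name used for illustration.
    """
    lengths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            lengths.append(0)  # a header starts a new record
        elif lengths:
            lengths[-1] += len(line)  # add this line to the current record
    return lengths


# Invented two-record example; with a real file, pass the open file object
sample = [">seq1 demo", "ATGAGA", "TAT", ">seq2 demo", "ATGTTT"]
print(contig_lengths(sample))
```

Reading line by line avoids loading the whole file into memory, which may matter for large assemblies.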
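Thank you for all the help. I have looked into the biopython material and am excited to understand it and incorporate it. The overall goal of this assignment is to teach me how to understand Python rather than to find the solution directly, or at least, if I do find a solution, I must be able to explain it in my own words.
Anyway, I have created code that includes that element along with others. I still have a few more things to do, and I will come back and ask if I get confused.
Here is my first working code outside of things I worked on directly with my supervisor or tutorials I made and understood (woo!):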
import re

# first pass: count the header lines (one per contig)
with open("COPYFORTESTINGplastid.1.rna.fna") as fasta:
    contigs = 0
    for line in fasta:
        if line.strip().startswith('>'):
            contigs = contigs + 1

# second pass: split on the header lines and count the bases in each record
with open("COPYFORTESTINGplastid.1.rna.fna") as fasta:
    data = fasta.read()
    data = re.split(r">.*", data)[1:]
    data = [sum(1 for ch in datum if ch in 'ACGT') for datum in data]

print("Total number of contigs: %s" % contigs)
total_contigs = sum(data)
N50 = sum(data) / 2  # half of the total length, not the N50 itself
print("number used to determine N50 = %s" % N50)

total = 0
for n in data:
    total = total + n
mean = total / len(data)
print("mean length of contigs: %s" % mean)
print("total nucleotides in fasta = %s" % total_contigs)

#print "list of contigs by length: %s" %sorted([data])
l = data  # l is the same list object as data, so the sort below also reorders data
l.sort(reverse = True)
print("list of contigs by length: %s" % l)
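This does what I want, but if you have any comments or suggestions, I would love to hear them.
Next up: using this sweet, sweet list to determine the N50. Thanks again!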
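I created a function to calculate the N50, and it seems to work well. I can parse the command line and run any .fa file through the program.
def calc_n50(array):
    array.sort(reverse = True)  # sorts the caller's list in place, longest first
    n50 = 0  # sums lengths
    n = 0  # n50 sequence
    half = sum(array) / 2
    for val in array:
        n50 += val
        if n50 >= half:
            n = val
            break  # breaks loop when condition is met
    print("N50 is", n)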
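A variant of the same idea that returns the value instead of printing it, and leaves the caller's list in its original order, might look like the sketch below. The name n50 and the example lengths are illustrative assumptions; the logic follows the function above (walk the lengths from longest to shortest until the running total reaches half of the overall total).

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (None for an empty list).

    n50 is a hypothetical name; the approach mirrors calc_n50 above.
    """
    ordered = sorted(lengths, reverse=True)  # sorted() copies; input is untouched
    half = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half:
            return length
    return None  # only reached when lengths is empty


# Invented example: total is 290, half is 145; 80 + 70 = 150 reaches it
example = [40, 80, 20, 70, 30, 50]
print("N50 is", n50(example))
```

Returning the value makes the function easier to reuse, for example to store the N50 alongside the mean and the total length.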