KeyError: 'm' when building a codon alignment with Bio.codonalign

Problem description

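I am trying to use Bio.codonalign to build a codon alignment of two gene sequences from a protein alignment. The Biopython example is given here (under the "build" function): https://biopython.org/DIST/docs/api/Bio.codonalign-module.html. I tried that example and it worked.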

Now I want to read the sequences from FASTA files (ap_20 contains the aligned proteins and ug_20 contains the genes). My code is below.

# Import packages
from Bio.Alphabet import generic_dna, generic_protein
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
from Bio.codonalign import build

# Define set of orthologous genes and proteins
genes = list(SeqIO.parse("ug_20.fasta", "fasta"))
proteins = list(SeqIO.parse("ap_20.fasta", "fasta"))

# Assign individual sequences to variables
seq1 = SeqRecord(Seq(str(genes[0].seq), alphabet=generic_dna), id="pro1")
seq2 = SeqRecord(Seq(str(genes[1].seq), alphabet=generic_dna), id="pro2")

pro1 = SeqRecord(Seq(str(proteins[0].seq), alphabet=generic_protein), id="pro1")
pro2 = SeqRecord(Seq(str(proteins[1].seq), alphabet=generic_protein), id="pro2")

# MultipleSeqAlignment reads the protein alignment
aln = MultipleSeqAlignment([pro1, pro2])
print(aln)

# Build codon alignment
codon_aln = build(aln, [seq1, seq2])
print(codon_aln)

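Creating aln works, but the final build step fails with the error below. I'm not sure what KeyError: 'm' means, but I do know that all of my protein sequences start with the letter 'm'. I replaced part of the file paths with "..." to keep them short.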

Traceback (most recent call last):
  File "/Users/.../tempCodeRunnerFile.py", line 30, in <module>
    codon_aln = build(aln, [seq1, seq2])
  File "/Users/.../anaconda3/lib/python3.6/site-packages/Bio/codonalign/__init__.py", line 168, in build
    anchor_len=anchor_len)
  File "/Users/.../anaconda3/lib/python3.6/site-packages/Bio/codonalign/__init__.py", line 261, in _check_corr
    pro_re += aa2re[aa]
KeyError: 'm'

Tags: python biopython keyerror

1 Answer

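You did not provide (part of) your input files ug_20.fasta and ap_20.fasta, which makes this harder to debug, but I can trigger a similar error with the following code: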

>>> from Bio.Alphabet import generic_dna, generic_protein
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Align import MultipleSeqAlignment
>>> from Bio.codonalign import build
>>> seq1 = SeqRecord(Seq('ATGTCTCGT', alphabet=generic_dna), id='pro1')
>>> seq2 = SeqRecord(Seq('ATGCGT', alphabet=generic_dna), id='pro2')
>>> pro1 = SeqRecord(Seq('MSR', alphabet=generic_protein), id='pro1')
>>> pro2 = SeqRecord(Seq('m-R', alphabet=generic_protein), id='pro2')
>>> aln = MultipleSeqAlignment([pro1, pro2])
>>> codon_aln = build(aln, [seq1, seq2])
>>> print(codon_aln)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-25-da1b827fb67e> in <module>()
      9 pro2 = SeqRecord(Seq('m-R', alphabet=generic_protein), id='pro2')
     10 aln = MultipleSeqAlignment([pro1, pro2])
---> 11 codon_aln = build(aln, [seq1, seq2])
     12 print(codon_aln)

1 frames
/usr/local/lib/python3.6/dist-packages/Bio/codonalign/__init__.py in _check_corr(pro, nucl, gap_char, codon_table, complete_protein, anchor_len)
    259     for aa in pro.seq:
    260         if aa != gap_char:
--> 261             pro_re += aa2re[aa]
    262 
    263     nucl_seq = str(nucl.seq.upper().ungap(gap_char))

KeyError: 'm'

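This is the default Bio.codonalign.build example with a single change: in pro2, I changed 'M-R' to 'm-R'. That suggests that at least one of your protein sequences contains lowercase characters, while Bio.codonalign.build() appears to expect uppercase characters. You can convert the protein sequences to uppercase like this: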

pro1 = SeqRecord(Seq(str(proteins[0].seq.upper()), alphabet=generic_protein), id="pro1")
pro2 = SeqRecord(Seq(str(proteins[1].seq.upper()), alphabet=generic_protein), id="pro2")
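
To check which residues are affected before calling build, a minimal sketch along the following lines may help. It uses plain strings so that it runs without Biopython. The names UPPERCASE_RESIDUES, find_lowercase and lookup_all are hypothetical helpers written for illustration, and the residue set is only a stand-in for the aa2re table seen in the traceback, not a copy of it.

```python
# Hypothetical stand-in for the residue table that appears as `aa2re` in the
# traceback. The real table lives inside Bio.codonalign; this set only models
# the fact that its keys are uppercase letters.
UPPERCASE_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def find_lowercase(sequence, gap_char="-"):
    """Return (position, character) pairs for every lowercase residue."""
    return [
        (index, char)
        for index, char in enumerate(sequence)
        if char != gap_char and char.islower()
    ]


def lookup_all(sequence, gap_char="-"):
    """Mimic the failing loop: look up every non-gap residue in the table."""
    for char in sequence:
        if char != gap_char and char not in UPPERCASE_RESIDUES:
            raise KeyError(char)
    return True


aligned_protein = "m-R"  # same row as pro2 in the reproduction above

# Report the offending positions first.
print(find_lowercase(aligned_protein))

# Uppercasing leaves the gap character untouched and makes the lookup pass.
normalized = aligned_protein.upper()
print(lookup_all(normalized))
```

With real records, the same check can be applied to str(record.seq) for each entry read from ap_20.fasta, and the fix stays the one shown above: call upper() on the sequence before wrapping it in a SeqRecord.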