How can you parse an fna.gz file in Python?


Given my fna.gz genome as input, I want to return the nth base pair. In theory it would work like this:

allele = genome[14325]
print(allele)
#: G

Here is my current code:

from Bio import SeqIO
import gzip
from Bio.Alphabet import generic_dna

input_file = r"C:\Users\blake\PycharmProjects\Transcendence3.0\DNA\GCF_000001405.38_GRCh38.p12_genomic.fna.gz"
output_file = r"C:\Users\blake\PycharmProjects\Transcendence3.0\DNA\Probabilities"

with gzip.open(input_file, "rt") as handle:
    for record in SeqIO.parse(input_file, "fasta", generic_dna):
        fasta_sequences = SeqIO.parse(open(input_file), 'fasta')
        print("seq parsed")
        with open(output_file) as out_file:
            for fasta in fasta_sequences:
                name, sequence = fasta.id, str(fasta.seq)
                new_allele = tell_basepair(sequence)
                write_fasta(out_file)

def tell_basepair(n, seq):
  bp = seq[n-1]
  return bp

But it doesn't work, and I get an error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 386: character maps to <undefined>
python bioinformatics biopython
1 Answer

You can:

  • Try opening the .gz file with gzip first, then pass the open handle (not the file name) to the FASTA parser [1]:
    with gzip.open("practicezip.fasta.gz", "rt") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            #your code
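  • If the error persists, specify the alphabet in the parse function. As the Biopython SeqIO documentation puts it: "For file formats like FASTA where the alphabet cannot be determined, it may be useful to specify the alphabet explicitly". Note that Bio.Alphabet was removed in Biopython 1.78, so this variant only works with older releases: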
    from Bio import SeqIO
    from Bio.Alphabet import generic_dna 
    filename = "yourfastafilename"
    for record in SeqIO.parse(filename, "fasta", generic_dna):
        # your code
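
The first option works because gzip.open(..., "rt") decompresses and decodes the file before the parser sees it, whereas a plain open() hands over the raw compressed bytes, which is the likely source of the UnicodeDecodeError. A minimal sketch using only the standard library; the file name demo.fna.gz and the record content are made up for illustration:

```python
import gzip
import os
import tempfile

# Hypothetical two-line FASTA record used only for this demonstration.
fasta_text = ">demo_record\nACGTACGTGG\n"

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.fna.gz")

# Write the record as gzip-compressed text.
with gzip.open(path, "wt") as handle:
    handle.write(fasta_text)

# Plain open() in binary mode shows the raw compressed bytes;
# every gzip file starts with the magic bytes 0x1f 0x8b.
with open(path, "rb") as raw:
    magic = raw.read(2)

# gzip.open() in text mode ("rt") decompresses and decodes on the fly.
with gzip.open(path, "rt") as handle:
    decoded = handle.read()

print(magic)
print(decoded == fasta_text)
```

Reading those compressed bytes through a text-mode open() forces Python to decode them with the platform's default codec, which fails as soon as it meets a byte that codec does not define.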

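Apart from the UnicodeDecodeError, you also need to define the function you call (tell_basepair in your code) before the line that calls it, otherwise Python will not know what to do at that point. It also has to be called with both arguments; the question's code passes only the sequence. For example: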

def tell_basepair(n, seq):
  bp = seq[n-1]
  return bp
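
Putting the pieces together, here is a rough sketch of the whole flow: the function is defined before it is used, the gzip handle is what gets parsed, and tell_basepair receives both the position and the sequence. To keep it runnable without Biopython, read_fasta is a hypothetical minimal reader standing in for SeqIO.parse; with Biopython you would loop over SeqIO.parse(handle, "fasta") and use str(record.seq) instead. The file name and record are invented for the demo.

```python
import gzip
import os
import tempfile


def tell_basepair(n, seq):
    """Return the n-th base of seq, counting from 1."""
    if not 1 <= n <= len(seq):
        raise IndexError(f"position {n} is outside 1..{len(seq)}")
    return seq[n - 1]


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle.

    Hypothetical minimal reader standing in for SeqIO.parse.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Build a tiny gzipped FASTA file (made-up data) to run against.
path = os.path.join(tempfile.mkdtemp(), "demo.fna.gz")
with gzip.open(path, "wt") as out:
    out.write(">chr_demo example record\nACGTAC\nGTGGTT\n")

# Open in text mode and parse the handle, not the file name.
results = {}
with gzip.open(path, "rt") as handle:
    for name, sequence in read_fasta(handle):
        results[name] = tell_basepair(9, sequence)

print(results)
```

For a full human genome, each record is an entire chromosome or scaffold, so the position is relative to that record rather than to the file as a whole.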