Extracting gene start positions from a .fasta file in Python using Biopython

Question (votes: 0, answers: 1)

I have a .fasta file containing multiple genes. They all have similar descriptions, for example:

>lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]

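I'm trying to extract the gene start position for every one of these genes (i.e. the "1" in the example above). I tried the following code, but it doesn't seem to work.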

from Bio import SeqIO
genes = fasta_file.fasta
records = SeqIO.parse(open(genes), 'fasta')
record = next(records)
parts = record.description.split("..")
print(parts[0])

Any help or pointers to resources would be greatly appreciated!

python biopython genetics
1 Answer
0 votes

This worked for me. Hope it helps.

import re
from Bio import SeqIO

genes = "fasta_file.fasta"
records = SeqIO.parse(genes, 'fasta')

# fasta_file.fasta contains only this one header line:
# >lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]

You can get the records with SeqIO.parse(filename, "fasta"). To check what each record contains:

for record in SeqIO.parse(genes, 'fasta'):
    print(record)

This prints the output below. Note that record.description holds the full header string.

ID: lcl|NZ_LN831034.1_cds_WP_002987659.1_1
Name: lcl|NZ_LN831034.1_cds_WP_002987659.1_1
Description: lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
Number of features: 0
Seq('', SingleLetterAlphabet())

Use a regular expression to capture the number that follows "location=".

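# Raw string so that \d and \. reach the regex engine unchanged.
ma = re.search(r"location=(\d+)\.\.\d+", record.description)
ma.group(1)  # '1' (a string; wrap in int() if a number is needed)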
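
As a rough sketch of how the regex step above could be wrapped for use on every record, the helper below works on a plain description string and needs only the standard library. The names `LOCATION_START` and `gene_start` are made up for this illustration. It also assumes that some headers may wrap the coordinates, as in `complement(100..200)` or `join(5..10,20..30)`, or mark a partial start with `<`. In those cases it returns the first coordinate listed, which is not necessarily the biological start of a gene on the reverse strand.

```python
import re

# First coordinate inside [location=...]; tolerates optional wrappers such as
# "complement(" or "join(" and a leading "<" (partial start). Hypothetical helper.
LOCATION_START = re.compile(r"\[location=(?:[a-z]+\()*<?(\d+)")


def gene_start(description):
    """Return the first listed coordinate as an int, or None if there is no location tag."""
    match = LOCATION_START.search(description)
    return int(match.group(1)) if match else None


header = (
    "lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] "
    "[location=1..1356] [gbkey=CDS]"
)
print(gene_start(header))
```

With Biopython, the same helper could be applied to the whole file with something like `starts = [gene_start(r.description) for r in SeqIO.parse("fasta_file.fasta", "fasta")]`.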