I have a .fasta file that contains multiple genes. They all have similar descriptions, for example:
>lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
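I am trying to extract the gene start position for all of these genes (i.e. the "1" in the example above). I tried the following code, but it does not seem to work.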
from Bio import SeqIO
genes = fasta_file.fasta
records = SeqIO.parse(open(genes), 'fasta')
record = next(records)
parts = record.description.split("..")
print(parts[0])
Any help or resources would be greatly appreciated!
This worked for me. Hope it helps.
import re
from Bio import SeqIO
genes = "fasta_file.fasta"
records = SeqIO.parse(genes, 'fasta')
# fasta_file.fasta contains only this header line:
# >lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
You can use SeqIO.parse(filename, "fasta") to get the records. To check this:
for record in SeqIO.parse(genes, 'fasta'):
    print(record)
This prints the following, and record.description holds the header information as a string.
ID: lcl|NZ_LN831034.1_cds_WP_002987659.1_1
Name: lcl|NZ_LN831034.1_cds_WP_002987659.1_1
Description: lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
Number of features: 0
Seq('', SingleLetterAlphabet())
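As a side note, since record.description is just a string of [key=value] fields, one way to inspect it is to split it into a dictionary. This is only a rough sketch: the description is hard-coded from the example header so the snippet runs without the file, the names `description`, `fields` and `start` are my own (not part of Biopython), and it assumes no value contains a `]`.

```python
import re

# Hard-coded copy of record.description from the example header above,
# so this snippet runs without the FASTA file or Biopython.
description = (
    "lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] "
    "[locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] "
    "[protein=chromosomal replication initiator protein DnaA] "
    "[protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]"
)

# Collect every [key=value] pair; assumes values never contain "]".
fields = dict(re.findall(r"\[([^=\]]+)=([^\]]*)\]", description))

print(fields["gene"])      # dnaA
print(fields["location"])  # 1..1356

# For a simple "start..end" location, the start is the part before "..".
start = int(fields["location"].split("..")[0])
print(start)               # 1
```

This only handles plain "start..end" locations; anything more complex would need extra parsing.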
Use a regular expression to grab the number that follows "location=".
ma = re.search(r"location=(\d+)\.\.\d+", record.description)
ma.groups()[0]  # '1' (a string; use int(...) if you need a number)
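To get the start for every gene rather than a single record, something like the sketch below might work. The helper name `gene_start` and the second header are hypothetical, made up for illustration. I am also assuming that some headers wrap the coordinates in `complement(...)` or `join(...)`, or mark a partial start with `<`; if your file only has plain `start..end` locations, the simpler pattern above is enough. Note that this returns the first coordinate listed, which for a reverse-strand gene is not necessarily the biological start.

```python
import re

# First coordinate inside [location=...], skipping optional complement( / join(
# wrappers and an optional "<" partial-start marker (assumed formats).
LOCATION_START = re.compile(r"\[location=(?:complement\(|join\()*<?(\d+)")


def gene_start(description):
    """Return the first listed coordinate as an int, or None if there is no location field."""
    match = LOCATION_START.search(description)
    return int(match.group(1)) if match else None


headers = [
    # Header from the question.
    "lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] "
    "[locus_tag=B6D67_RS00005] [protein_id=WP_002987659.1] "
    "[location=1..1356] [gbkey=CDS]",
    # Hypothetical reverse-strand header, invented for this example.
    "lcl|example_cds_2 [gene=exampleB] [location=complement(2000..2500)] [gbkey=CDS]",
    # Hypothetical header with no location field at all.
    "lcl|example_cds_3 [gene=exampleC] [gbkey=CDS]",
]

starts = [gene_start(h) for h in headers]
print(starts)  # [1, 2000, None]
```

With Biopython installed, the same helper can be applied to the real file with `[gene_start(r.description) for r in SeqIO.parse("fasta_file.fasta", "fasta")]`.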