How to group the parts of FASTA file headers


The headers in my FASTA file look like this:

>ref|NC_001133| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=I]

>ref|NC_001134| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=II]

>ref|NC_001135| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=III]

>ref|NC_001136| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=IV]

>ref|NC_001137| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=V]

>ref|NC_001138| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=VI]

>ref|NC_001139| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=VII]

>ref|NC_001140| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=VIII]

>ref|NC_001141| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=IX]

>ref|NC_001142| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=X]

>ref|NC_001143| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=XI]

>ref|NC_001144| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=XII]

>ref|NC_001145| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=XIII]

>ref|NC_001146| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=XIV]

>ref|NC_001147| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=XV]

>ref|NC_001148| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=XVI]

>ref|NC_001224| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [location=mitochondrion] [top=circular]

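I need to move the leading identifier of each header (for example >ref|NC_001133|) and put the corresponding position tag (for example '[chromosome=I]') in its place, because I need the headers in the right format for an upcoming run. As a first step, I want to capture each part of the header as a regex group. However, the last line, with its mitochondrion fields, makes it hard for me to group every item correctly. I would really appreciate help with a regular expression for the grouping.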

Here is part of the code I tried:

#!/usr/bin/env python

import re
import subprocess
from sys import argv


def get_fasta_rec(input_fasta):
    """Find all FASTA entries in a FASTA file, change the headers and return them in a dictionary.

    input_fasta -- FASTA file name
    record_dict -- dict, {header:seq}
    """
    entries = input_fasta.split(">")[1:]
    dict_entry = {}
    for entry in entries:
        header, x, seq = entry.partition("\n")
        m = re.search(r"(.+) (.+\s.+) (.+) (.+) (.+|('[location=mitochondrion] [top=circular]'))", header)

        if m:
            ref = m.group(1)
            org = m.group(2)
            strain = m.group(3)
            moltype = m.group(4)
            location = m.group(5)


if __name__ == '__main__':
    input_fasta = open(argv[1]).read()
    get_fasta_rec(input_fasta)

The output I want for each header is:

> [chromosome=I] [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [ref|NC_001133|]
> [location=mitochondrion] [top=circular] [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [ref|NC_001224|]

Thank you for your help.

python regex
1 Answer

See Regex101 for an explanation of this pattern:

(?:(>ref\|.+\|)\s)?((?:\[[\S\d]+=[^\]]+\])+)\s?

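You will need to use re.findall(...). Each match is then a tuple: the attribute (for example [strain=S288C]) is in group 2 of every match, and group 1 holds the >ref|...| prefix on the first match of a header and is empty on the others.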
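As a rough sketch of how the findall approach could produce the layout requested in the question: the pattern below is the one from this answer, while the helper names reorder_header and rewrite_fasta_text, and the COMMON_KEYS tuple, are made up for illustration. It assumes that org, strain and moltype always stay in the middle, that every other attribute (chromosome, location, top) moves to the front, and that attributes are separated by spaces as in the sample headers.

```python
import re

# Pattern from the answer: group 1 is the optional ">ref|...|" prefix,
# group 2 is a single "[key=value]" attribute.
HEADER_PATTERN = re.compile(r"(?:(>ref\|.+\|)\s)?((?:\[[\S\d]+=[^\]]+\])+)\s?")

# Assumption: these keys stay in the middle of the new header; every other
# attribute (chromosome, location, top) is moved to the front.
COMMON_KEYS = ("org", "strain", "moltype")


def reorder_header(header):
    """Rearrange one header line into the layout asked for in the question."""
    matches = HEADER_PATTERN.findall(header)
    if not matches or not matches[0][0]:
        return header  # leave headers without a ">ref|...|" prefix untouched
    ref = matches[0][0].lstrip(">")  # e.g. "ref|NC_001133|"
    attributes = [attr for _, attr in matches]
    # The key is the text between "[" and the first "=".
    common = [a for a in attributes if a[1:].split("=", 1)[0] in COMMON_KEYS]
    position = [a for a in attributes if a not in common]
    return "> " + " ".join(position + common + ["[" + ref + "]"])


def rewrite_fasta_text(fasta_text):
    """Apply reorder_header to header lines; sequence lines are kept as-is."""
    return "\n".join(
        reorder_header(line) if line.startswith(">") else line
        for line in fasta_text.splitlines()
    )


if __name__ == "__main__":
    print(reorder_header(
        ">ref|NC_001224| [org=Saccharomyces cerevisiae] [strain=S288C] "
        "[moltype=genomic] [location=mitochondrion] [top=circular]"
    ))
```

Splitting by key rather than by position means the mitochondrion header, which has two trailing attributes instead of one, needs no special case. Note that rewrite_fasta_text drops a trailing newline because it joins the lines again, so add one back if the downstream tool needs it.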
