Retrieving the Gene Ontology hierarchy with Python


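I am trying to use Python to parse Gene Ontology (GO) terms from an OBO file and display them hierarchically. I have made some progress, but I cannot correctly handle multiple is_a relationships within the same term. My goal is a hierarchy that takes every is_a relationship into account.

I am working with a subset of the Gene Ontology data from the go-basic.obo file. Here is a sample of the data format: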

format-version: 1.2
data-version: releases/2023-06-11
subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_drosophila "Drosophila GO slim"
subsetdef: goslim_flybase_ribbon "FlyBase Drosophila GO ribbon slim"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
subsetdef: prokaryote_subset "GO subset for prokaryotes"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
ontology: go

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0048308
name: organelle inheritance
namespace: biological_process
def: "The partitioning of organelles between daughter cells at cell division." [GOC:jid]
subset: goslim_pir
subset: goslim_yeast
is_a: GO:0006996 ! organelle organization

[Term]
id: GO:0007029
name: endoplasmic reticulum organization
namespace: biological_process
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of the endoplasmic reticulum." [GOC:dph, GOC:jl, GOC:mah]
subset: goslim_pir
synonym: "endoplasmic reticulum morphology" RELATED []
synonym: "endoplasmic reticulum organisation" EXACT []
synonym: "endoplasmic reticulum organization and biogenesis" RELATED [GOC:mah]
synonym: "ER organisation" EXACT []
synonym: "ER organization and biogenesis" RELATED [GOC:mah]
is_a: GO:0006996 ! organelle organization
relationship: part_of GO:0010256 ! endomembrane system organization

[Term]
id: GO:0048309
name: endoplasmic reticulum inheritance
namespace: biological_process
def: "The partitioning of endoplasmic reticulum between daughter cells at cell division." [GOC:jid]
synonym: "ER inheritance" EXACT []
is_a: GO:0007029 ! endoplasmic reticulum organization
is_a: GO:0048308 ! organelle inheritance

[Term]
id: GO:0048313
name: Golgi inheritance
namespace: biological_process
def: "The partitioning of Golgi apparatus between daughter cells at cell division." [GOC:jid, PMID:12851069]
synonym: "Golgi apparatus inheritance" EXACT []
synonym: "Golgi division" EXACT [GOC:ascb_2009, GOC:dph, GOC:tb]
synonym: "Golgi partitioning" EXACT []
is_a: GO:0007030 ! Golgi organization
is_a: GO:0048308 ! organelle inheritance

[Term]
id: GO:0007030
name: Golgi organization
namespace: biological_process
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of the Golgi apparatus." [GOC:dph, GOC:jl, GOC:mah]
subset: goslim_pir
synonym: "Golgi apparatus organization" EXACT []
synonym: "Golgi organisation" EXACT []
synonym: "Golgi organization and biogenesis" RELATED [GOC:mah]
is_a: GO:0006996 ! organelle organization
relationship: part_of GO:0010256 ! endomembrane system organization

[Term]
id: GO:0090166
name: Golgi disassembly
namespace: biological_process
def: "A cellular process that results in the breakdown of a Golgi apparatus that contributes to Golgi inheritance." [GOC:ascb_2009, GOC:dph, GOC:tb]
synonym: "Golgi apparatus disassembly" EXACT []
is_a: GO:0007030 ! Golgi organization
is_a: GO:1903008 ! organelle disassembly
relationship: part_of GO:0048313 ! Golgi inheritance

[Term]
id: GO:1903008
name: organelle disassembly
namespace: biological_process
def: "The disaggregation of an organelle into its constituent components." [GO_REF:0000079, GOC:TermGenie]
synonym: "organelle degradation" EXACT []
is_a: GO:0006996 ! organelle organization
is_a: GO:0022411 ! cellular component disassembly


[Term]
id: GO:0006996
name: organelle organization
namespace: biological_process
alt_id: GO:1902589
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of an organelle within a cell. An organelle is an organized structure of distinctive morphology and function. Includes the nucleus, mitochondria, plastids, vacuoles, vesicles, ribosomes and the cytoskeleton. Excludes the plasma membrane." [GOC:mah]
subset: goslim_candida
subset: goslim_pir
synonym: "organelle organisation" EXACT []
synonym: "organelle organization and biogenesis" RELATED [GOC:dph, GOC:jl, GOC:mah]
synonym: "single organism organelle organization" EXACT [GOC:TermGenie]
synonym: "single-organism organelle organization" RELATED []
is_a: GO:0016043 ! cellular component organization

This is the code I used:

def parse_obo(file_path):
    terms = {}
    current_term = None
    
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                if current_term:
                    terms[current_term['id']] = current_term
                    current_term = None
            elif line.startswith('[Term]'):
                if current_term:
                    terms[current_term['id']] = current_term
                current_term = {'id': ''}
            elif current_term:
                parts = line.split(': ', 1)
                if len(parts) == 2:
                    current_term[parts[0]] = parts[1]
    
    return terms

def display_hierarchy(terms, term_id, indent=0):
    if term_id in terms:
        term = terms[term_id]
        print(' ' * indent + term_id)
        
        if 'is_a' in term:
            parent_ids = [parent.split()[1] for parent in term['is_a'] if len(parent.split()) > 1]
            for parent_id in parent_ids:
                display_hierarchy(terms, parent_id, indent + 4)

        if 'id' in term:
            child_ids = [child_id for child_id in terms if term_id in terms[child_id].get('is_a', [])]
            for child_id in child_ids:
                display_hierarchy(terms, child_id, indent + 4)

if __name__ == "__main__":
    file_path = 'go-basic_1.obo'
    terms = parse_obo(file_path)
    
    for term_id in terms:
        display_hierarchy(terms, term_id, indent=0)

This is the output I get:

GO:0000001
GO:0048308
    GO:0048309
    GO:0048313
GO:0007029
GO:0048309
GO:0048313
GO:0007030
GO:0090166
GO:1903008
    GO:0090166
GO:0006996
    GO:0048308
        GO:0048309
        GO:0048313
    GO:0007029
    GO:0007030

But this is the result I want:

GO:0006996
    GO:1903008
        GO:0090166
    GO:0048308
        GO:0048313
        GO:0000001
        GO:0048309
    GO:0007029
        GO:0048309
    GO:0007030
        GO:0048313
        GO:0090166
GO:0048311
    GO:0000001

I want to plot Gene Ontology results for my genomic data, so this is my starting point. Please help.

python python-3.x recursion bioinformatics ontology-mapping
1 Answer

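There are a few things you need to take care of:

  • Because is_a can occur more than once per term, you need to collect its values in a collection; otherwise each new value overwrites the previous one, and only the last is_a line encountered for each term is kept. I would generalize this and give every key in a term a list value, except for id, which should occur only once per term.

  • To display the hierarchy, a parent-to-child relationship is more useful than a child-to-parent one. I therefore suggest adding a separate function that adds this reverse relationship to the terms.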
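
The first point can be seen on a single stanza. This is only a minimal sketch, assuming the lines of one term are already in memory; the names `stanza`, `overwritten`, and `collected` are illustrative and not part of the question's code.

```python
# Lines of one [Term] stanza, taken from the sample data in the question.
stanza = [
    "id: GO:0000001",
    "name: mitochondrion inheritance",
    "is_a: GO:0048308 ! organelle inheritance",
    "is_a: GO:0048311 ! mitochondrion distribution",
]

overwritten = {}  # plain assignment: a repeated key replaces the earlier value
collected = {}    # list values: every occurrence of a key is kept

for line in stanza:
    key, value = line.split(": ", 1)
    overwritten[key] = value
    collected.setdefault(key, []).append(value)

print(overwritten["is_a"])  # only the last is_a line survives
print(collected["is_a"])    # both is_a lines are kept
```

With plain assignment the parent GO:0048308 is lost, which is why part of the hierarchy goes missing.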

It could look like this:

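def parse_obo(file_path):
    """Parse term stanzas into {id: {key: [values, ...]}}."""
    terms = {}
    current_term = {}
    isterm = False  # stays False until the first 'id:' line, so the file header is skipped
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            isterm = isterm or line.startswith('id:')
            if isterm and ": " in line:
                key, value = line.split(': ', 1)
                if key == "id":
                    # An 'id' line starts a new term; it is the only single-valued key.
                    current_term = terms.setdefault(value, {})
                    current_term["id"] = value
                else:
                    # Any other key may repeat (e.g. is_a), so collect its values in a list.
                    current_term.setdefault(key, []).append(value)
    return terms

def make_hierarchy(terms):
    """Add a 'children' list to every term and return the root terms."""
    roots = set(terms.keys())
    for term_id, term in terms.items():
        term.setdefault("children", [])
        if "is_a" in term:
            for is_a in term["is_a"]:
                # 'GO:0048308 ! organelle inheritance' -> 'GO:0048308'
                parent = is_a.split()[0]
                if parent in terms:
                    terms[parent].setdefault("children", []).append(term)
                    # A term with a parent defined in the file is not a root.
                    roots.discard(term_id)
    return [term for term_id, term in terms.items() if term_id in roots]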
    
def display_hierarchy(terms, indent=""):
    for term in terms:
        print(f"{indent}{term['id']}")
        display_hierarchy(term['children'], indent + "  ")

if __name__ == "__main__":
    file_path = 'go-basic_1.obo'
    terms = parse_obo(file_path)
    roots = make_hierarchy(terms)
    display_hierarchy(roots)
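
The desired output in the question also lists GO:0048311, which has no [Term] stanza in the sample, so the code above never prints it. This is because `make_hierarchy` only links parents that are defined in the file. The following is a sketch of one possible variant that keeps such referenced-only parents as roots. It assumes the is_a parent ids have already been extracted into a dict; `parents_by_term`, `build_children`, `find_roots`, and `render` are hypothetical names introduced here. The input is deliberately shortened: in the real sample GO:0006996 has an is_a parent GO:0016043, and this approach would then show GO:0016043 as the root instead.

```python
from collections import defaultdict


def build_children(parents_by_term):
    """Invert child -> parents into parent -> children, keeping undefined parents."""
    children = defaultdict(list)
    for term_id, parent_ids in parents_by_term.items():
        for parent_id in parent_ids:
            children[parent_id].append(term_id)
    return children


def find_roots(parents_by_term, children):
    """Roots are ids with no recorded parent; sorted for stable output."""
    all_ids = set(parents_by_term) | set(children)
    return sorted(i for i in all_ids if not parents_by_term.get(i))


def render(term_id, children, indent=""):
    """Return the subtree below term_id as a list of indented lines."""
    lines = [indent + term_id]
    for child_id in children.get(term_id, []):
        lines.extend(render(child_id, children, indent + "    "))
    return lines


# Hypothetical, shortened input: is_a parent ids for three terms from the sample.
parents_by_term = {
    "GO:0000001": ["GO:0048308", "GO:0048311"],
    "GO:0048308": ["GO:0006996"],
    "GO:0006996": [],
}
children = build_children(parents_by_term)
for root in find_roots(parents_by_term, children):
    print("\n".join(render(root, children)))
```

Because a term with several is_a parents is a child of each of them, it is printed once under every parent, as GO:0000001 is here.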