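I am trying to parse Gene Ontology (GO) terms from an OBO file and display them hierarchically using Python. I have made some progress, but I am having trouble correctly handling multiple is_a relationships within the same term. My goal is a hierarchy that takes all is_a relationships into account.
I am working with a subset of the Gene Ontology data from the go-basic.obo file. Here is a sample of the data format: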
format-version: 1.2
data-version: releases/2023-06-11
subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_drosophila "Drosophila GO slim"
subsetdef: goslim_flybase_ribbon "FlyBase Drosophila GO ribbon slim"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
subsetdef: prokaryote_subset "GO subset for prokaryotes"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
ontology: go
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
[Term]
id: GO:0048308
name: organelle inheritance
namespace: biological_process
def: "The partitioning of organelles between daughter cells at cell division." [GOC:jid]
subset: goslim_pir
subset: goslim_yeast
is_a: GO:0006996 ! organelle organization
[Term]
id: GO:0007029
name: endoplasmic reticulum organization
namespace: biological_process
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of the endoplasmic reticulum." [GOC:dph, GOC:jl, GOC:mah]
subset: goslim_pir
synonym: "endoplasmic reticulum morphology" RELATED []
synonym: "endoplasmic reticulum organisation" EXACT []
synonym: "endoplasmic reticulum organization and biogenesis" RELATED [GOC:mah]
synonym: "ER organisation" EXACT []
synonym: "ER organization and biogenesis" RELATED [GOC:mah]
is_a: GO:0006996 ! organelle organization
relationship: part_of GO:0010256 ! endomembrane system organization
[Term]
id: GO:0048309
name: endoplasmic reticulum inheritance
namespace: biological_process
def: "The partitioning of endoplasmic reticulum between daughter cells at cell division." [GOC:jid]
synonym: "ER inheritance" EXACT []
is_a: GO:0007029 ! endoplasmic reticulum organization
is_a: GO:0048308 ! organelle inheritance
[Term]
id: GO:0048313
name: Golgi inheritance
namespace: biological_process
def: "The partitioning of Golgi apparatus between daughter cells at cell division." [GOC:jid, PMID:12851069]
synonym: "Golgi apparatus inheritance" EXACT []
synonym: "Golgi division" EXACT [GOC:ascb_2009, GOC:dph, GOC:tb]
synonym: "Golgi partitioning" EXACT []
is_a: GO:0007030 ! Golgi organization
is_a: GO:0048308 ! organelle inheritance
[Term]
id: GO:0007030
name: Golgi organization
namespace: biological_process
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of the Golgi apparatus." [GOC:dph, GOC:jl, GOC:mah]
subset: goslim_pir
synonym: "Golgi apparatus organization" EXACT []
synonym: "Golgi organisation" EXACT []
synonym: "Golgi organization and biogenesis" RELATED [GOC:mah]
is_a: GO:0006996 ! organelle organization
relationship: part_of GO:0010256 ! endomembrane system organization
[Term]
id: GO:0090166
name: Golgi disassembly
namespace: biological_process
def: "A cellular process that results in the breakdown of a Golgi apparatus that contributes to Golgi inheritance." [GOC:ascb_2009, GOC:dph, GOC:tb]
synonym: "Golgi apparatus disassembly" EXACT []
is_a: GO:0007030 ! Golgi organization
is_a: GO:1903008 ! organelle disassembly
relationship: part_of GO:0048313 ! Golgi inheritance
[Term]
id: GO:1903008
name: organelle disassembly
namespace: biological_process
def: "The disaggregation of an organelle into its constituent components." [GO_REF:0000079, GOC:TermGenie]
synonym: "organelle degradation" EXACT []
is_a: GO:0006996 ! organelle organization
is_a: GO:0022411 ! cellular component disassembly
[Term]
id: GO:0006996
name: organelle organization
namespace: biological_process
alt_id: GO:1902589
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of an organelle within a cell. An organelle is an organized structure of distinctive morphology and function. Includes the nucleus, mitochondria, plastids, vacuoles, vesicles, ribosomes and the cytoskeleton. Excludes the plasma membrane." [GOC:mah]
subset: goslim_candida
subset: goslim_pir
synonym: "organelle organisation" EXACT []
synonym: "organelle organization and biogenesis" RELATED [GOC:dph, GOC:jl, GOC:mah]
synonym: "single organism organelle organization" EXACT [GOC:TermGenie]
synonym: "single-organism organelle organization" RELATED []
is_a: GO:0016043 ! cellular component organization
This is the code I used:
def parse_obo(file_path):
    terms = {}
    current_term = None
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                if current_term:
                    terms[current_term['id']] = current_term
                    current_term = None
            elif line.startswith('[Term]'):
                if current_term:
                    terms[current_term['id']] = current_term
                current_term = {'id': ''}
            elif current_term:
                parts = line.split(': ', 1)
                if len(parts) == 2:
                    current_term[parts[0]] = parts[1]
    return terms

def display_hierarchy(terms, term_id, indent=0):
    if term_id in terms:
        term = terms[term_id]
        print(' ' * indent + term_id)
        if 'is_a' in term:
            parent_ids = [parent.split()[1] for parent in term['is_a'] if len(parent.split()) > 1]
            for parent_id in parent_ids:
                display_hierarchy(terms, parent_id, indent + 4)
        if 'id' in term:
            child_ids = [child_id for child_id in terms if term_id in terms[child_id].get('is_a', [])]
            for child_id in child_ids:
                display_hierarchy(terms, child_id, indent + 4)

if __name__ == "__main__":
    file_path = 'go-basic_1.obo'
    terms = parse_obo(file_path)
    for term_id in terms:
        display_hierarchy(terms, term_id, indent=0)
This is the output I get:
GO:0000001
GO:0048308
GO:0048309
GO:0048313
GO:0007029
GO:0048309
GO:0048313
GO:0007030
GO:0090166
GO:1903008
GO:0090166
GO:0006996
GO:0048308
GO:0048309
GO:0048313
GO:0007029
GO:0007030
But I want a result like this:
GO:0006996
    GO:1903008
        GO:0090166
    GO:0048308
        GO:0048313
        GO:0000001
        GO:0048309
    GO:0007029
        GO:0048309
    GO:0007030
        GO:0048313
        GO:0090166
GO:0048311
    GO:0000001
I want to plot Gene Ontology results based on my genomic data, so this is where I am starting. Please help.
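There are a few things to take care of:
Since is_a can occur more than once per term, you need to collect its values in a list. Otherwise each new value overwrites the previous one and you keep only the last is_a encountered for a term. I would generalise this and give every key in a term a list value, except for id, which should occur only once per term.
To display the hierarchy, you will benefit from a parent-to-child relation rather than a child-to-parent one. So I would suggest a separate function that adds this reverse relation to the terms.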
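To make the first point concrete, here is a minimal sketch comparing plain assignment with collecting values in a list. The sample lines are copied from the GO:0000001 stanza in the question; the names `overwritten` and `collected` are made up for this illustration.

```python
# Lines copied from the GO:0000001 stanza in the question
lines = [
    "id: GO:0000001",
    "is_a: GO:0048308 ! organelle inheritance",
    "is_a: GO:0048311 ! mitochondrion distribution",
]

overwritten = {}  # plain assignment, as in the question's parser
collected = {}    # one list per key, as suggested above

for line in lines:
    key, value = line.split(": ", 1)
    overwritten[key] = value                      # a second is_a replaces the first
    collected.setdefault(key, []).append(value)   # a second is_a is appended

print(overwritten["is_a"])
print(collected["is_a"])
```

With plain assignment only the last is_a line survives, so GO:0048308 is lost as a parent of GO:0000001. The stored value is also a single string, so looping over term['is_a'] in the question's display_hierarchy walks through characters rather than parent entries.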
Put together, it could look like this:
def parse_obo(file_path):
    terms = {}
    current_term = {}
    isterm = False  # becomes True at the first "id:" line, so the file header is skipped
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            isterm = isterm or line.startswith('id:')
            if isterm and ": " in line:
                key, value = line.split(': ', 1)
                if key == "id":
                    # Start (or reuse) the term with this id
                    current_term = terms.setdefault(value, {})
                    current_term["id"] = value
                else:
                    # Every other key collects its values in a list,
                    # so repeated keys such as is_a are all kept
                    current_term.setdefault(key, []).append(value)
    return terms

def make_hierarchy(terms):
    # Start by assuming every term is a root; discard those that have a known parent
    roots = set(terms.keys())
    for term_id, term in terms.items():
        term.setdefault("children", [])
        if "is_a" in term:
            for is_a in term["is_a"]:
                # The value looks like "GO:0048308 ! organelle inheritance"
                parent = is_a.split()[0]
                # Parents that are not defined in the file are skipped
                if parent in terms:
                    terms[parent].setdefault("children", []).append(term)
                    roots.discard(term_id)
    return [term for term_id, term in terms.items() if term_id in roots]

def display_hierarchy(terms, indent=""):
    for term in terms:
        print(f"{indent}{term['id']}")
        # Indent each level by four more spaces
        display_hierarchy(term['children'], indent + "    ")

if __name__ == "__main__":
    file_path = 'go-basic_1.obo'
    terms = parse_obo(file_path)
    roots = make_hierarchy(terms)
    display_hierarchy(roots)
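One difference from the desired output in the question: make_hierarchy skips parents that have no stanza of their own in the file, so GO:0048311 would not be listed. Below is a minimal sketch, assuming you want such parents to appear as placeholder roots. The function name `make_hierarchy_with_placeholders` and the small hand-written `terms` dictionary are made up for this illustration; the dictionary only mimics the shape of what parse_obo returns.

```python
def make_hierarchy_with_placeholders(terms):
    """Like make_hierarchy, but keeps parents that have no stanza of their own."""
    # Iterate over a snapshot, because placeholder terms are added while looping
    for term in list(terms.values()):
        term.setdefault("children", [])
        for is_a in term.get("is_a", []):
            parent_id = is_a.split()[0]
            # Hypothetical placeholder: a term that only knows its id
            parent = terms.setdefault(parent_id, {"id": parent_id, "children": []})
            parent.setdefault("children", []).append(term)
    # Roots are the terms that list no is_a parent at all
    return [term for term in terms.values() if not term.get("is_a")]


def display_hierarchy(terms, indent=""):
    for term in terms:
        print(f"{indent}{term['id']}")
        display_hierarchy(term["children"], indent + "    ")


# Tiny hand-written stand-in for the result of parse_obo
terms = {
    "GO:0000001": {"id": "GO:0000001",
                   "is_a": ["GO:0048308 ! organelle inheritance",
                            "GO:0048311 ! mitochondrion distribution"]},
    "GO:0048308": {"id": "GO:0048308",
                   "is_a": ["GO:0006996 ! organelle organization"]},
}
roots = make_hierarchy_with_placeholders(terms)
display_hierarchy(roots)
```

On the full sample from the question this would also promote GO:0016043 and GO:0022411 to roots, since they are referenced but not defined there, so the top of the tree would differ from the listing in the question. Whether that is what you want depends on how complete your OBO subset is.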