从层次结构中提取特定级别的 GO ID(例如 2 级 GO ID)并与带注释的 GO ID 进行匹配

问题描述 投票:0回答:1

我正在编写一个 Python 脚本来匹配文件中带注释的 GO ID,然后从层次结构中提取 2 级基因本体 (GO) ID。但是,我在提取正确的 2 级 GO ID 并显示相应信息时遇到问题。这是我到目前为止所拥有的: 我使用 go-basic.obo 文件对 file.txt 文件中带注释的 GO ID 及其频率(go_id 总数)进行分类。 我期望在 Excel 文件中输出以下列

GO ID 本身。 GO 层次结构中相应的术语名称。 层次结构中的 2 级 GO ID。 file.txt 中 GO ID 的频率。

我尝试使用此代码,但它无法正确提取 corroespong 注释的 go id 的第 2 级的 go id, 这是代码 将 pandas 导入为 pd

从 go-basic.obo 加载你的 GO 层次结构

def parse_obo(file_path):
    terms = {}
    current_term = {}
    isterm = False
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            isterm = isterm or line.startswith('id:')
            if isterm and ": " in line:
                key, value = line.split(': ', 1)
                if key == "id":
                    current_term = terms.setdefault(value, {})
                    current_term["id"] = value
                else:
                    current_term.setdefault(key, []).append(value)
    return terms

def make_hierarchy(terms):
    for term in list(terms.values()):
        term.setdefault("children", [])
        if "is_a" in term:
            for is_a in term["is_a"]:
                parent = is_a.split()[0]
                terms.setdefault(parent, { 'id': parent }).setdefault("children", []).append(term)
    return [term for term in terms.values() if "is_a" not in term]

def calculate_go_frequency(file_path):
    go_frequency = {}
    with open(file_path, 'r') as f:
        for line in f:
            _, go_id = line.strip().split('\t')
            go_frequency[go_id] = go_frequency.get(go_id, 0) + 1
    return go_frequency

def find_level_2_go_id(go_id, terms, hierarchy):
    for term in hierarchy:
        if term["id"] == go_id:
            return None
        
        level_2_id = None
        for child in term.get("children", []):
            if child["id"] == go_id:
                level_2_id = term["id"]
                break
            for sub_child in child.get("children", []):
                if sub_child["id"] == go_id:
                    level_2_id = child["id"]
                    break
            if level_2_id:
                break
        
        if level_2_id:
            return level_2_id
        else:
            level_2_id = find_level_2_go_id(go_id, terms, term.get("children", []))
            if level_2_id:
                return level_2_id
    
    return None

if __name__ == "__main__":
    file_path = 'go-basic.obo'
    terms = parse_obo(file_path)
    roots = make_hierarchy(terms)
    
    # Read and process the file.txt file
    unique_file_path = 'file.txt'
    unique_go_frequency = calculate_go_frequency(unique_file_path)
    
    # Create a list of dictionaries for the data
    data = []
    for go_id, frequency in unique_go_frequency.items():
        term = terms.get(go_id, {})
        term_name = term.get("name", "unknown_term")
        
        # Get the Level 2 GO ID based on the hierarchy
        level_2_id = find_level_2_go_id(go_id, terms, roots)
        
        data.append({
            "GO ID": go_id,
            "Term": term_name,
            "Level 2 GO ID": level_2_id,
            "Frequency": frequency
        })
    
    # Create a DataFrame from the data
    df = pd.DataFrame(data)
    
    # Save the DataFrame to an Excel file
    output_excel = 'gene_counts_with_levels.xlsx'
    df.to_excel(output_excel, index=False)

    print(f"Data saved to {output_excel}")

请帮忙 这个问题也是继上一个问题之后 使用 Python 检索基因本体层次结构 从这个示例文件中,级别将像这样分类,如下所示

level1 GO:0048311
  level2 GO:0000001
level1 GO:0022411
  level2 GO:1903008
    level3 GO:0090166
level1 GO:0016043
  level2 GO:0006996
    level3 GO:0048308
      level4 GO:0000001
      level4 GO:0048309
      level4 GO:0048313
    level3 GO:0007029
      level4 GO:0048309
    level3 GO:0007030
      level4 GO:0048313
      level4 GO:0090166
    level3 GO:1903008
      level4 GO:0090166

这是我的文件

id  GO_ID
OGOAOAJO_00443  GO:0006996
OGOAOAJO_00021  GO:0048313
OGOAOAJO_00021  GO:0048308
OGOAOAJO_00830  GO:0048308
OGOAOAJO_00830  GO:0048308
OGOAOAJO_02897  GO:0000001
OGOAOAJO_02897  GO:0000001
OGOAOAJO_00517  GO:0000001
OGOAOAJO_01362  GO:0000001
OGOAOAJO_02172  GO:0048309
OGOAOAJO_02172  GO:0048313
OGOAOAJO_03684  GO:0007029
OGOAOAJO_03684  GO:0000166
OGOAOAJO_03684  GO:0048309
OGOAOAJO_03684  GO:0007030
OGOAOAJO_01125  GO:0048313
OGOAOAJO_01125  GO:0048313
OGOAOAJO_00657  GO:0090166
OGOAOAJO_00223  GO:0090166
OGOAOAJO_00223  GO:0090166
OGOAOAJO_03140  GO:1903008
OGOAOAJO_03140  GO:0090166
OGOAOAJO_03647  GO:0090166
OGOAOAJO_03407  GO:0090166
OGOAOAJO_00603  GO:0048311
OGOAOAJO_00603  GO:0048311
OGOAOAJO_01401  GO:0000001
OGOAOAJO_00603  GO:0000001

我的输出就像这样

 GO ID  Term    Level 2 GO ID   Frequency
GO:0006996  ['organelle organization']  GO:0016043  1
GO:0048313  ['Golgi inheritance']   GO:0048308  4
GO:0048308  ['organelle inheritance']   GO:0006996  3
GO:0000001  ['mitochondrion inheritance']   GO:0048311  6
GO:0046872  unknown_term        1
GO:0048309  ['endoplasmic reticulum inheritance']   GO:0048308  2
GO:0003824  unknown_term        1
GO:0007029  ['endoplasmic reticulum organization']  GO:0006996  1
GO:0007030  ['Golgi organization']  GO:0006996  1
GO:0090166  ['Golgi disassembly']   GO:1903008  6
GO:1903008  ['organelle disassembly']   GO:0022411  1
GO:0048311  unknown_term        2

但问题是,在级别 2 中,GO ID go:id 并不像级别 2 中那样出现,因为可以看出,第一行级别 2 GO ID 作为级别 2 应该是 GO:0006996,但它是 GO:0016043,它是1级,

正确的输出是

GO ID  Term    Level 2 GO ID   Frequency
GO:0006996  ['organelle organization']  GO:0006996  1
GO:0048313  ['Golgi inheritance']   GO:0006996  4
GO:0048308  ['organelle inheritance']   GO:0006996  3
GO:0000001  ['mitochondrion inheritance']   GO:0000001  6
GO:0000001  ['mitochondrion inheritance']   GO:0006996  6
GO:0048309  ['endoplasmic reticulum inheritance']   GO:0006996  2
GO:0007029  ['endoplasmic reticulum organization']  GO:0006996  1
GO:0007030  ['Golgi organization']  GO:0006996  1
GO:0090166  ['Golgi disassembly']   GO:1903008  6
GO:0090166  ['Golgi disassembly']   GO:0006996  6
GO:1903008  ['organelle disassembly']   GO:1903008  1
GO:0048311  unknown_term        2
python-3.x recursion bioinformatics ontology-mapping
1个回答
0
投票

这段代码:

        for child in term.get("children", []):
            if child["id"] == go_id:
                level_2_id = term["id"]
                break

...当

level2_id
是顶级术语时,也可以设置
term

我建议采用一种不同的方法并编写一个函数,将path返回给给定的id,即一个ID列表,其中列表中的最后一个是搜索到的id。这是一个更通用的函数,可用于查找层次结构中任何级别的 ID。

因此将

find_level_2_go_id
替换为以下函数:

def find_path_go_id(go_id, terms, hierarchy):
    for term in hierarchy:
        if term["id"] == go_id:
            return [term["id"]]
        path = find_path_go_id(go_id, terms, term.get("children", []))
        if path:
            return [term["id"], *path]

并这样称呼它:

    for go_id, frequency in unique_go_frequency.items():
        term = terms.get(go_id, {})
        term_name = term.get("name", "unknown_term")
        
        path = find_path_go_id(go_id, terms, roots)
        # Extract the Level 2 GO ID based on the path in the hierarchy
        level_2_id = path[1] if path and len(path) > 1 else None
        
        data.append({
            "GO ID": go_id,
            "Term": term_name,
            "Level 2 GO ID": level_2_id,
            "Frequency": frequency
        })
© www.soinside.com 2019 - 2024. All rights reserved.