我正在编写一个 Python 脚本来匹配文件中带注释的 GO ID,然后从层次结构中提取 2 级基因本体 (GO) ID。但是,我在提取正确的 2 级 GO ID 并显示相应信息时遇到问题。这是我到目前为止所拥有的: 我使用 go-basic.obo 文件对 file.txt 文件中带注释的 GO ID 及其频率(go_id 总数)进行分类。 我期望在 Excel 文件中输出以下列
GO ID 本身。 GO 层次结构中相应的术语名称。 层次结构中的 2 级 GO ID。 file.txt 中 GO ID 的频率。
我尝试使用此代码,但它无法正确提取 corroespong 注释的 go id 的第 2 级的 go id, 这是代码 将 pandas 导入为 pd
def parse_obo(file_path):
terms = {}
current_term = {}
isterm = False
with open(file_path, 'r') as f:
for line in f:
line = line.strip()
isterm = isterm or line.startswith('id:')
if isterm and ": " in line:
key, value = line.split(': ', 1)
if key == "id":
current_term = terms.setdefault(value, {})
current_term["id"] = value
else:
current_term.setdefault(key, []).append(value)
return terms
def make_hierarchy(terms):
for term in list(terms.values()):
term.setdefault("children", [])
if "is_a" in term:
for is_a in term["is_a"]:
parent = is_a.split()[0]
terms.setdefault(parent, { 'id': parent }).setdefault("children", []).append(term)
return [term for term in terms.values() if "is_a" not in term]
def calculate_go_frequency(file_path):
go_frequency = {}
with open(file_path, 'r') as f:
for line in f:
_, go_id = line.strip().split('\t')
go_frequency[go_id] = go_frequency.get(go_id, 0) + 1
return go_frequency
def find_level_2_go_id(go_id, terms, hierarchy):
for term in hierarchy:
if term["id"] == go_id:
return None
level_2_id = None
for child in term.get("children", []):
if child["id"] == go_id:
level_2_id = term["id"]
break
for sub_child in child.get("children", []):
if sub_child["id"] == go_id:
level_2_id = child["id"]
break
if level_2_id:
break
if level_2_id:
return level_2_id
else:
level_2_id = find_level_2_go_id(go_id, terms, term.get("children", []))
if level_2_id:
return level_2_id
return None
if __name__ == "__main__":
file_path = 'go-basic.obo'
terms = parse_obo(file_path)
roots = make_hierarchy(terms)
# Read and process the file.txt file
unique_file_path = 'file.txt'
unique_go_frequency = calculate_go_frequency(unique_file_path)
# Create a list of dictionaries for the data
data = []
for go_id, frequency in unique_go_frequency.items():
term = terms.get(go_id, {})
term_name = term.get("name", "unknown_term")
# Get the Level 2 GO ID based on the hierarchy
level_2_id = find_level_2_go_id(go_id, terms, roots)
data.append({
"GO ID": go_id,
"Term": term_name,
"Level 2 GO ID": level_2_id,
"Frequency": frequency
})
# Create a DataFrame from the data
df = pd.DataFrame(data)
# Save the DataFrame to an Excel file
output_excel = 'gene_counts_with_levels.xlsx'
df.to_excel(output_excel, index=False)
print(f"Data saved to {output_excel}")
请帮忙 这个问题也是继上一个问题之后 使用 Python 检索基因本体层次结构 从这个示例文件中,级别将像这样分类,如下所示
level1 GO:0048311
level2 GO:0000001
level1 GO:0022411
level2 GO:1903008
level3 GO:0090166
level1 GO:0016043
level2 GO:0006996
level3 GO:0048308
level4 GO:0000001
level4 GO:0048309
level4 GO:0048313
level3 GO:0007029
level4 GO:0048309
level3 GO:0007030
level4 GO:0048313
level4 GO:0090166
level3 GO:1903008
level4 GO:0090166
这是我的文件
id GO_ID
OGOAOAJO_00443 GO:0006996
OGOAOAJO_00021 GO:0048313
OGOAOAJO_00021 GO:0048308
OGOAOAJO_00830 GO:0048308
OGOAOAJO_00830 GO:0048308
OGOAOAJO_02897 GO:0000001
OGOAOAJO_02897 GO:0000001
OGOAOAJO_00517 GO:0000001
OGOAOAJO_01362 GO:0000001
OGOAOAJO_02172 GO:0048309
OGOAOAJO_02172 GO:0048313
OGOAOAJO_03684 GO:0007029
OGOAOAJO_03684 GO:0000166
OGOAOAJO_03684 GO:0048309
OGOAOAJO_03684 GO:0007030
OGOAOAJO_01125 GO:0048313
OGOAOAJO_01125 GO:0048313
OGOAOAJO_00657 GO:0090166
OGOAOAJO_00223 GO:0090166
OGOAOAJO_00223 GO:0090166
OGOAOAJO_03140 GO:1903008
OGOAOAJO_03140 GO:0090166
OGOAOAJO_03647 GO:0090166
OGOAOAJO_03407 GO:0090166
OGOAOAJO_00603 GO:0048311
OGOAOAJO_00603 GO:0048311
OGOAOAJO_01401 GO:0000001
OGOAOAJO_00603 GO:0000001
我的输出就像这样
GO ID Term Level 2 GO ID Frequency
GO:0006996 ['organelle organization'] GO:0016043 1
GO:0048313 ['Golgi inheritance'] GO:0048308 4
GO:0048308 ['organelle inheritance'] GO:0006996 3
GO:0000001 ['mitochondrion inheritance'] GO:0048311 6
GO:0046872 unknown_term 1
GO:0048309 ['endoplasmic reticulum inheritance'] GO:0048308 2
GO:0003824 unknown_term 1
GO:0007029 ['endoplasmic reticulum organization'] GO:0006996 1
GO:0007030 ['Golgi organization'] GO:0006996 1
GO:0090166 ['Golgi disassembly'] GO:1903008 6
GO:1903008 ['organelle disassembly'] GO:0022411 1
GO:0048311 unknown_term 2
但问题是,在级别 2 中,GO ID go:id 并不像级别 2 中那样出现,因为可以看出,第一行级别 2 GO ID 作为级别 2 应该是 GO:0006996,但它是 GO:0016043,它是1级,
正确的输出是
GO ID Term Level 2 GO ID Frequency
GO:0006996 ['organelle organization'] GO:0006996 1
GO:0048313 ['Golgi inheritance'] GO:0006996 4
GO:0048308 ['organelle inheritance'] GO:0006996 3
GO:0000001 ['mitochondrion inheritance'] GO:0000001 6
GO:0000001 ['mitochondrion inheritance'] GO:0006996 6
GO:0048309 ['endoplasmic reticulum inheritance'] GO:0006996 2
GO:0007029 ['endoplasmic reticulum organization'] GO:0006996 1
GO:0007030 ['Golgi organization'] GO:0006996 1
GO:0090166 ['Golgi disassembly'] GO:1903008 6
GO:0090166 ['Golgi disassembly'] GO:0006996 6
GO:1903008 ['organelle disassembly'] GO:1903008 1
GO:0048311 unknown_term 2
这段代码:
for child in term.get("children", []):
if child["id"] == go_id:
level_2_id = term["id"]
break
...当
level2_id
是顶级术语时,也可以设置 term
。
我建议采用一种不同的方法并编写一个函数,将path返回给给定的id,即一个ID列表,其中列表中的最后一个是搜索到的id。这是一个更通用的函数,可用于查找层次结构中任何级别的 ID。
因此将
find_level_2_go_id
替换为以下函数:
def find_path_go_id(go_id, terms, hierarchy):
for term in hierarchy:
if term["id"] == go_id:
return [term["id"]]
path = find_path_go_id(go_id, terms, term.get("children", []))
if path:
return [term["id"], *path]
并这样称呼它:
for go_id, frequency in unique_go_frequency.items():
term = terms.get(go_id, {})
term_name = term.get("name", "unknown_term")
path = find_path_go_id(go_id, terms, roots)
# Extract the Level 2 GO ID based on the path in the hierarchy
level_2_id = path[1] if path and len(path) > 1 else None
data.append({
"GO ID": go_id,
"Term": term_name,
"Level 2 GO ID": level_2_id,
"Frequency": frequency
})