Counting and displaying interactions between residues in a large file with Python or Bash

Question

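I have a large file containing data on interactions between residues in a protein. Each line represents one interaction: the first column gives the interaction type (e.g. sb, pc, vdw, hb), and the second and third columns give the residues involved.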

Here is a sample of the data format:

sb ASP-11 LYS-15
sb GLU-309 HIS-46
sb ASP-11 LYS-15
sb GLU-103 HIS-296
sb ARG-290 GLU-72
sb GLU-103 HIS-296
sb GLU-86 LYS-90
sb ASP-113 LYS-117
sb ASP-279 LYS-15
sb ASP-61 HIS-296
sb ARG-114 ASP-113
sb ASP-113 LYS-117
sb GLU-10 LYS-14
sb GLU-309 HIS-46
sb ASP-279 LYS-15
pc HIS-46 TYR-45
pc ARG-156 TYR-158
pc HIS-182 TRP-153
pc LYS-242 PHE-50
pc ARG-156 TYR-158
pc HIS-275 LYS-282
pc LYS-311 TRP-304
pc HIS-182 TRP-153
pc ARG-114 PHE-50
pc ARG-114 PHE-50
ps PHE-41 TYR-45
vdw ASP-270 LEU-273
vdw ASP-270 LYS-272
vdw ASP-270 LYS-272
vdw HIS-275 LEU-277
vdw LEU-273 LEU-277
vdw GLU-276 LYS-272
vdw GLU-276 LYS-272

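I need to process this file to count the number of interactions of each type together with the residues involved, in the following format: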

sb    pc    vdw    hb    residue1    residue2
2     0     0      0     ASP-11      LYS-15

There are several kinds of "hb" interaction (e.g. hbbb, hbbs), but I want to count them all under "hb".

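I am looking for a Python or Bash script to do this. Specifically, the script should:

  1. Read the data from the file.
  2. Count the occurrences of each interaction type.
  3. Display the counts and the interacting residues in the format above.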

I tried the following script, but it does not work:

# Define a dictionary to store the counts of each interaction type
interaction_counts = {}

# Open the file and read line by line
with open("file1.txt", "r") as file:
    for line in file:
        # Split the line into interaction type and residues
        interaction, residue1, residue2 = line.strip().split()

        # Construct a unique key for each interaction based on residue pair
        interaction_key = "-".join(sorted([residue1, residue2]))

        # Update the counts for this interaction type
        interaction_counts[interaction_key] = interaction_counts.get(interaction_key, {})
        interaction_counts[interaction_key][interaction] = interaction_counts[interaction_key].get(interaction, 0) + 1

# Print the header
print("sb\tpc\tvdw\thb\tresidue1\tresidue2")

# Iterate over the interaction counts dictionary and print formatted output
for interaction_key, counts in interaction_counts.items():
    residue1, residue2 = interaction_key.split("-")
    sb_count = counts.get("sb", 0)
    pc_count = counts.get("pc", 0)
    vdw_count = counts.get("vdw", 0)
    hb_count = counts.get("hb", 0)
    print(f"{sb_count}\t{pc_count}\t{vdw_count}\t{hb_count}\t{residue1}\t{residue2}")

It fails with the following error:

sb  pc  vdw hb  residue1    residue2
Traceback (most recent call last):
  File "/home/count_interactions.py", line 22, in <module>
    residue1, residue2 = interaction_key.split("-")
    ^^^^^^^^^^^^^^^^^^
ValueError: too many values to unpack (expected 2)

I am open to any suggestions or solutions in Python or Bash. Thanks!

python bash bioinformatics
1 Answer

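The error comes from how the key is split back into residues, not from reading the file. The residue names themselves contain a hyphen (ASP-11, LYS-15), so a key built with "-".join(...) looks like ASP-11-LYS-15, and interaction_key.split("-") returns four pieces instead of two. Blank or malformed lines are a separate risk: they would fail earlier, at the three-way unpacking of line.strip().split().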
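
To see what the traceback is reporting, here is a minimal check on one key from the sample data. The "|" separator is only an assumption for illustration; any character that never appears in a residue name would do.

```python
# One key as built by the question's script: residue names joined with "-".
residue1, residue2 = "ASP-11", "LYS-15"
interaction_key = "-".join(sorted([residue1, residue2]))

parts = interaction_key.split("-")
print(parts)  # ['ASP', '11', 'LYS', '15']

# Four pieces cannot be unpacked into two names, hence the ValueError.
# A separator that never occurs in a residue name round-trips cleanly.
SEP = "|"  # assumption: "|" does not appear in residue names
safe_key = SEP.join(sorted([residue1, residue2]))
r1, r2 = safe_key.split(SEP)
print(r1, r2)  # ASP-11 LYS-15
```

Using a tuple or set of the two names as the dictionary key avoids the join/split round trip altogether.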

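The following handles your sample data by keeping each residue pair as a set instead of a joined string, and by skipping lines that do not have exactly three fields. Note that it counts hbbb, hbbs, etc. as separate types, so those variants do not show up in the hb column unless the type is normalized first: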

import collections

# Map each unordered residue pair to a Counter of interaction types
interactions = collections.defaultdict(collections.Counter)

with open("file1.txt", "r") as file:
    for line in file:
        fields = line.strip().split()
        # Skip blank or malformed lines
        if len(fields) == 3:
            pair = frozenset(fields[1:])
            interactions[pair][fields[0]] += 1

interaction_kinds = ["sb", "pc", "vdw", "hb"]
print(*interaction_kinds, "residue1", "residue2", sep="\t")
for pair, counts in interactions.items():
    # Sort so the two residues print in a stable order
    r1, r2 = sorted(pair)
    row = [counts.get(x, 0) for x in interaction_kinds]
    print(*row, r1, r2, sep="\t")
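
The question also asks for hbbb, hbbs and similar types to be counted under "hb". One possible way to do that is sketched below, assuming every hydrogen-bond variant starts with the prefix "hb". The helper names (normalize, count_interactions, format_rows) and the inline sample lines, including the hbbb/hbbs rows, are made up for illustration and are not from the original data.

```python
import collections
import io

KINDS = ["sb", "pc", "vdw", "hb"]


def normalize(kind):
    """Map hbbb, hbbs, ... to "hb"; leave other types unchanged.

    Assumption: every hydrogen-bond variant starts with "hb".
    """
    return "hb" if kind.startswith("hb") else kind


def count_interactions(lines):
    """Return {(residue1, residue2): Counter of normalized types}."""
    table = collections.defaultdict(collections.Counter)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        kind, a, b = fields
        # Sorted tuple: "A B" and "B A" count as the same pair
        table[tuple(sorted((a, b)))][normalize(kind)] += 1
    return table


def format_rows(table):
    """Build tab-separated rows, header first."""
    rows = ["\t".join(KINDS + ["residue1", "residue2"])]
    for (r1, r2), counts in table.items():
        rows.append("\t".join([*(str(counts[k]) for k in KINDS), r1, r2]))
    return rows


# Hypothetical sample; the hbbb/hbbs lines are invented for this sketch
sample = io.StringIO(
    "sb ASP-11 LYS-15\n"
    "sb ASP-11 LYS-15\n"
    "hbbb ASP-11 LYS-15\n"
    "hbbs LYS-15 ASP-11\n"
    "\n"
    "pc HIS-46 TYR-45\n"
)
table = count_interactions(sample)
print("\n".join(format_rows(table)))
```

To run it on the real file, pass the open file object instead of the sample, e.g. `with open("file1.txt") as fh: table = count_interactions(fh)`. Types that have no column in KINDS (such as the ps line in the sample data) are still counted but not printed; add them to KINDS if they should appear.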