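I have a large file containing data on interactions between residues in a protein. Each line represents one interaction: the first column gives the interaction type (e.g. sb, pc, vdw, hb), and the second and third columns give the residues involved.
Here is a sample of the data format: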
sb ASP-11 LYS-15
sb GLU-309 HIS-46
sb ASP-11 LYS-15
sb GLU-103 HIS-296
sb ARG-290 GLU-72
sb GLU-103 HIS-296
sb GLU-86 LYS-90
sb ASP-113 LYS-117
sb ASP-279 LYS-15
sb ASP-61 HIS-296
sb ARG-114 ASP-113
sb ASP-113 LYS-117
sb GLU-10 LYS-14
sb GLU-309 HIS-46
sb ASP-279 LYS-15
pc HIS-46 TYR-45
pc ARG-156 TYR-158
pc HIS-182 TRP-153
pc LYS-242 PHE-50
pc ARG-156 TYR-158
pc HIS-275 LYS-282
pc LYS-311 TRP-304
pc HIS-182 TRP-153
pc ARG-114 PHE-50
pc ARG-114 PHE-50
ps PHE-41 TYR-45
vdw ASP-270 LEU-273
vdw ASP-270 LYS-272
vdw ASP-270 LYS-272
vdw HIS-275 LEU-277
vdw LEU-273 LEU-277
vdw GLU-276 LYS-272
vdw GLU-276 LYS-272
I need to process this file to count, for each residue pair, how many interactions of each type occur, in the following format:
sb pc vdw hb residue1 residue2
2 0 0 0 ASP-11 LYS-15
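For "hb" interactions there are several subtypes (e.g. hbbb, hbbs), but I want all of them counted under "hb".
I'm looking for a Python or Bash script to do this. Specifically, the script should:
I tried the following script, but it doesn't work: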
# Define a dictionary to store the counts of each interaction type
interaction_counts = {}
# Open the file and read line by line
with open("file1.txt", "r") as file:
    for line in file:
        # Split the line into interaction type and residues
        interaction, residue1, residue2 = line.strip().split()
        # Construct a unique key for each interaction based on residue pair
        interaction_key = "-".join(sorted([residue1, residue2]))
        # Update the counts for this interaction type
        interaction_counts[interaction_key] = interaction_counts.get(interaction_key, {})
        interaction_counts[interaction_key][interaction] = interaction_counts[interaction_key].get(interaction, 0) + 1
# Print the header
print("sb\tpc\tvdw\thb\tresidue1\tresidue2")
# Iterate over the interaction counts dictionary and print formatted output
for interaction_key, counts in interaction_counts.items():
    residue1, residue2 = interaction_key.split("-")
    sb_count = counts.get("sb", 0)
    pc_count = counts.get("pc", 0)
    vdw_count = counts.get("vdw", 0)
    hb_count = counts.get("hb", 0)
    print(f"{sb_count}\t{pc_count}\t{vdw_count}\t{hb_count}\t{residue1}\t{residue2}")
It fails with the following error:
sb pc vdw hb residue1 residue2
Traceback (most recent call last):
  File "/home/count_interactions.py", line 22, in <module>
    residue1, residue2 = interaction_key.split("-")
    ^^^^^^^^^^^^^^^^^^
ValueError: too many values to unpack (expected 2)
I'm open to any suggestions or solutions in Python or Bash. Thanks!
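Your error comes from how the key is built and then parsed. Residue names such as ASP-11 already contain a hyphen, so joining the pair with "-" and later calling split("-") yields four pieces instead of two, which is exactly what the traceback reports. Separately, a blank or malformed line somewhere in the file would break the three-way unpacking when reading, so it is worth skipping those as well.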
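To make the failure concrete, here is a minimal sketch that reproduces it on a single residue pair taken from the sample data. The variable names are only illustrative, and the tuple key at the end is one possible alternative rather than the only fix.

```python
# Illustrative reproduction using one residue pair from the sample data.
residue1, residue2 = "ASP-11", "LYS-15"

# The question's key format: hyphen-joined, but the names contain hyphens too.
joined_key = "-".join(sorted([residue1, residue2]))
parts = joined_key.split("-")
print(joined_key, parts)  # ASP-11-LYS-15 ['ASP', '11', 'LYS', '15']

try:
    r1, r2 = parts  # four items for two targets raises ValueError
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# A tuple key keeps the two names separate, so nothing has to be re-split.
tuple_key = tuple(sorted([residue1, residue2]))
r1, r2 = tuple_key
print(r1, r2)  # ASP-11 LYS-15
```

Any key that stores the two names as separate items (a tuple, a frozenset) avoids the problem; a joined string only works with a separator that can never appear in a residue name.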
This solves the problem for your sample data:
import collections

# Map each unordered residue pair to a Counter of interaction types.
interactions = collections.defaultdict(collections.Counter)

with open("file1.txt", "r") as file:
    for line in file:
        line = line.strip().split()
        # Skip blank or malformed lines.
        if len(line) == 3:
            # A frozenset makes (A, B) and (B, A) the same key.
            pair = frozenset(line[1:])
            interactions[pair][line[0]] += 1

interaction_kinds = ["sb", "pc", "vdw", "hb"]
print(*interaction_kinds, "residue1", "residue2", sep="\t")
for pair, counts in interactions.items():
    r1, r2 = sorted(pair)
    counts = [counts.get(x, 0) for x in interaction_kinds]
    print(*counts, r1, r2, sep="\t")
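The question also asks for hbbb, hbbs and similar labels to be counted under "hb", which the sample data does not exercise. A rough sketch of one way to do that follows. It assumes that every label starting with "hb" is a hydrogen-bond subtype, and the helper names (normalize_kind, count_interactions, format_rows) are made up for this example. Labels outside the four columns, such as the "ps" line in the sample, are still counted internally but do not appear in the output.

```python
import collections

KINDS = ["sb", "pc", "vdw", "hb"]  # output columns, as in the question


def normalize_kind(kind):
    # Assumption: every label starting with "hb" (hbbb, hbbs, ...) is a hydrogen bond.
    return "hb" if kind.startswith("hb") else kind


def count_interactions(lines):
    """Return {(residue1, residue2): Counter of normalized interaction types}."""
    table = collections.defaultdict(collections.Counter)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:  # skip blank or malformed lines
            continue
        kind, res_a, res_b = fields
        pair = tuple(sorted((res_a, res_b)))  # order-independent key
        table[pair][normalize_kind(kind)] += 1
    return table


def format_rows(table):
    """Yield tab-separated rows: one count per kind, then the two residues."""
    for (res_a, res_b), counts in table.items():
        yield "\t".join([*(str(counts[k]) for k in KINDS), res_a, res_b])


# Made-up lines in the question's format; the hb rows are purely illustrative.
sample = [
    "sb ASP-11 LYS-15",
    "sb ASP-11 LYS-15",
    "hbbb ASP-11 LYS-15",
    "hbbs LYS-15 ASP-11",
    "",
    "pc HIS-46 TYR-45",
]
rows = list(format_rows(count_interactions(sample)))
print("\t".join(KINDS + ["residue1", "residue2"]))
print("\n".join(rows))
```

To run this on the real file, pass an open file handle instead of the list, for example count_interactions(fh) inside a with open("file1.txt") as fh: block, since iterating over a file yields its lines.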