使用biopython对应于蛋白质序列中缺失残基的缺口字符

问题描述 投票:0回答:1

我从pdb中提取序列而没有遗漏残基,但我需要一个具有缺口(短划线字符)替换缺失残基的序列。我该怎么做?

谢谢

biopython pdb
1个回答
0
投票

我通过解析pdb的ATOM记录以获取现有残差并通过解析REMARK 465以获取缺失残差来手动完成:

pname = '4g5j.pdb' #downloaded file from PDB

fin = open(pname,'r')
content = fin.readlines()
fin.close()

res = []
mis_res = []

print('CHAIN A will be used for ALIGNMENT')
het_chain = 'A'

for i,line in enumerate(content):

  if line[0:4] == 'ATOM':

    split = [line[:6], line[6:11], line[12:16], line[17:20], line[21], line[22:26], line[30:38], line[38:46], line[46:54]]

    if split[4] != het_chain:
      continue
    res.append(int(split[5]))

for i,line in enumerate(content):

  if line[0:10] == 'REMARK 465':

    split = [line[:10], line[19], line[21:26]]

    if split[1] == het_chain:

      mis_res.append(int(split[2]))

resindexes = sorted(list(set(sorted(res))))

missed_resindexes = sorted(list(set(mis_res)))
missed_resindexes = [el for el in missed_resindexes if el not in resindexes]

all_indexes = sorted(resindexes+missed_resindexes)
print(len(all_indexes))

#here you should have your real sequence!
real_seq = 'GSMGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQQG'

missed_seq = real_seq[:]

for i,el in enumerate(real_seq):
  if all_indexes[i] in missed_resindexes:
    print(i)
    missed_seq = missed_seq[:i]+'-'+missed_seq[i+1:]

print(missed_seq)

OUTPUT:--- GEAPNQALLRILKETEFKKIKVLGS ---- TVYKGLWIPEGEKVKIPVAIKE ---------- KEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEY ------

© www.soinside.com 2019 - 2024. All rights reserved.