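I am trying to use the REST module from Biopython's Bio.KEGG to query the KEGG database for the names and formulas of certain compounds, using their KEGG compound IDs (CIDs); for example, C00001 is water, C00123 is leucine, and so on: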
from Bio.KEGG import REST
from Bio.KEGG import Compound

def cpd_decoder(cid): #gets the compound name and formula from KEGG
    if "C" in cid:
        cid="cpd:"+cid
        kegg_entry=REST.kegg_get(cid)
        for record in Compound.parse(kegg_entry):
            cid_name=record.name[0]
            cid_formula=record.formula
            return cid_name,cid_formula

cid="C00123" #example CID; this one's for leucine
if cpd_decoder(cid) !=None:
    compound,formula=cpd_decoder(cid)
However, even though Biopython uses KEGG's own API, I almost always get the following error:
    if cpd_decoder(cid) !=None:
  File "/media/tessa/Storage/Alien_Earths/Network_expansion/network expansion test 2.py", line 27, in cpd_decoder
    kegg_entry=REST.kegg_get(cid)
  File "/home/tessa/.local/lib/python3.10/site-packages/Bio/KEGG/REST.py", line 208, in kegg_get
    resp = _q("get", dbentries)
  File "/home/tessa/.local/lib/python3.10/site-packages/Bio/KEGG/REST.py", line 44, in _q
    resp = urlopen(URL % (args))
  File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I am wondering whether, because I am processing a large number of CIDs, KEGG now thinks I am a bot and has blocked me. Is there a way around this?
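As of today I am getting the same result with a script very similar to yours. When I ran the same script a month ago, this did not happen. It essentially goes through a list of about 250 KO numbers to get the reaction IDs associated with each one, and then retrieves the stoichiometry of each reaction to build a reaction matrix automatically.
I found that everything was fine until about the 240th KO, but then I started getting "403 Forbidden" errors. The error persisted when I entered the URL manually in my browser, and it went away when I switched to another network. I then retried and got the same result. So it looks like KEGG has recently started blocking users who do this sort of thing.
I found a fix, and it also sped up the code considerably. I had not used Biopython before; I used Python's requests package instead. Perhaps you can do the same in Biopython.
Instead of making a separate connection request for each KO number (or, in your case, each compound ID), you can put all the KO numbers into a single request. So, rather than requesting: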
https://rest.kegg.jp/link/reaction/ko:K00012
https://rest.kegg.jp/link/reaction/ko:K12450
and so on,
you can do this:
https://rest.kegg.jp/link/reaction/ko:K00012+K12450+
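This also runs faster, because you only have to wait for KEGG to respond once. After that you just need to parse the result (Biopython can probably already do that).
Here is my code: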
import requests

#Replace by your own query
KO_numbers = ["K00012", "K12450", "K21379"]

#Define the start of the URL, replace with the URL for your own need
url = "https://rest.kegg.jp/link/reaction/ko:"

#For each KO number in the list: add it to the URL, and put a "+" in between
for KO in KO_numbers:
    url += KO
    url += "+"

#Do the actual request, raise an error if something is wrong
response = requests.get(url)
if response.status_code != 200:
    raise ConnectionError("Cannot connect to KEGG API")

#Here I just print the response, but from here you need to parse it to do what you want to do with the data
print(response.text)
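For the compound case in the question, the same batching idea might look roughly like the sketch below. It is only a sketch under a few assumptions: that KEGG's "get" operation accepts at most 10 entries per request, that Biopython's REST.kegg_get takes a list of up to 10 IDs and joins them with "+" itself, and that a link response is two tab-separated columns per line. The helper names (chunked, build_link_url, parse_link_response, fetch_compounds) and the one-second pause are mine, not part of Biopython or KEGG.

```python
import time
from collections import defaultdict

# Assumption: KEGG's "get" operation accepts at most 10 entries per request.
KEGG_GET_LIMIT = 10


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_link_url(ko_numbers, base="https://rest.kegg.jp/link/reaction/ko:"):
    """Build one link URL for several KO numbers, without a trailing "+"."""
    return base + "+".join(ko_numbers)


def parse_link_response(text):
    """Group a two-column, tab-separated response by its first column."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        source, target = parts
        grouped[source].append(target)
    return dict(grouped)


def fetch_compounds(cids, pause=1.0):
    """Return {entry: (name, formula)} for many compound IDs, 10 per request.

    Hypothetical helper; the pause between requests is a guess at being
    polite to the server, not a documented KEGG limit.
    """
    # Imported here so the helpers above work without Biopython installed.
    from Bio.KEGG import REST, Compound

    results = {}
    for batch in chunked(cids, KEGG_GET_LIMIT):
        handle = REST.kegg_get(["cpd:" + cid for cid in batch])
        for record in Compound.parse(handle):
            name = record.name[0] if record.name else ""
            results[record.entry] = (name, record.formula)
        time.sleep(pause)
    return results
```

Calling fetch_compounds(["C00001", "C00123"]) would then make a single request instead of two, and parse_link_response(response.text) could replace the final print in the code above.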