I am new to web scraping and am trying to reach the PDB web page for each PDB ID returned by a search, for example:
url= https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_entity_source_organism.rcsb_gene_name.value%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22MCF3%22%7D%2C%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22node_id%22%3A0%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22pager%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C%22request_info%22%3A%7B%22src%22%3A%22ui%22%2C%22query_id%22%3A%22c75bbf18a058a812f5384f297528d4b6%22%7D%7D
From this page I am trying to get the IDs, such as "3ZBF" and "4UXL".
I wrote the following code:
import requests
from bs4 import BeautifulSoup
url='https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_entity_source_organism.rcsb_gene_name.value%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22MCF3%22%7D%2C%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22node_id%22%3A0%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22pager%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C%22request_info%22%3A%7B%22src%22%3A%22ui%22%2C%22query_id%22%3A%22c75bbf18a058a812f5384f297528d4b6%22%7D%7D'
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")
tex_tag= soup.find('div',{"class":"container","id":"maincontentcontainer"})
new_line=tex_tag.find("div",{"id":"search-container"})
print(new_line)
print(tex_tag.prettify())
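Here I cannot see the content inside <div id="search-container">. I inspected the page's HTML in the browser, and the PDB IDs are inside <div id="search-container">.
Could you suggest a solution, or give me some insight into how to solve this?
Thanks in advance.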
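The site fetches the results from an API before rendering them, so they are not in the HTML that requests downloads. The results come from this URL:
POST https://www.rcsb.org/search/gql
which takes the list of identifiers as JSON input: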
import requests
ids = ["3ZBF","4UXL"]
r = requests.post(
    "https://www.rcsb.org/search/gql",
    json={
        "attributes": None,
        "identifiers": ids,
        "returnType": "entry",
        "report": "search_summary",
    },
)
print(r.json())
The following script prints all the identifiers for your search URL:
import json
import requests
import urllib.parse
url = 'https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_entity_source_organism.rcsb_gene_name.value%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22MCF3%22%7D%2C%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22node_id%22%3A0%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22pager%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C%22request_info%22%3A%7B%22src%22%3A%22ui%22%2C%22query_id%22%3A%22c75bbf18a058a812f5384f297528d4b6%22%7D%7D'
search_link = 'https://www.rcsb.org/search/data'
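# the search query is the URL-encoded JSON that follows "request=" in the URL
json_data = url.split('=')[-1]
json_data = json.loads(urllib.parse.unquote(json_data))
# POST the decoded query to the search endpoint and parse the JSON response
data = requests.post(search_link, json=json_data).json()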
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for r in data['result_set']:
    print(r['identifier'])
Prints:
3ZBF
4UXL
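Two follow-up steps may be useful here: pulling the query out of the search URL a little more defensively, and turning each identifier into the page URL the question asks about. The sketch below is only an illustration under stated assumptions. `extract_search_query` and `structure_page_url` are hypothetical helper names, the sample query is a shortened stand-in for the one in the question, and the `https://www.rcsb.org/structure/<ID>` pattern is an assumption about how entry pages are addressed, not something stated above. It uses only the standard library, so it runs without network access.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def extract_search_query(search_url):
    """Return the JSON query stored in the URL's `request` parameter."""
    # parse_qs splits the query string and percent-decodes each value
    params = parse_qs(urlsplit(search_url).query)
    return json.loads(params["request"][0])


def structure_page_url(pdb_id):
    """Build an entry page URL (assumed pattern, see the note above)."""
    return f"https://www.rcsb.org/structure/{pdb_id.upper()}"


# Shortened, hypothetical query in the same shape as the one in the question.
query = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_entity_source_organism.rcsb_gene_name.value",
            "operator": "exact_match",
            "value": "MCF3",
        },
    },
    "return_type": "entry",
}
search_url = "https://www.rcsb.org/search?" + urlencode({"request": json.dumps(query)})

# Decoding the URL gives back the original query dictionary.
decoded = extract_search_query(search_url)
print(decoded["query"]["parameters"]["value"])

# Build one page URL per identifier returned by the search.
for pdb_id in ["3ZBF", "4UXL"]:
    print(structure_page_url(pdb_id))
```

Compared with `url.split('=')[-1]`, reading the `request` parameter by name should keep working if the URL ever carries additional query parameters. The decoded dictionary can then be posted to the search endpoint exactly as in the script above.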