This URL, with a postcode in the query string, loads the search results correctly in a browser:
https://www.psychotherapy.org.uk/find-a-therapy/?Location=M3%201AR&Distance=10&page=7
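Each search result has its own h2 tag. In the scrapy shell I get a 200 response, but the only HTML Scrapy receives is the header, footer, menus and so on; the search results HTML is missing.
Below is an example using h2 tags, but the same happens for any tag.
Is there an explanation?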
In [1]: fetch('https://www.psychotherapy.org.uk/find-a-therapist/?Location=M3%201AR&Distance=10&page=7')
2024-04-12 15:45:28 [scrapy.core.engine] INFO: Spider opened
2024-04-12 15:45:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=e3486f25-6d19-4663-b876-35d01b822096&url=https%3A%2F%2Fwww.psychotherapy.org.uk%2Frobots.txt> (referer: None)
2024-04-12 15:45:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=e3486f25-6d19-4663-b876-35d01b822096&url=https%3A%2F%2Fproxy.scrapeops.io%2Frobots.txt> (referer: None)
2024-04-12 15:45:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=e3486f25-6d19-4663-b876-35d01b822096&url=https%3A%2F%2Fwww.psychotherapy.org.uk%2Ffind-a-therapist%2F%3FLocation%3DM3%25201AR%26Distance%3D10%26page%3D7> (referer: None)
In [2]: response.css('h2').getall()
Out[2]:
['<h2>Refine your search</h2>',
'<h2>Looking for a specific therapist?</h2>',
'<h2>\r\n <span class="sub">Bookmarks</span>\r\n My Shortlist\r\n </h2>',
'<h2>Contact us</h2>',
'<h2>Links</h2>',
'<h2>Connect with us</h2>']
In [3]:
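As Lakshmanarao Simhadri pointed out, the page makes a second POST request when it loads. The search results come back from that request, which is why the HTML from the initial GET contains only the page shell. Look in the browser's Network tab for the request sent to: https://www.psychotherapy.org.uk/umbraco/Surface/SearchSurface/Search
You need to supply the form data for that second POST request (it can also be copied from the Network tab). You can use Scrapy's FormRequest class to assemble the request.
The following example can be run from the scrapy shell: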
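# Form fields copied from the POST request shown in the browser's Network tab.
form_data = {
    "HelpWith": "",
    "InPerson": "false",
    "Remote": "false",
    # FormRequest URL-encodes the values itself, so use a literal space here.
    "Location": "M3 1AR",
    "Pager.CurrentPage": "7",  # results page to fetch
    "KeywordFilter": "",
    "Distance": "10",
    "LocationSearchOutsideUK": "false",
    "OnlyProfilesWithPhotos": "false",
    "OnlyWheelchairAccessible": "false",
    "OrderSeed": "2107061618",
    "X-Requested-With": "XMLHttpRequest"
}

# `scrapy` and `fetch` are available by default inside the scrapy shell.
req = scrapy.FormRequest(
    url="https://www.psychotherapy.org.uk/umbraco/Surface/SearchSurface/Search",
    formdata=form_data,
)
fetch(req)

# Select the h2 text inside each .profile-listing element.
response.css(".profile-listing h2::text").getall()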
The CSS query above should print the therapists' names.
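
One detail worth checking is how the Location value gets encoded. As far as I know, FormRequest URL-encodes formdata in much the same way as urllib.parse.urlencode, so a space becomes "+" and a literal "+" becomes "%2B". Here is a small standard-library sketch of that behaviour, which also shows how the page number could be varied. The names BASE_FORM and form_for_page are hypothetical, and only a subset of the form fields is included to keep it short.

```python
from urllib.parse import urlencode

# Subset of the form fields, for illustration only (BASE_FORM is a made-up name).
BASE_FORM = {
    "Location": "M3 1AR",  # literal space; the encoder turns it into "+"
    "Distance": "10",
}


def form_for_page(page):
    """Return a copy of the form data with Pager.CurrentPage set (hypothetical helper)."""
    data = dict(BASE_FORM)
    data["Pager.CurrentPage"] = str(page)
    return data


# Body as it would look after URL-encoding for page 7.
body = urlencode(form_for_page(7))
print(body)  # Location=M3+1AR&Distance=10&Pager.CurrentPage=7

# A literal "+" in the value is escaped instead of being treated as a space.
print(urlencode({"Location": "M3+1AR"}))  # Location=M3%2B1AR
```

If the request comes back without results when "M3+1AR" is used, passing "M3 1AR" in formdata may be worth trying. Assuming the remaining fields are added back, the same helper could feed scrapy.FormRequest(formdata=form_for_page(n)) to request other pages.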