Why can't Scrapy get this HTML?


This URL, with a postcode in the query string, loads the search results correctly in a browser:

https://www.psychotherapy.org.uk/find-a-therapist/?Location=M3%201AR&Distance=10&page=7

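Each search result has its own h2 tag. In scrapy shell I get a 200 response, but the only HTML Scrapy receives is the header, footer, menus and so on; the search-results HTML is missing.

Below is the output for the h2 tags, but it is the same for any tag.

Is there an explanation for this?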

In [1]: fetch('https://www.psychotherapy.org.uk/find-a-therapist/?Location=M3%201AR&Distance=10&page=7')
2024-04-12 15:45:28 [scrapy.core.engine] INFO: Spider opened
2024-04-12 15:45:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=e3486f25-6d19-4663-b876-35d01b822096&url=https%3A%2F%2Fwww.psychotherapy.org.uk%2Frobots.txt> (referer: None)
2024-04-12 15:45:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=e3486f25-6d19-4663-b876-35d01b822096&url=https%3A%2F%2Fproxy.scrapeops.io%2Frobots.txt> (referer: None)
2024-04-12 15:45:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=e3486f25-6d19-4663-b876-35d01b822096&url=https%3A%2F%2Fwww.psychotherapy.org.uk%2Ffind-a-therapist%2F%3FLocation%3DM3%25201AR%26Distance%3D10%26page%3D7> (referer: None)

In [2]: response.css('h2').getall()
Out[2]:
['<h2>Refine your search</h2>',
 '<h2>Looking for a specific therapist?</h2>',
 '<h2>\r\n                        <span class="sub">Bookmarks</span>\r\n                        My Shortlist\r\n                    </h2>',
 '<h2>Contact us</h2>',
 '<h2>Links</h2>',
 '<h2>Connect with us</h2>']

In [3]:
web-scraping scrapy
1 Answer

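As Lakshmanarao Simhadri pointed out, the page makes a second request, a POST, when it loads, and that request is what returns the search results. In your browser's Network tab, look for the request sent to: https://www.psychotherapy.org.uk/umbraco/Surface/SearchSurface/Search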

You need to supply the form data for this second POST request (it can also be copied from the Network tab). You can use Scrapy's FormRequest class to assemble the request.
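To illustrate what "form data" means here, the following minimal sketch uses only the standard library to show how a dictionary of fields is turned into a URL-encoded POST body. To my understanding, FormRequest encodes its formdata argument in a comparable way, so values should be given in their decoded form (a literal space rather than "+"); that detail is worth checking against the Scrapy documentation.

```python
from urllib.parse import urlencode

# Decoded value with a literal space: the encoder turns the space into "+".
encoded_space = urlencode({"Location": "M3 1AR", "Distance": "10"})
print(encoded_space)  # Location=M3+1AR&Distance=10

# Already-encoded value copied verbatim from the Network tab:
# the "+" is escaped again and the server sees a literal plus sign.
encoded_plus = urlencode({"Location": "M3+1AR"})
print(encoded_plus)  # Location=M3%2B1AR
```

If the Network tab shows a value such as M3+1AR in the raw request body, the corresponding dictionary value is most likely "M3 1AR".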

The following example can be run from scrapy shell:

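# Form fields copied from the browser's Network tab for the search POST request.
form_data = {
    "HelpWith": "",
    "InPerson": "false",
    "Remote": "false",
    "Location": "M3 1AR",  # FormRequest URL-encodes values, so use a literal space, not "+"
    "Pager.CurrentPage": "7",
    "KeywordFilter": "",
    "Distance": "10",
    "LocationSearchOutsideUK": "false",
    "OnlyProfilesWithPhotos": "false",
    "OnlyWheelchairAccessible": "false",
    "OrderSeed": "2107061618",  # value seen in the Network tab; it may differ per session
}
# X-Requested-With is an HTTP header, so it is sent as a header rather than as a form field.
req = scrapy.FormRequest(
    url="https://www.psychotherapy.org.uk/umbraco/Surface/SearchSurface/Search",
    formdata=form_data,
    headers={"X-Requested-With": "XMLHttpRequest"},
)
fetch(req)
response.css(".profile-listing h2::text").getall()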

The CSS query above should print the therapists' names.
