I'm trying to scrape a page, and I want to wait until a given string is detected inside a `script` element before returning the page's HTML.

Here is my MRE (minimal reproducible example) spider:
```python
from scrapy import Request, Spider
from scrapy.crawler import CrawlerProcess
from scrapy_playwright.page import PageMethod


class FlashscoreSpider(Spider):
    name = "flashscore"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    }

    def start_requests(self):
        yield Request(
            url="https://www.flashscore.com/match/WKM03Vff/#/match-summary/match-summary",
            meta=dict(
                dont_redirect=True,
                playwright=True,
                playwright_page_methods=[
                    PageMethod(
                        method="wait_for_selector",
                        selector="//script[contains(text(), 'WKM03Vff')]",
                        timeout=5000,
                    ),
                ],
            ),
            callback=self.parse,
        )

    def parse(self, response):
        print("I've loaded the page ready to parse!!!")


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FlashscoreSpider)
    process.start()
```
This raises the following error:

```
playwright._impl._api_types.TimeoutError: Timeout 5000ms exceeded.
```

My understanding is that this happens because the `script` element contains multiple text nodes, and my XPath only checks the first one. Since the string I'm looking for sits in a later text node, I get the `TimeoutError`.
This answer offers a concise solution, but Playwright's selector engine doesn't support XPath 2.0, so when I use:

```
"string-join(//script/text()[normalize-space()], ' ')"
```

I get the following error:

```
playwright._impl._api_types.Error: Unexpected token "string-join(" while parsing selector "string-join(//script/text()[normalize-space()], ' ')"
```
The comments on that answer give an alternative, but I'm worried because the number of text nodes varies.

After some fairly intensive googling, I don't believe there is a robust XPath solution. Is there a CSS equivalent, though? I tried:

```
"script:has-text('WKM03Vff')"
```

but this again results in a `Timeout` exception.
As I mentioned in the comments, script tags usually don't require any waiting at all, since they don't need to be rendered. You should be able to access their contents from the parse method right away.

For example:
```python
from scrapy import Request, Spider
from scrapy.crawler import CrawlerProcess


class FlashscoreSpider(Spider):
    name = "flashscore"
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    }

    def start_requests(self):
        yield Request(
            url="https://www.flashscore.com/match/WKM03Vff/#/match-summary/match-summary",
            meta=dict(
                dont_redirect=True,
                playwright=True,
            ),
            callback=self.parse,
        )

    def parse(self, response):
        print(response.xpath("//script[contains(text(), 'WKM03Vff')]"))
        print(response.xpath("//script[contains(text(), 'WKM03Vff')]/text()").get())
        print("I've loaded the page ready to parse!!!")


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FlashscoreSpider)
    process.start()
```
```
2023-09-13 00:07:02 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://cdn.cookielaw.org/scripttemplates/202210.1.0/assets/otCommonStyles.css> (resource type: fetch, referrer: https://www.flashscore.com/)
[<Selector query="//script[contains(text(), 'WKM03Vff')]" data='<script>\n\t\t\twindow.environment = {"ev...'>]
window.environment = {"event_id_c":"WKM03Vff","eventStageTranslations":{"1":" ","45":"To finish","42":"Awaiting updates","2":"Live","17":"Set 1","18":"Set 2","19":"Set 3","20":"Set 4","21":"Set 5","47":"Set 1 - Tiebreak","48":"Set 2 - Tiebreak","49":"Set 3 - Tiebreak","50":"Set 4 - Tiebreak","51":"Set 5 - Tiebreak","46":"Break Time","3":"Finished",....p10:100","port":443,"sslEnabled":true,"namespace":"\/fs\/fs3_","projectId":2,"enabled":false},"project_id":2};
I've loaded the page ready to parse!!!
2023-09-13 00:07:02 [scrapy.core.engine] INFO: Closing spider (finished)
```
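Once the script text is in hand, the embedded object can be pulled out with a regex plus `json.loads`. A minimal sketch, where `script_text` is a trimmed-down stand-in for the real payload and the object literal is assumed to be valid JSON (which may not hold for the full `window.environment` value):

```python
import json
import re

# Stand-in for response.xpath("//script[contains(text(), 'WKM03Vff')]/text()").get()
script_text = 'window.environment = {"event_id_c":"WKM03Vff","project_id":2};'

# Capture the object literal assigned to window.environment.
match = re.search(r"window\.environment\s*=\s*(\{.*\});", script_text, re.S)
if match:
    environment = json.loads(match.group(1))
    print(environment["event_id_c"])  # → WKM03Vff
```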