Finding a string in the text of a script element with multiple text nodes


I'm trying to scrape a page, and I want to wait until a string is detected in a script element before returning the page's HTML.

Here is my MRE spider:

from scrapy import Request, Spider
from scrapy.crawler import CrawlerProcess
from scrapy_playwright.page import PageMethod


class FlashscoreSpider(Spider):
    name = "flashscore"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    }

    def start_requests(self):
        yield Request(
            url="https://www.flashscore.com/match/WKM03Vff/#/match-summary/match-summary",
            meta=dict(
                dont_redirect=True,
                playwright=True,
                playwright_page_methods=[
                    # Wait until a script element whose text contains the
                    # match id appears before returning the response.
                    PageMethod(
                        method="wait_for_selector",
                        selector="//script[contains(text(), 'WKM03Vff')]",
                        timeout=5000,
                    ),
                ],
            ),
            callback=self.parse,
        )

    def parse(self, response):
        print("I've loaded the page ready to parse!!!")


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FlashscoreSpider)
    process.start()

This results in the following error:

playwright._impl._api_types.TimeoutError: Timeout 5000ms exceeded.

My understanding is that this happens because the script element contains multiple text nodes and my XPath only checks the first one. Since the string I'm looking for sits in a later node, I get the TimeoutError.
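
(As an aside on that distinction: when contains() is handed the node-set returned by text(), XPath 1.0 only looks at the first text node, whereas contains(., ...) tests the element's full string value, i.e. all of its text nodes concatenated. A minimal, untested sketch of the same wait written against the whole string value, or against each text node in turn, might look like this; neither expression has been verified against the live flashscore DOM:)

from scrapy_playwright.page import PageMethod

# Sketches only -- XPath 1.0 expressions that look beyond the first text node.
wait_on_string_value = PageMethod(
    method="wait_for_selector",
    # contains(., ...) tests the element's full string value
    selector="//script[contains(., 'WKM03Vff')]",
    timeout=5000,
)
wait_on_any_text_node = PageMethod(
    method="wait_for_selector",
    # text()[contains(., ...)] tests each text node child individually
    selector="//script[text()[contains(., 'WKM03Vff')]]",
    timeout=5000,
)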

This answer offers a neat solution, but Scrapy doesn't support XPath 2.0, so when I use:

"string-join(//script/text()[normalize-space()], ' ')"

I get the following error:

playwright._impl._api_types.Error: Unexpected token "string-join(" while parsing selector "string-join(//script/text()[normalize-space()], ' ')"

Another option is given in the comments on that answer, but I'm worried that the number of text nodes varies.

After some fairly intensive googling, I don't think there is a robust XPath solution. Is there a CSS equivalent, though? I've tried:

"script:has-text('WKM03Vff')"

but this again results in a Timeout exception.
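
(If I recall the Playwright docs correctly, its text matching engine ignores script and style elements, which would explain why :has-text() never matches here. One way to sidestep selector semantics entirely is to wait on a JavaScript predicate via page.wait_for_function; the predicate below is my own guess and has not been tested against this page:)

from scrapy_playwright.page import PageMethod

# Sketch: wait until the string shows up anywhere in the serialized document,
# script contents included. The predicate is an assumption, not verified here.
wait_for_match_id = PageMethod(
    method="wait_for_function",
    expression="() => document.documentElement.outerHTML.includes('WKM03Vff')",
    timeout=5000,
)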

python xpath scrapy scrapy-playwright
1 Answer

As I mentioned in the comments, script tags usually don't need any waiting at all, because they don't need to be rendered.

You should be able to access their contents straight away from your parse method.

For example:

from scrapy import Request, Spider
from scrapy.crawler import CrawlerProcess
from scrapy_playwright.page import PageMethod


class FlashscoreSpider(Spider):
    name = "flashscore"
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    }

    def start_requests(self):
        yield Request(
            url="https://www.flashscore.com/match/WKM03Vff/#/match-summary/match-summary",
            meta=dict(
                dont_redirect=True,
                playwright=True,
                playwright_include_page=True),
            callback=self.parse,
        )

    def parse(self, response):
        # The script content is already present in the downloaded HTML,
        # so it can be queried directly without any explicit wait.
        print(response.xpath("//script[contains(text(), 'WKM03Vff')]"))
        print(response.xpath("//script[contains(text(), 'WKM03Vff')]/text()").get())
        print("I've loaded the page ready to parse!!!")


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FlashscoreSpider)
    process.start()

Partial output

2023-09-13 00:07:02 [scrapy-playwright] DEBUG: [Context=default] Request: <GET 
https://cdn.cookielaw.org/scripttemplates/202210.1.0/assets/otCommonStyles.css> 
(resource type: fetch, referrer: https://www.flashscore.com/)
[<Selector query="//script[contains(text(), 'WKM03Vff')]" 
data='<script>\n\t\t\twindow.environment = {"ev...'>]

                        window.environment = {"event_id_c":"WKM03Vff",
"eventStageTranslations":{"1":"&nbsp;","45":"To finish","42":"Awaiting 
updates","2":"Live","17": "Set 1","18":"Set 2","19":"Set 3","20":"Set 
4","21":"Set 5","47":"Set 1 - Tiebreak","48":"Set 2 - Tiebreak","49":"Set 3 - 
Tiebreak","50":"Set 4 - Tiebreak","51":"Set 5 - Tiebreak","46":"Break 
Time","3":"Finished",....p10:100","port":443,"sslEnabled":true,"namespace":"\/f
s\/fs3_","projectId":2,"enabled":false},"project_id":2};

I've loaded the page ready to parse!!!
2023-09-13 00:07:02 [scrapy.core.engine] INFO: Closing spider (finished)
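
Once the script text is in hand, one possible next step is to pull the embedded window.environment object out of it. The helper below is a sketch of my own: the regular expression and the assumption that the object literal is valid JSON are not part of the answer above.

import json
import re


def extract_environment(script_text):
    """Sketch: pull the window.environment object out of the script text.

    Assumes the assignment ends with '};' and that the object literal is
    valid JSON -- both are assumptions about this particular page.
    """
    match = re.search(r"window\.environment\s*=\s*(\{.*?\});", script_text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


# Rough usage inside parse():
# script_text = response.xpath("//script[contains(text(), 'WKM03Vff')]/text()").get()
# environment = extract_environment(script_text)
# print(environment["event_id_c"])  # 'WKM03Vff' according to the output above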