Scrapy callback not executed when using Playwright for JavaScript rendering


I'm using Scrapy with the Playwright plugin to scrape a website that relies on JavaScript for rendering. My spider has two async methods, parse_categories and parse_product_page. parse_categories recursively checks the URL for categories and, for each one found, yields another request back to the parse_categories callback. When no categories are found, it should yield a request to the parse_product_page callback.

However, when it reaches the else block in parse_categories, parse_product_page never seems to be requested. I've confirmed that the code enters the else block, but the print statement in parse_product_page is never reached.

Here is my spider code:

    async def parse_categories(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        print("Inside parse_categories, URL:", response.url)

        categories = response.xpath(
            '//*[@id="category-column"]/div[position()=1]'
            '//a[not(contains(@href, "pre-orders"))]/@href'
        ).extract()
        # print("Found categories:", categories)

        if categories:
            print("Categories found")
            for category in categories:
                next_page_url = "https://www.play-asia.com" + category
                # Keep iterating until arriving at a sub-category page or a product grid page
                yield scrapy.Request(
                    url=next_page_url,
                    callback=self.parse_categories,
                    meta=dict(
                        playwright=True,
                        playwright_include_page=True,
                        errback=self.errback,
                    ),
                )
        else:
            print("In else block, categories not found.")
            yield scrapy.Request(
                url=response.request.url,
                callback=self.parse_product_page,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page=response.meta["playwright_page"],
                    playwright_page_methods=[
                        PageMethod("wait_for_selector", "#n_pf_holder > div", timeout=50000)
                    ],
                    errback=self.errback,
                ),
            )

    async def parse_product_page(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        # Function is working
        print(f"Processing URL: {response.url}")
        # ... rest of function ...

I have tried increasing the timeout in case the page needed more time to load, but I got the same result.

I have also tried retrying the crawl in case the spider was being blocked, but that doesn't seem to be the issue: it is able to reach every URL extracted from the categories, right up until no categories are found and the code runs the else block.
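A likely cause worth checking: the else block re-requests response.request.url, a URL the spider has already visited, and Scrapy's default dupefilter silently drops requests whose fingerprint has been seen before, so the request never reaches parse_product_page (passing dont_filter=True to that Request bypasses the filter). A minimal stdlib sketch of that scheduling behavior, with a toy schedule function standing in for Scrapy's scheduler plus RFPDupeFilter:

```python
# Toy model of Scrapy's scheduler + dupefilter: a request to a URL whose
# fingerprint (here, just the URL) was already seen is dropped, unless
# the request sets dont_filter=True.
seen = set()

def schedule(url, dont_filter=False):
    """Return what the scheduler would do with a request to `url`."""
    if not dont_filter and url in seen:
        return "dropped as duplicate"
    seen.add(url)
    return "scheduled"

url = "https://www.play-asia.com/some-product-grid"
print(schedule(url))                    # first visit (parse_categories): "scheduled"
print(schedule(url))                    # the else-block re-request: "dropped as duplicate"
print(schedule(url, dont_filter=True))  # with dont_filter=True: "scheduled"
```

A second thing to verify in the else branch: the meta dict forwards playwright_page even though that page was closed at the top of parse_categories, so even an unfiltered request may fail when scrapy-playwright tries to reuse the closed page. This is a sketch of the suspected mechanism, not a confirmed diagnosis.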

python-3.x scrapy web-crawler playwright-python scrapy-playwright