I'm using Scrapy with the Playwright plugin (scrapy-playwright) to crawl a site that relies on JavaScript for rendering. My spider has two async callbacks, parse_categories and parse_product_page. parse_categories recursively follows category links found at the URL and yields further requests back to the parse_categories callback. When no categories are found, it should yield a request to the parse_product_page callback instead.
However, when execution reaches the else block in parse_categories, parse_product_page never seems to be requested. I've confirmed that the code enters the else block, but the print statement inside parse_product_page is never reached.
Here is my spider code:
```python
async def parse_categories(self, response):
    page = response.meta['playwright_page']
    await page.close()
    print("Inside parse_categories, URL:", response.url)
    categories = response.xpath('//*[@id="category-column"]/div[position()=1]//a[not(contains(@href,"pre-orders"))]/@href').extract()
    # print("Found categories:", categories)
    if categories:
        print("Categories found")
        for category in categories:
            next_page_url = 'https://www.play-asia.com' + category
            # Keep iterating until arriving at sub category page or products grid page
            yield scrapy.Request(url=next_page_url, callback=self.parse_categories,
                                 meta=dict(
                                     playwright=True,
                                     playwright_include_page=True,
                                     errback=self.errback
                                 ))
    else:
        print("In else block, categories not found.")
        yield scrapy.Request(url=response.request.url, callback=self.parse_product_page,
                             meta=dict(
                                 playwright=True,
                                 playwright_include_page=True,
                                 playwright_page=response.meta['playwright_page'],
                                 playwright_page_methods=[
                                     PageMethod('wait_for_selector', '#n_pf_holder > div', timeout=50000)
                                 ],
                                 errback=self.errback
                             ))

async def parse_product_page(self, response):
    page = response.meta['playwright_page']
    await page.close()
    # Function is working
    print(f"Processing URL: {response.url}")
    ... rest of function ...
```
I've tried increasing the timeout in case the page simply needed more time to load, but I got the same result.
I've also tried retrying the crawl in case it was being blocked, but that doesn't seem to be the issue: the spider is able to visit every URL extracted from the categories, right up until no categories are found and the else block runs.
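One property of the else branch seems worth noting: it yields a request to response.request.url, i.e. the URL that was just crawled. By default, Scrapy's scheduler drops any request whose fingerprint it has already seen unless the request is created with dont_filter=True, and a dropped request never reaches its callback. Whether that is what's happening here I can't say for certain, but the retry experiments above would not catch it, since filtering happens silently (a DEBUG-level "Filtered duplicate request" log line aside). A minimal sketch of that kind of seen-URL filtering, in plain Python standing in for Scrapy's real dupefilter (DupeFilterSketch and should_schedule are illustrative names, not Scrapy APIs):

```python
# Sketch of duplicate-request filtering as Scrapy's scheduler does it by
# default: an already-seen URL is silently dropped unless dont_filter is set.
# Plain Python for illustration only; not Scrapy's actual dupefilter class.

class DupeFilterSketch:
    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        """Return True if a request for `url` would be scheduled."""
        if dont_filter:
            return True       # bypass filtering entirely
        if url in self.seen:
            return False      # duplicate: request silently dropped
        self.seen.add(url)
        return True


f = DupeFilterSketch()
url = "https://www.play-asia.com/some-category"

print(f.should_schedule(url))                    # initial crawl: True
print(f.should_schedule(url))                    # same URL again: False
print(f.should_schedule(url, dont_filter=True))  # explicitly allowed: True
```

If this is the cause, the usual workaround is to pass dont_filter=True on the request yielded in the else block; separately, that request also passes the already-closed playwright_page in its meta (page.close() is awaited at the top of parse_categories), which may be worth double-checking too.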