Scrapy: running thousands of instances of the same spider


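I have the following task: our database holds about 2k URLs, and we need to run the spider for each URL until all of them have been processed. I am running the spider on batches of URLs (10 at a time).

I used the following code: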

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

URLs = crawler_table.find(crawl_timestamp=None)
settings = get_project_settings()
for i in range(len(URLs) // 10):
    process = CrawlerProcess(settings)

    limit = 10
    kount = 0

    for crawl in crawler_table.find(crawl_timestamp=None):
        if kount < limit:
            kount += 1
            process.crawl(
                MySpider,
                start_urls=[crawl['crawl_url']]
           )
    process = CrawlerProcess(settings)
    process.start()

But it only works on the first iteration of the loop. On the second iteration I get this error:

  File "C:\Program Files\Python310\lib\site-packages\scrapy\crawler.py", line 327, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1314, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1296, in startRunning
    ReactorBase.startRunning(cast(ReactorBase, self))
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 840, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Is there a way to avoid this error and run the spider for all 2k URLs?

python scrapy twisted
1 Answer

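This happens because the Twisted reactor cannot be started twice in the same process. What you need to do is create the process outside the loop and call start() only once, after all the crawls have been scheduled: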

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# Create the process once, outside the loop: its reactor can only be started once.
process = CrawlerProcess(settings)

# Schedule one crawl per unprocessed URL. Nothing runs until process.start().
# (crawler_table and MySpider come from the asker's project.)
for crawl in crawler_table.find(crawl_timestamp=None):
    process.crawl(
        MySpider,
        start_urls=[crawl['crawl_url']]
    )

# Start the reactor a single time; this blocks until every scheduled crawl finishes.
process.start()

You can look at the example provided in the documentation.
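If the crawls really should run 10 at a time rather than all at once, one possible approach is to chain the batches inside a single reactor run, along the lines of the CrawlerRunner pattern for running multiple spiders in one process. The following is only a sketch under assumptions: the helper names chunked and crawl_in_batches are made up for illustration, the project is assumed to work with Twisted's default reactor (a custom TWISTED_REACTOR setting may need extra setup), and details of reactor handling differ between Scrapy versions.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def crawl_in_batches(spider_cls, urls, batch_size=10, settings=None):
    """Run one crawl per URL, `batch_size` crawls at a time, in one reactor run.

    Hypothetical helper, not part of Scrapy.
    """
    # Imported lazily so that `chunked` stays usable without Scrapy installed.
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings
    from twisted.internet import defer, reactor  # assumes the default reactor

    if settings is None:
        settings = get_project_settings()
    configure_logging(settings)
    runner = CrawlerRunner(settings)

    @defer.inlineCallbacks
    def run_all():
        try:
            for batch in chunked(urls, batch_size):
                # Schedule the whole batch...
                for url in batch:
                    runner.crawl(spider_cls, start_urls=[url])
                # ...then wait until every crawl in it has finished.
                yield runner.join()
        finally:
            # Stop the reactor whether the batches succeeded or failed.
            reactor.stop()

    run_all()
    reactor.run()  # started exactly once; blocks until reactor.stop()
```

With the names from the question, this could be called as crawl_in_batches(MySpider, [row['crawl_url'] for row in crawler_table.find(crawl_timestamp=None)]). Because the reactor is started only once and each batch is awaited before the next one is scheduled, ReactorNotRestartable should not come up.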
