Force Scrapy to close at the timeout limit when a TCP connection freezes

Question

In my scraper there is one specific URL that goes down periodically. The finish stats show:

 'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 2,
 'elapsed_time_seconds': 150.027039,
 'finish_reason': 'closespider_timeout',

I added CLOSESPIDER_TIMEOUT=30 to my settings, but the crawl above took 150s before it terminated with the stats shown. Why is that?

I also set a custom download timeout in the scraper:

custom_args = {
    'DOWNLOAD_TIMEOUT': 12
}

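But this was not respected either. According to this SO question, the operating system's connection limit configuration seems to take precedence over Scrapy's own connection limits. Is there a way to add a hard kill limit to Scrapy, such as a middleware or a signal that forces the spider to exit?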

Answer 1

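import time

from scrapy.exceptions import CloseSpider


class TimeoutMiddleware:
    """Close the spider once it has been running longer than HARD_TIMEOUT."""

    def __init__(self, timeout):
        self.timeout = timeout
        # Monotonic clock, so system clock adjustments do not skew the elapsed time.
        self.start_time = time.monotonic()

    @classmethod
    def from_crawler(cls, crawler):
        # HARD_TIMEOUT is a custom setting, not a built-in one; default is 180 seconds.
        timeout = crawler.settings.getint('HARD_TIMEOUT', 180)
        return cls(timeout)

    def process_spider_output(self, response, result, spider):
        # Spider middleware hook: runs each time a callback returns output for a response.
        if time.monotonic() - self.start_time >= self.timeout:
            raise CloseSpider('Reached hard timeout limit')
        return result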

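You can then activate this middleware by adding it to the SPIDER_MIDDLEWARES setting. Because it implements process_spider_output, it is a spider middleware and would not be called if registered under DOWNLOADER_MIDDLEWARES.

SPIDER_MIDDLEWARES = {
    'myproject.middlewares.TimeoutMiddleware': 950,
}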

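Adjust the HARD_TIMEOUT value in your settings to set the maximum run time you want for the scraper. It is a custom setting read by the middleware above and defaults to 180 seconds.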
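One caveat: the middleware only checks the clock when a response actually reaches the spider, so it may not fire while the only pending connection is frozen. If you need a limit that does not depend on any response arriving, a watchdog thread is one option. Below is a minimal sketch using only the standard library. The names start_watchdog and _hard_exit are hypothetical helpers, not part of Scrapy, and the default action kills the process with os._exit, which skips all cleanup (no finally blocks, no atexit handlers, no flushing of feed exports).

```python
import os
import threading


def _hard_exit():
    # os._exit terminates the process immediately, without running cleanup handlers.
    os._exit(1)


def start_watchdog(timeout_seconds, on_expire=_hard_exit):
    """Run on_expire in a background thread once timeout_seconds have passed.

    Hypothetical helper: by default the whole process is killed, regardless
    of what the crawler is doing at that moment.
    """
    timer = threading.Timer(timeout_seconds, on_expire)
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
    return timer  # call timer.cancel() if the crawl finishes in time
```

You could call start_watchdog from from_crawler in the middleware above, for example with a value somewhat larger than HARD_TIMEOUT so that the graceful shutdown gets a chance to finish first. Treat it as a last resort, since anything not yet written to disk is lost.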
