Get Scrapy crawler output/results in a script file function

Question

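I am using a script file to run a spider inside a Scrapy project, and the spider logs the crawler output/results. I want to use that spider output/results in a function in the same script file, without saving the output/results to any file or database. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script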

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())


d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()

def spider_output(output):
    # do something with that output
    pass

How can I get the spider output in the 'spider_output' method? Is it possible to get the output/results at all?

Answer 1

Here is a solution that collects all output/results into a list:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scrapy.signalmanager import dispatcher


def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    # item_scraped is the documented signal name; item_passed is a deprecated alias
    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)  # MySpider is your spider class, imported from your project
    process.start()  # the script will block here until the crawling is finished
    return results


if __name__ == '__main__':
    print(spider_results())

Answer 2

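AFAIK there is no way to do this, since crawl():

Returns a deferred that is fired when the crawling is finished.

The crawler also does not store results anywhere other than outputting them to the logger.

Returning the output would conflict with the whole asynchronous nature and structure of Scrapy, so saving to a file and then reading it is the preferred approach here. You can write a pipeline that saves your items to a file and then read that file in your spider_output. You will get your results, since reactor.run() blocks your script until the output file is complete anyway.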
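As a rough sketch of the pipeline idea above: the class name JsonLinesWriterPipeline, the file name items.jl, and the JSON Lines format are all assumptions made for illustration, not something from the original answer. The open_spider, close_spider, and process_item hooks are the standard item pipeline methods that Scrapy calls.

```python
import json


class JsonLinesWriterPipeline:
    """Hypothetical item pipeline: writes each scraped item as one JSON line."""

    def __init__(self, path="items.jl"):  # file name is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts; truncates any previous output.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called when the spider finishes; closing flushes the file to disk.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


def spider_output(path="items.jl"):
    """Read the items back; call this after reactor.run() has returned."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

To take effect, a pipeline like this would have to be enabled through the ITEM_PIPELINES setting, using whatever module path it has in your project.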

Answer 3

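My advice is to run the spider from your script with the Python subprocess module, instead of using the method from the Scrapy docs for running a spider from a Python script. The reason is that with the subprocess module you can capture the output/logs, and even the statements that you print from inside the spider.

In Python 3, execute the spider with the run method. For example: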

import subprocess
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if process.returncode == 0:
    result = process.stdout.decode('utf-8')
else:
    # inspect the error using 'process.stderr'
    error = process.stderr.decode('utf-8')

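Setting stdout/stderr to subprocess.PIPE is what allows the output to be captured, so it is important to set these arguments. Here command should be a sequence or a string (if it is a string, call the run method with one more parameter: shell=True). For example: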

command = ['scrapy', 'crawl', 'website', '-a', 'customArg=blahblah']
# or
command = 'scrapy crawl website -a customArg=blahblah' # with shell=True
#or
import shlex
command = shlex.split('scrapy crawl website -a customArg=blahblah') # without shell=True

Also, process.stdout will contain the output of the script, but it will be of type bytes. You need to convert it to str using decode('utf-8').
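Putting the pieces above together, a minimal sketch might look like the following. The helper name run_and_capture is hypothetical, and the demo uses a stand-in command (the current Python interpreter printing a word) so that it runs without a Scrapy project.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run a command (a list of arguments) and return (returncode, stdout, stderr) as str."""
    process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return (
        process.returncode,
        process.stdout.decode("utf-8"),
        process.stderr.decode("utf-8"),
    )


if __name__ == "__main__":
    # Stand-in command; replace it with e.g. ['scrapy', 'crawl', 'website'].
    code, out, err = run_and_capture([sys.executable, "-c", "print('hello')"])
    print(code, out.strip())
```

Keep in mind that Scrapy normally writes its log to stderr, so logged items would likely show up in the third returned value, while text printed from inside the spider lands in the second.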
