How to run Scrapy from within a Python script

Problem description (46 votes, 7 answers)

I am new to Scrapy and I am looking for a way to run it from a Python script. I found 2 sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. Here is the example code:

# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script. 
# 
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
# 
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet. 

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times. 
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010

Thank you.

python web-scraping web-crawler scrapy
7 Answers
57 votes

All the other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 requires:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
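If your script lives inside a Scrapy project and you want your settings.py (pipelines, middlewares, etc.) applied, a minimal sketch might look like the following, assuming MySpider1 and MySpider2 are your own spider classes under hypothetical import paths:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical import paths; replace with your own spider modules.
from myproject.spiders.spider1 import MySpider1
from myproject.spiders.spider2 import MySpider2

# get_project_settings() picks up the project's settings.py
# (run the script from inside the project, or set SCRAPY_SETTINGS_MODULE).
process = CrawlerProcess(get_project_settings())

process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # runs both crawls and blocks until they finish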

14 votes

Though I haven't tried it myself, I think the answer can be found within the scrapy documentation. To quote directly from it:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here

From what I gather, this is a newer development in the library which renders some of the earlier approaches online (such as the one in the question) obsolete.


12 votes

In scrapy 0.19.x you should do this:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent

Note these lines:

settings = get_project_settings()
crawler = Crawler(settings)

Without them, your spider won't use your settings and won't save the items. It took me a while to figure out why the example in the documentation wasn't saving my items. I sent a pull request to fix the documentation example.

There is one more way of doing this: simply call the command directly from your script:

from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split())  #followall is the spider's name
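If you also want the scraped items exported when calling it this way, the standard -o output flag can be appended to the same command string (the filename here is only an example):

from scrapy import cmdline
cmdline.execute("scrapy crawl followall -o items.json".split())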

Reposting this answer here from my first answer: https://stackoverflow.com/a/19060485/1402286


7 votes

When multiple crawlers need to be run inside one Python script, the reactor stop needs to be handled with care, because the reactor can only be stopped once and cannot be restarted.

However, while working on my project I found that using

os.system("scrapy crawl yourspider")

is the simplest approach. It saves me from handling all sorts of signals, especially when I have multiple spiders.

If performance is a concern, you can use multiprocessing to run your spiders in parallel, something like this:

import os
from multiprocessing import Pool

def _crawl(spider_name=None):
    # Each worker process shells out to the scrapy CLI for one spider.
    if spider_name:
        os.system('scrapy crawl %s' % spider_name)
    return None

def run_crawler():

    spider_names = ['spider1', 'spider2', 'spider2']

    pool = Pool(processes=len(spider_names))
    pool.map(_crawl, spider_names)
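A minimal way to invoke this, assuming the script is run from inside the Scrapy project so the scrapy command can find the spiders (the __main__ guard also matters because multiprocessing may re-import the module in child processes):

if __name__ == '__main__':
    run_crawler()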

1 vote

We can simply use:

from scrapy.crawler import CrawlerProcess
from project.spiders.test_spider import SpiderName

process = CrawlerProcess()
process.crawl(SpiderName, arg1=val1,arg2=val2)
process.start()

Use these arguments inside the spider's __init__ function; they are then available throughout the spider, as in the sketch below.
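As an illustration, a sketch of a spider that receives such arguments (the names arg1 and arg2 are just placeholders) could look like this:

import scrapy

class SpiderName(scrapy.Spider):
    name = "spider_name"  # placeholder name

    def __init__(self, arg1=None, arg2=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Keyword arguments passed to process.crawl() arrive here.
        self.arg1 = arg1
        self.arg2 = arg2

    def parse(self, response):
        # self.arg1 / self.arg2 can be used anywhere in the spider.
        pass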


-3 votes
# -*- coding: utf-8 -*-
import sys
from scrapy.cmdline import execute


def gen_argv(s):
    sys.argv = s.split()


if __name__ == '__main__':
    gen_argv('scrapy crawl abc_spider')
    execute()

Put this code at the path where you can run scrapy crawl abc_spider from the command line. (Tested with Scrapy==0.24.6)


-3 votes

If you just want to run a simple crawl, it is easy: simply run the command

scrapy crawl spider_name

There is also an option to export the results and store them in a certain format, such as JSON, XML, or CSV:

scrapy crawl spider_name -o result.csv (or result.json, or result.xml)

You may want to try it out.
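To tie this back to the original question of running from a Python script, a minimal sketch that shells out to the same command (the spider name and output file are placeholders) could be:

import subprocess

# Run the crawl and export the scraped items; "spider_name" and
# "result.json" are placeholders for your own spider and output file.
subprocess.run(
    ["scrapy", "crawl", "spider_name", "-o", "result.json"],
    check=True,
)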
