Passing arguments to process.crawl in Scrapy (Python)

Question  Votes: 23  Answers: 2

I would like to get the same result as this command line: scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json

My script is as follows:

import scrapy
from linkedin_anonymous_spider import LinkedInAnonymousSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

spider = LinkedInAnonymousSpider(None, "James", "Bond")
process = CrawlerProcess(get_project_settings())
process.crawl(spider) ## <-------------- (1)
process.start()

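I found that process.crawl() in (1) creates another LinkedInAnonymousSpider whose first and last are None (printed in (2)). If that is the case, there is no point in creating the spider object myself. How can I pass the first and last arguments to process.crawl()?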

linkedin_anonymous:

from logging import INFO

import scrapy

class LinkedInAnonymousSpider(scrapy.Spider):
    name = "linkedin_anonymous"
    allowed_domains = ["linkedin.com"]
    start_urls = []

    base_url = "https://www.linkedin.com/pub/dir/?first=%s&last=%s&search=Search"

    def __init__(self, input = None, first= None, last=None):
        self.input = input  # source file name
        self.first = first
        self.last = last

    def start_requests(self):
        print(self.first) ## <------------- (2)
        if self.first and self.last: # taking input from command line parameters
            url = self.base_url % (self.first, self.last)
            yield self.make_requests_from_url(url)

    def parse(self, response): . . .
python web-crawler scrapy scrapy-spider google-crawlers
2 Answers
47
votes
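Pass the spider arguments as keyword arguments to process.crawl. It builds the spider itself, so give it the spider class rather than an instance you created:

process.crawl(LinkedInAnonymousSpider, input='inputargument', first='James', last='Bond')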
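To illustrate why the arguments have to travel through the crawl call, here is a toy model of the behaviour described in the question. ToySpider and ToyProcess are hypothetical stand-ins written for this sketch, not Scrapy classes, and the logic is a heavy simplification based on the assumption that the process keeps only the spider class and constructs the spider later.

```python
class ToySpider:
    """Stand-in for a spider; mirrors the constructor from the question."""

    def __init__(self, input=None, first=None, last=None):
        self.input = input
        self.first = first
        self.last = last


class ToyProcess:
    """Hypothetical, simplified model of a crawler process (not Scrapy's code).

    It records a spider class plus arguments and builds the spider only
    when start() is called.
    """

    def __init__(self):
        self._scheduled = []
        self.spiders = []

    def crawl(self, spider_cls, *args, **kwargs):
        if not isinstance(spider_cls, type):
            # An instance was passed: only its class is reused, so the
            # values stored on that instance are dropped.
            spider_cls = type(spider_cls)
        self._scheduled.append((spider_cls, args, kwargs))

    def start(self):
        # Build every scheduled spider from its class and arguments.
        for spider_cls, args, kwargs in self._scheduled:
            self.spiders.append(spider_cls(*args, **kwargs))
        return self.spiders


# Case 1: a pre-built instance is handed over, its arguments are lost.
lost = ToyProcess()
lost.crawl(ToySpider(None, "James", "Bond"))
lost_spider = lost.start()[0]

# Case 2: the class and keyword arguments are handed over, they survive.
kept = ToyProcess()
kept.crawl(ToySpider, first="James", last="Bond")
kept_spider = kept.start()[0]

print(lost_spider.first, kept_spider.first)  # None James
```

In the real script the same idea applies: the keyword arguments given to process.crawl are the ones that reach the spider's __init__.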


0
votes
You can do this easily by running the command line from Python:

from scrapy import cmdline
cmdline.execute("scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json".split())
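Splitting a command string on whitespace breaks as soon as an argument value contains a space. A minimal sketch of building the argument list directly is shown here; build_crawl_argv is a hypothetical helper introduced for illustration, not part of Scrapy, and it uses only the flags that appear in the command above.

```python
def build_crawl_argv(spider_name, spider_args, output=None):
    """Build an argv list for `scrapy crawl` (hypothetical helper)."""
    argv = ["scrapy", "crawl", spider_name]
    for key, value in spider_args.items():
        # Each spider argument becomes its own "-a key=value" pair.
        argv += ["-a", f"{key}={value}"]
    if output is not None:
        argv += ["-o", output]
    return argv


argv = build_crawl_argv(
    "linkedin_anonymous",
    {"first": "James", "last": "Bond"},
    output="output.json",
)

# A value with a space stays in a single list element.
argv_with_space = build_crawl_argv("linkedin_anonymous", {"first": "Mary Ann"})

print(argv)
```

The resulting list can be passed to cmdline.execute in place of the split string. As far as I know, execute ends the Python process once the command finishes, so code placed after that call may never run.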
