Scrapy: executing pipelines in a specific order

Question

I have several pipelines that are configured to run one after another, like this:

SETTINGS = {
  ...,
  "ITEM_PIPELINES": {
    "pipelines.my_spider_pipeline.MySpiderPipeline": 1,
    "pipelines.my_images_pipeline.MyImagesPipeline": 2,
  },
}

This does not seem to work as expected, and I am not sure whether that is because of the code in pipelines.my_spider_pipeline.MySpiderPipeline:

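from scrapy import signals


class MySpiderPipeline(object):
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Build the pipeline with the crawler's stats collector
        pipeline = cls(crawler.stats)
        # item_scraped is a handler on this class (not shown here)
        crawler.signals.connect(pipeline.item_scraped, signal=signals.item_scraped)
        return pipeline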

The stats parameter is used to pass in the StatsCollector instance.

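Now, whenever my code runs, it first goes to from_crawler and then jumps to another function defined in MyImagesPipeline. I need it to go to process_item in MySpiderPipeline instead, because that is where I insert data into the database, and I need the id of the database record to be available in MyImagesPipeline.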

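What should I do to achieve this? I think this code is not flexible at all, and any possible change means moving a lot of code around. Any suggestions are welcome.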

I tried it without from_crawler, but that did not change anything.

Answer 1

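First, you need to define id as part of your item definition. In the process_item method of the MySpiderPipeline class, you need to get the id of the item inserted into the database and save it on the item as one of its fields.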

class MySpiderPipeline:
    def process_item(self, item, spider):
        # insert item in db and get back the id inserted
        # code here

        # add the id returned to the item and return it
        item["id"] = "id"  # placeholder: use the id returned by the insert
        return item
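The "insert and get back the id" step above could look roughly like the following. This is only a minimal sketch: the question does not say which database is used, so it assumes SQLite through the standard-library sqlite3 module, a hypothetical items table with a title column, and items that behave like plain dicts.

```python
import sqlite3


class MySpiderPipeline:
    """Sketch: insert each item into SQLite and store the new row id on the item."""

    def open_spider(self, spider):
        # Assumption: an in-memory SQLite database and a hypothetical "items" table.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE items (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)"
        )

    def close_spider(self, spider):
        # Release the connection when the spider finishes.
        self.conn.close()

    def process_item(self, item, spider):
        # Insert the row; cursor.lastrowid is the id SQLite assigned to it.
        cursor = self.conn.execute(
            "INSERT INTO items (title) VALUES (?)", (item.get("title"),)
        )
        self.conn.commit()
        # Store the id on the item so later pipelines can read it.
        item["id"] = cursor.lastrowid
        return item
```

With a real scrapy.Item, the id field would also have to be declared on the item class, as described above, before it can be assigned.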

In the process_item method of the MyImagesPipeline class, you need to retrieve the value of id that was set in the MySpiderPipeline class and use it where applicable.

class MyImagesPipeline:
    def process_item(self, item, spider):
        # retrieve the id value that is part of the item
        item_id = item["id"]

        # use the id value as needed
        # code here
        return item
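This approach relies on the ordering rule for ITEM_PIPELINES: Scrapy runs the enabled item pipelines in ascending order of their integer values and hands each pipeline's returned item to the next one. The following plain-Python sketch imitates that behavior; it is not Scrapy's actual implementation, and run_pipelines, StoreInDbPipeline and UseIdPipeline are hypothetical names used only for illustration.

```python
class StoreInDbPipeline:
    """Stand-in for MySpiderPipeline: pretends to insert and records an id."""

    def process_item(self, item, spider):
        item["id"] = 42  # placeholder for the id returned by the database
        return item


class UseIdPipeline:
    """Stand-in for MyImagesPipeline: reads the id set by the earlier pipeline."""

    def process_item(self, item, spider):
        item["image_dir"] = f"images/{item['id']}"
        return item


def run_pipelines(item, pipelines, spider=None):
    # Lower numbers run first, mirroring the ITEM_PIPELINES ordering rule.
    for pipeline_cls, _order in sorted(pipelines.items(), key=lambda pair: pair[1]):
        item = pipeline_cls().process_item(item, spider)
    return item


# The dict order does not matter; only the integer values decide the sequence.
ITEM_PIPELINES = {UseIdPipeline: 2, StoreInDbPipeline: 1}
result = run_pipelines({"title": "example"}, ITEM_PIPELINES)
print(result)
```

Note that from_crawler is called once at startup, when Scrapy builds each pipeline object, so seeing it run before any process_item call is expected and does not by itself mean the order is wrong.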
