For several days now I have been struggling with a Scrapy/Twisted problem in my Main.py, which is supposed to run different spiders and analyze their output. Unfortunately, MySpider2 depends on the FEED from MySpider1 and can therefore only run after MySpider1 has finished. On top of that, MySpider1 and MySpider2 have different settings. So far I have not found a solution that lets me run the spiders sequentially, each with its own unique settings. I have looked at the Scrapy CrawlerRunner and CrawlerProcess docs and worked through several related Stack Overflow questions (Run Multiple Spider sequentially, Scrapy: how to run two crawlers one after another?, Scrapy run multiple spiders from a script, etc.), but without success.
Following the documentation on running spiders sequentially, my (slightly modified) code is:
    from MySpider1.myspider1.spiders.myspider1 import MySpider1
    from MySpider2.myspider2.spiders.myspider2 import MySpider2
    from twisted.internet import defer, reactor
    from scrapy.crawler import CrawlerProcess
    from scrapy.crawler import CrawlerRunner

    spider_settings = [{
        'FEED_URI': 'abc.csv',
        'LOG_FILE': 'abc/log.log'
        # MORE settings are here
    }, {
        'FEED_URI': '123.csv',
        'LOG_FILE': '123/log.log'
        # MORE settings are here
    }]

    spiders = [MySpider1, MySpider2]

    process = CrawlerRunner(spider_settings[0])
    process = CrawlerRunner(spider_settings[1])
    # Not sure if this is how it's supposed to be used for multiple settings,
    # but moving this line right before "yield process.crawl(spiders[1])"
    # also results in an error.

    @defer.inlineCallbacks
    def crawl():
        yield process.crawl(spiders[0])
        yield process.crawl(spiders[1])
        reactor.stop()

    crawl()
    reactor.run()
However, with this code only the first spider gets executed, and without any of its settings. So I tried my luck with CrawlerProcess instead:
    from MySpider1.myspider1.spiders.myspider1 import MySpider1
    from MySpider2.myspider2.spiders.myspider2 import MySpider2
    from twisted.internet import defer, reactor
    from scrapy.crawler import CrawlerProcess
    from scrapy.crawler import CrawlerRunner

    spider_settings = [{
        'FEED_URI': 'abc.csv',
        'LOG_FILE': 'abc/log.log'
        # MORE settings are here
    }, {
        'FEED_URI': '123.csv',
        'LOG_FILE': '123/log.log'
        # MORE settings are here
    }]

    spiders = [MySpider1, MySpider2]

    process = CrawlerProcess(spider_settings[0])
    process = CrawlerProcess(spider_settings[1])

    @defer.inlineCallbacks
    def crawl():
        yield process.crawl(spiders[0])
        yield process.crawl(spiders[1])
        reactor.stop()

    crawl()
    reactor.run()
This code executes both spiders, but simultaneously instead of sequentially as intended. On top of that, after about a second it overwrites the settings of spiders[0] with those of spiders[1], so that the first log file is cut off after just two lines and logging for both spiders resumes in 123/log.log.
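My best guess is that the overwriting itself is nothing Scrapy-specific but plain Python name rebinding: both CrawlerProcess instances are assigned to the same process variable, so the first one is discarded before crawl() ever runs (the two lines in abc/log.log presumably come from the first instance configuring logging while it still exists). A minimal illustration of that assumption, with the Scrapy objects replaced by plain dicts:

    # Hypothetical stand-ins for the two CrawlerProcess objects above:
    # the second assignment rebinds the name and the first object is lost,
    # so everything after this point only sees the second settings dict.
    process = {'LOG_FILE': 'abc/log.log'}  # CrawlerProcess(spider_settings[0])
    process = {'LOG_FILE': '123/log.log'}  # CrawlerProcess(spider_settings[1])

    print(process['LOG_FILE'])  # prints '123/log.log' -- both crawls see this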
In an ideal world, my snippet would work as follows:

1. Run spiders[0] with spider_settings[0] and wait until it has finished and written abc.csv.
2. Only then run spiders[1] with spider_settings[1] on the output of spiders[0].
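For illustration, this is the kind of structure I would expect to behave that way: one CrawlerRunner per settings dict, so nothing gets rebound, with the crawls chained through their Deferreds. This is only a sketch under that assumption, not code I have verified (the runner_abc/runner_123 names are mine), and I am not sure whether LOG_FILE even takes effect with CrawlerRunner unless logging is configured manually:

    from twisted.internet import defer, reactor
    from scrapy.crawler import CrawlerRunner

    from MySpider1.myspider1.spiders.myspider1 import MySpider1
    from MySpider2.myspider2.spiders.myspider2 import MySpider2

    # One runner per spider, so neither settings dict overwrites the other.
    runner_abc = CrawlerRunner({'FEED_URI': 'abc.csv', 'LOG_FILE': 'abc/log.log'})
    runner_123 = CrawlerRunner({'FEED_URI': '123.csv', 'LOG_FILE': '123/log.log'})

    @defer.inlineCallbacks
    def crawl():
        # crawl() returns a Deferred; yielding it should keep MySpider2
        # from starting before MySpider1 has finished and written abc.csv.
        yield runner_abc.crawl(MySpider1)
        yield runner_123.crawl(MySpider2)
        reactor.stop()

    crawl()
    reactor.run()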
Thanks in advance for your help.