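There are many ways to start a Scrapy spider from a script (see the docs), but it gets a bit more complicated when you do it inside Celery.
What I want is a function that starts Scrapy using the settings from the settings.py file.
My setup looks like this: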
import os
import traceback

from twisted.internet import reactor
from celery import shared_task
from billiard.context import Process
from scrapy.crawler import CrawlerRunner
from my_spider.spiders.spider import MySpider
from scrapy.utils.project import get_project_settings


@shared_task
def start_scrapy(link):
    run_spider(link)


def run_spider(link):
    def _crawl(spider, *args, **kwargs):
        try:
            os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "my_spider.settings")
            settings = get_project_settings()
            settings.update(
                {
                    "FEEDS": {
                        f"output-{args[0]}.json": {
                            "format": "json",
                            "encoding": "utf-8",
                            "overwrite": True,
                        },
                    },
                }
            )
            runner = CrawlerRunner(settings)
            deferred = runner.crawl(spider, *args, **kwargs)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
        except Exception as e:
            print(f"Exception: {e}")
            traceback.print_exc()

    process = Process(target=_crawl, args=(MySpider, link))
    process.start()
    process.join()
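Without the environment-variable line (os.environ.setdefault) it also runs, but then it does not pick up the settings from settings.py. For some reason the error was silent, so I had to add prints with flush=True to the Scrapy source code. That is how I found the following error: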
The installed reactor (twisted.internet.selectreactor.SelectReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)
The Scrapy settings.py contains:
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
By printing, I verified that get_project_settings returns the expected settings.
I tried calling asyncioreactor.install() before importing the reactor, but that did not work. I am on macOS.
The Celery configuration looks like this:
celery.py
import os
from celery import Celery
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "listings.settings")
app = Celery("listings")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.conf.update(
    worker_concurrency=4,
    worker_prefetch_multiplier=1,
)
app.autodiscover_tasks()
settings.py
REDIS_HOST = "localhost"
REDIS_PORT = 6379
CELERY_BROKER_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}"
CELERY_RESULT_BACKEND = f"redis://{REDIS_HOST}:{REDIS_PORT}"
CELERY_TASK_TRACK_STARTED = True
CELERY_RESULT_EXTENDED = True
CELERY_RESULT_EXPIRES = 360
Package versions:
celery==5.3.6
Scrapy==2.11.1
Twisted==23.10.0
Django==4.2.4
Please help me find a solution to this problem.
Remove the Twisted reactor setting from the settings object: settings.delete("TWISTED_REACTOR")