I'm trying to set up scrapy-selenium for some scraping: I pip installed scrapy and scrapy-selenium, downloaded chromedriver.exe and put it in my project directory, and updated settings.py:
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS=['--headless']
DOWNLOADER_MIDDLEWARES = {
'scrapy_selenium.SeleniumMiddleware': 800
}
I also tried using the full path to the chromedriver location instead of just the which function, but I get this error and I don't know why:
2023-06-20 10:48:59 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\scrapy\crawler.py", line 240, in crawl
return self._crawl(crawler, *args, **kwargs)
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\scrapy\crawler.py", line 244, in _crawl
d = crawler.crawl(*args, **kwargs)
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\twisted\internet\defer.py", line 1947, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\twisted\internet\defer.py", line 1857, in _cancellableInlineCallbacks
_inlineCallbacks(None, gen, status, _copy_context())
--- <exception caught here> ---
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\twisted\internet\defer.py", line 1697, in _inlineCallbacks
result = context.run(gen.send, result)
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\scrapy\crawler.py", line 129, in crawl
self.engine = self._create_engine()
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\scrapy\crawler.py", line 143, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\scrapy\core\engine.py", line 100, in __init__
self.downloader: Downloader = downloader_cls(crawler)
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\scrapy\core\downloader\__init__.py", line 97, in __init__
DownloaderMiddlewareManager.from_crawler(crawler)
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\scrapy\middleware.py", line 68, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\scrapy\middleware.py", line 44, in from_settings
mw = create_instance(mwcls, settings, crawler)
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\scrapy\utils\misc.py", line 170, in create_instance
instance = objcls.from_crawler(crawler, *args, **kwargs)
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\scrapy_selenium\middlewares.py", line 67, in from_crawler
middleware = cls(
File "C:\Users\denis\Desktop\Scrapy_Study\pythonProject\venv\Lib\site-packages\scrapy_selenium\middlewares.py", line 51, in __init__
self.driver = driver_klass(**driver_kwargs)
builtins.TypeError: WebDriver.__init__() got an unexpected keyword argument 'executable_path'
Can anyone help me figure this out?
I got help solving this in this GitHub issue: https://github.com/clemfromspace/scrapy-selenium/issues/128. Note that I'm using Scrapy to build the web scraper and Selenium to interact with the website.
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = None  # not actually necessary; works even if you comment this line out
SELENIUM_DRIVER_ARGUMENTS = []  # put '--headless' in the brackets to prevent the browser popup
scrapy runspider <scraper_name>.py
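As a side note on the original settings: `which('chromedriver')` only searches the directories on `PATH`, so it returns `None` when chromedriver.exe merely sits in the project directory, and the setting silently ends up as `None`. A minimal stdlib check (the binary name below is deliberately made up):

```python
from shutil import which

# which() searches the directories on PATH only; an executable sitting in
# the project directory is not found unless that directory is on PATH.
result = which("made-up-binary-xyz")
print(result)  # None: the name is not on PATH
```

If this prints `None` for `'chromedriver'` on your machine, either add the project directory to `PATH` or pass the full path explicitly.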
And enjoy! A quick explanation of what's going on: Selenium switched from taking an executable_path to taking a Service object when creating the webdriver. That change is not reflected in the current release of the scrapy-selenium package. To fix it, I suggest:
Fork the project on GitHub: https://github.com/clemfromspace/scrapy-selenium/fork
In scrapy_selenium/middlewares.py, create a Service object and pass it instead of executable_path when creating the webdriver object (similar to the changes in this PR: https://github.com/clemfromspace/scrapy-selenium/pull/135/files).
if driver_executable_path is not None:
    service_module = import_module(f'{webdriver_base_path}.service')
    service_klass = getattr(service_module, 'Service')
    service_kwargs = {
        'executable_path': driver_executable_path,
    }
    service = service_klass(**service_kwargs)
    driver_kwargs = {
        'service': service,
        'options': driver_options
    }
    self.driver = driver_klass(**driver_kwargs)
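For reference, outside the middleware the same Selenium 4 pattern looks like the sketch below (assuming selenium 4.x is installed; the chromedriver path is a hypothetical placeholder):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("--headless")  # same flag as SELENIUM_DRIVER_ARGUMENTS

# Selenium 4: the driver binary path goes into a Service object, not into
# WebDriver.__init__ as executable_path.
service = Service(executable_path="/path/to/chromedriver")  # hypothetical path
driver = webdriver.Chrome(service=service, options=options)
```

Passing executable_path directly to webdriver.Chrome is exactly what raises the TypeError from the question.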
Run the unit tests with python -m unittest discover -p "test_*.py" to confirm everything still works as expected.
Commit and push your changes.
pip uninstall scrapy-selenium
pip install git+{https://your_repository}
Note: when setting the package up in a project, you can use the same configuration in settings.py.