How do I use scrapy_splash with a Lua script on the Chromium engine?

Problem description · Votes: 0 · Answers: 1

Hello, I am trying to build a scraping bot for a website that uses JavaScript. I have about 20 URLs from the site, and I want to scale to hundreds; I need to scrape them frequently, so I tried a Lua script to get a "dynamic" wait time. With the default WebKit engine, the site's HTML output is just text saying the browser is unsupported, which is why I use the Chromium engine. Without the Lua script, scraping only yields output items through the Chromium engine, but it does work. After I added the Lua script, I get an error on the Chromium engine, while on WebKit it executes without errors but yields no output items. This is the start_requests I use with Lua:

def start_requests(self):
        lua_script = """
        function main(splash, args)
            assert(splash:go(args.url))

            while not splash:select('div.o-matchRow') do
                -- element not rendered yet; keep waiting
                splash:wait(1)
            end
            return {html=splash:html()}
        end    
        """

        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='execute',
                args={'engine': 'chromium', 'lua_source': lua_script}
            )

I wanted to test this simple thing first. Does anyone know how Lua relates to the Chromium engine, or how to use it when a site doesn't support WebKit? (By the way, sorry for my English, I'm not a native speaker.) These are the errors with the Chromium engine:

2023-12-04 21:23:54 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tipsport_scraper)
2023-12-04 21:23:54 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)
], pyOpenSSL 23.3.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.5, Platform Windows-10-10.0.19045-SP0
2023-12-04 21:23:54 [scrapy.addons] INFO: Enabled addons:                                                               
[]                                                                                                                      
2023-12-04 21:23:54 [asyncio] DEBUG: Using selector: SelectSelector                                                     
2023-12-04 21:23:54 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor     
2023-12-04 21:23:54 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2023-12-04 21:23:54 [scrapy.extensions.telnet] INFO: Telnet Password: **************
2023-12-04 21:23:54 [py.warnings] WARNING: C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\extensions\feedexport.py:406: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been
 deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
  exporter = cls(crawler)

2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2023-12-04 21:23:54 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tipsport_scraper',
 'CONCURRENT_REQUESTS': 5,
 'DOWNLOAD_DELAY': 5,
 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
 'NEWSPIDER_MODULE': 'tipsport_scraper.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'SPIDER_MODULES': ['tipsport_scraper.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled item pipelines:
['tipsport_scraper.pipelines.TipsportScraperPipeline']
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Spider opened
2023-12-04 21:23:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-04 21:23:54 [scrapy.extensions.telnet] INFO: Telnet console listening on **********
2023-12-04 21:23:54 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.tipsport.cz/kurzy/fotbal-16?limit=1000 via http://localhost:8050/execute>
Traceback (most recent call last):
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\core\downloader\middleware.py", line 68, in process_response
    method(request=request, response=response, spider=spider)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 412, in process_response
    response = self._change_response_class(request, response)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 433, in _change_response_class
    response = response.replace(cls=respcls, request=request)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\http\response\__init__.py", line 125, in replace
    return cls(*args, **kwargs)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 120, in __init__
    self._load_from_json()
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 174, in _load_from_json
    error = self.data['info']['error']
TypeError: string indices must be integers, not 'str'
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Closing spider (finished)
2023-12-04 21:23:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1045,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 255,
 'downloader/response_count': 1,
 'downloader/response_status_count/400': 1,
 'elapsed_time_seconds': 0.233518,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 12, 4, 20, 23, 54, 847285, tzinfo=datetime.timezone.utc),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/execute/request_count': 1,
 'splash/execute/response_count/400': 1,
 'start_time': datetime.datetime(2023, 12, 4, 20, 23, 54, 613767, tzinfo=datetime.timezone.utc)}
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Spider closed (finished)

I removed the Telnet password and what looked like an IP address in case they are sensitive, replacing them with *.

python web-scraping lua scrapy scrapy-splash
1 Answer

Score: 0

For Chromium, make sure your Splash setup is configured to handle Chromium requests. Note that Splash's Chromium engine supports only a subset of the Lua scripting API, so a script that runs under WebKit can still be rejected under Chromium. If it still doesn't work, updating Splash may help.

For WebKit, the site appears to be blocking it, so try changing the User-Agent in Scrapy to something more common. Also check that the div.o-matchRow element you wait for in the Lua script actually exists on the page. If it does and you still have problems, put a time limit on the script's wait so it can't get stuck.
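The "time limit" advice above can be sketched as a bounded retry loop in the Lua source. This is only a sketch: the selector is the one from the question, while the 20-try cap and the 1-second step are arbitrary choices of mine.

```python
# Hypothetical bounded-wait version of the question's Lua script:
# instead of looping forever, give up after max_tries seconds and
# return whatever HTML is there, so the script cannot hang.
lua_script = """
function main(splash, args)
    assert(splash:go(args.url))

    local max_tries = 20  -- give up after roughly 20 seconds
    for i = 1, max_tries do
        if splash:select('div.o-matchRow') then
            break
        end
        splash:wait(1)
    end
    return {html = splash:html()}
end
"""
```

The same SplashRequest from the question can then pass this string as `lua_source`.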

The TypeError in the log points to a problem with how the response is being handled: Splash answered with an HTTP 400, and scrapy_splash then failed while parsing the error body. Make sure you handle the data format correctly in your script.
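To see what Splash is actually rejecting (the stats show a 400 from /execute), it can help to bypass scrapy_splash and POST the script to the endpoint directly. A minimal stdlib sketch, assuming Splash is running at localhost:8050 as in the log; the helper names `build_payload` and `splash_execute` are mine:

```python
import json
import urllib.error
import urllib.request

SPLASH_EXECUTE = "http://localhost:8050/execute"  # local Splash, as in the log


def build_payload(url, lua_source, engine="chromium"):
    """JSON body for Splash's /execute endpoint."""
    return {"url": url, "lua_source": lua_source, "engine": engine}


def splash_execute(url, lua_source, engine="chromium"):
    """POST a script to Splash and return (status, body), even on HTTP 400."""
    req = urllib.request.Request(
        SPLASH_EXECUTE,
        data=json.dumps(build_payload(url, lua_source, engine)).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=90) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        # On a 400, the body describes exactly what Splash rejected.
        return err.code, err.read().decode()
```

Calling `splash_execute("https://www.tipsport.cz/kurzy/fotbal-16?limit=1000", lua_script)` and printing the body should show the concrete Splash error behind the opaque TypeError.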
