Python Scrapy extremely slow on a single page

Problem description

I'm new to Scrapy with Splash and would appreciate some advice. I'm trying to scrape https://www.canada.ca/en/revenue-agency/services/forms-publications/forms.html, which contains a list of government forms. My spider worked fine with the Scrapy tutorial I followed: scraping https://quotes.toscrape.com/ takes only a few seconds. But on the site I'm trying now, the request still times out even after I set the timeout for a single page to 300 seconds! I must be doing something wrong, but I can't figure out what.

Here are the spider's settings:

BOT_NAME = "quotes_js_scraper"
SPIDER_MODULES = ["quotes_js_scraper.spiders"]
NEWSPIDER_MODULE = "quotes_js_scraper.spiders"
ROBOTSTXT_OBEY = False
SPLASH_URL = "http://localhost:8050"
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# added this myself after getting a warning that the default
# '2.6' is deprecated
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
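While debugging, it can also help to make failures surface faster on the Scrapy side. These are illustrative values, not settings from the original project:

```python
# Hypothetical debugging values, not from the original settings file:
DOWNLOAD_TIMEOUT = 90  # Scrapy's default is 180 s; fail faster while testing
RETRY_TIMES = 1        # default is 2; fewer retries on 504s means quicker feedback
```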

And here is the spider itself:

from pathlib import Path
import scrapy
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # url = "https://quotes.toscrape.com/"
        url = "https://www.canada.ca/en/revenue-agency/services/forms-publications/forms.html"

        yield SplashRequest(
            url,
            callback=self.parse,
            args={
                "wait": 1,
                "proxy": "http://scrapeops:[email protected]:5353",
                "timeout": 300,
            },
        )

    def parse(self, response):
        filename = "test_output.html"
        Path(filename).write_bytes(response.body)
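One way to narrow the problem down is to request Splash's `render.html` endpoint directly, outside Scrapy, and time it: if that is equally slow, Scrapy itself is not the bottleneck. A minimal sketch using only the standard library (assumes Splash on localhost:8050; the actual fetch is commented out so it only runs against a live Splash instance):

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually fetch

SPLASH_RENDER = "http://localhost:8050/render.html"
params = {
    "url": "https://www.canada.ca/en/revenue-agency/services/forms-publications/forms.html",
    "wait": 1,
    "timeout": 90,
}
render_url = f"{SPLASH_RENDER}?{urlencode(params)}"
# html = urlopen(render_url, timeout=120).read()  # requires a running Splash
print(render_url)
```

Pasting `render_url` into a browser (or `curl`) shows how long Splash alone takes; adding the proxy argument back in then isolates the proxy's contribution.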

Here is the output in the terminal:

2024-01-07 11:39:55 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: quotes_js_scraper)
2024-01-07 11:39:55 [scrapy.utils.log] INFO: Versions: lxml 5.0.1.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.12.1 (tags/v3.12.1:2305ca5, Dec  7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)], pyOpenSSL 23.3.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.7, Platform Windows-11-10.0.22621-SP0
2024-01-07 11:39:55 [scrapy.addons] INFO: Enabled addons:
[]
2024-01-07 11:39:55 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-01-07 11:39:55 [scrapy.extensions.telnet] INFO: Telnet Password: xxxxxxxxxxxxxxxxxxx
2024-01-07 11:39:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-01-07 11:39:55 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'quotes_js_scraper',
 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
 'NEWSPIDER_MODULE': 'quotes_js_scraper.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'SPIDER_MODULES': ['quotes_js_scraper.spiders']}
2024-01-07 11:39:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-01-07 11:39:56 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-01-07 11:39:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-01-07 11:39:56 [scrapy.core.engine] INFO: Spider opened
2024-01-07 11:39:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-07 11:39:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-01-07 11:39:56 [py.warnings] WARNING: C:\Users\gmloo\OneDrive\Documents\Customtech Solutions\Products and Services\Scraping\quotes-js-project\venv\Lib\site-packages\scrapy_splash\dupefilter.py:20: ScrapyDeprecationWarning: Call to deprecated function scrapy.utils.request.request_fingerprint().

If you are using this function in a Scrapy component, and you are OK with users of your component changing the fingerprinting algorithm through settings, use crawler.request_fingerprinter.fingerprint() instead in your Scrapy component (you can get the crawler object from the 'from_crawler' class method).

Otherwise, consider using the scrapy.utils.request.fingerprint() function instead.

Either way, the resulting fingerprints will be returned as bytes, not as a string, and they will also be different from those generated by 'request_fingerprint()'. Before you switch, make sure that you understand the consequences of this (e.g. cache invalidation) and are OK with them; otherwise, consider implementing your own function which returns the same fingerprints as the deprecated 'request_fingerprint()' function.
  fp = request_fingerprint(request, include_headers=include_headers)

2024-01-07 11:40:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-07 11:41:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-07 11:42:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-07 11:43:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-07 11:44:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-07 11:44:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.canada.ca/en/revenue-agency/services/forms-publications/forms.html via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2024-01-07 11:45:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-07 11:46:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...

I've noticed that responses are also very slow when I try to scrape other sites. My computer is slow, but not that slow. Is there any way to speed this up, or is something wrong with my settings? For example, all I want are the links and text on the page; I don't need the contents of any `script` tags. I know I can strip those in the parse method, but that only runs after the response has been generated, right?
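(To illustrate the kind of post-response stripping I mean — a hypothetical standard-library sketch, not something my spider currently does:)

```python
from html.parser import HTMLParser

class ScriptStripper(HTMLParser):
    """Collects visible text and hrefs, skipping <script>/<style> content."""
    def __init__(self):
        super().__init__()
        self.skip = 0      # depth inside script/style tags
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.text.append(data.strip())

parser = ScriptStripper()
parser.feed('<a href="/form">Form T1</a><script>var x = 1;</script><p>Hello</p>')
```

After `feed()`, `parser.links` holds the hrefs and `parser.text` the visible text, with script bodies discarded.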

Thank you.

python scrapy splash-screen
1 Answer

Here are some troubleshooting suggestions:

  1. I see you have set
    ROBOTSTXT_OBEY = False
    Try setting it to True and see whether performance improves.
  2. Check that your output HTML file isn't getting too large; very large writes can slow the spider down noticeably.
  3. The main suspects are probably Splash itself, or the proxy that Splash routes through. I would start with the logs there; if Splash is running in Docker, also check its resource usage.

Hope this helps!
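Expanding on point 3: Splash also caps the `timeout` argument server-side. By default a Splash server rejects timeout values above 60 seconds (its `--max-timeout` setting), so the `args={"timeout": 300}` in the question may never take effect. If Splash runs in Docker, the cap can be raised when starting the container (illustrative command; adjust the port mapping and image tag to your setup):

```shell
# Raise Splash's per-request timeout ceiling to 300 s (the default cap is 60 s)
docker run -p 8050:8050 scrapinghub/splash --max-timeout 300
```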
