scrapy-splash seems to work fine, but the dynamically loaded content is not saved


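I'm playing with scrapy-splash to learn how to handle dynamically loaded content, as described in the documentation.

While debugging, response.body appears to contain the correct content populated by JavaScript. However, the saved file contains only the original HTML. Please help me figure out what I'm missing here.

I tried running and debugging the code in VSCode to inspect the value stored in response.body.

My code: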

from pathlib import Path

import scrapy
from scrapy_splash import SplashRequest


class SplashSpider(scrapy.Spider):
    name = "splash_spdr"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/scroll',
        ]

        for url in urls:
            yield SplashRequest(url, self.parse,
                args={
                    # optional; parameters passed to Splash HTTP API
                    'wait': 0.5,

                    # 'url' is prefilled from request url
                    # 'http_method' is set to 'POST' for POST requests
                    # 'body' is set to request body for POST requests
                },
                # endpoint='render.json', # optional; default is render.html
                # splash_url='<URL>', # optional; overrides SPLASH_URL
                # slot_policy=SlotPolicy.PER_DOMAIN, # optional;
            )
    
    def parse(self, response):
        page = response.url.split("/")[-1]
        filename = f'quotes-{page}.html'
        Path(filename).write_bytes(response.body)
        self.log(f'Saved file {filename}')

The log:

2023-04-30 21:03:25 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: tutorial)
2023-04-30 21:03:25 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0], pyOpenSSL 23.1.1 (OpenSSL 3.1.0 14 Mar 2023), cryptography 40.0.2, Platform Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
2023-04-30 21:03:25 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'BOT_NAME': 'tutorial',
 'DOWNLOAD_DELAY': 3,
 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'HTTPCACHE_ENABLED': True,
 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-04-30 21:03:25 [asyncio] DEBUG: Using selector: EpollSelector
2023-04-30 21:03:25 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-04-30 21:03:25 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2023-04-30 21:03:25 [scrapy.extensions.telnet] INFO: Telnet Password: e7388e1e607a34a1
2023-04-30 21:03:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2023-04-30 21:03:25 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'tutorial.middlewares.TutorialDownloaderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2023-04-30 21:03:25 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'tutorial.middlewares.TutorialSpiderMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-04-30 21:03:25 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-04-30 21:03:25 [scrapy.core.engine] INFO: Spider opened
2023-04-30 21:03:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-04-30 21:03:25 [splash_spdr] INFO: Spider opened: splash_spdr
2023-04-30 21:03:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in /home/kkas/scrapy_tutorial/tutorial/.scrapy/httpcache
2023-04-30 21:03:25 [splash_spdr] INFO: Spider opened: splash_spdr
2023-04-30 21:03:25 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2023-04-30 21:03:25 [py.warnings] WARNING: /home/kkas/scrapy_tutorial/.venv/lib/python3.10/site-packages/scrapy_splash/dupefilter.py:20: ScrapyDeprecationWarning: Call to deprecated function scrapy.utils.request.request_fingerprint().

If you are using this function in a Scrapy component, and you are OK with users of your component changing the fingerprinting algorithm through settings, use crawler.request_fingerprinter.fingerprint() instead in your Scrapy component (you can get the crawler object from the 'from_crawler' class method).

Otherwise, consider using the scrapy.utils.request.fingerprint() function instead.

Either way, the resulting fingerprints will be returned as bytes, not as a string, and they will also be different from those generated by 'request_fingerprint()'. Before you switch, make sure that you understand the consequences of this (e.g. cache invalidation) and are OK with them; otherwise, consider implementing your own function which returns the same fingerprints as the deprecated 'request_fingerprint()' function.
  fp = request_fingerprint(request, include_headers=include_headers)

2023-04-30 21:03:25 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None) ['cached']
2023-04-30 21:03:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/scroll via http://localhost:8050/render.html> (referer: None) ['cached']
2023-04-30 21:03:47 [splash_spdr] DEBUG: b'<!DOCTYPE html><html lang="en"><head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n<div class="row">\n    <div class="col-md-8">\n        <div class="quotes"><div class="quote"><span class="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\x9d</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">change</a> <a class="tag">deep-thoughts</a> <a class="tag">thinking</a> <a class="tag">world</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cIt is our choices, Harry, that show what we truly are, far more than our abilities.\xe2\x80\x9d</span><span>by <small class="author">J.K. Rowling</small></span><div class="tags">Tags: <a class="tag">abilities</a> <a class="tag">choices</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cThere are only two ways to live your life. One is as though nothing is a miracle. 
The other is as though everything is a miracle.\xe2\x80\x9d</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">inspirational</a> <a class="tag">life</a> <a class="tag">live</a> <a class="tag">miracle</a> <a class="tag">miracles</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\xe2\x80\x9d</span><span>by <small class="author">Jane Austen</small></span><div class="tags">Tags: <a class="tag">aliteracy</a> <a class="tag">books</a> <a class="tag">classic</a> <a class="tag">humor</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cImperfection is beauty, madness is genius and it\'s better to be absolutely ridiculous than absolutely boring.\xe2\x80\x9d</span><span>by <small class="author">Marilyn Monroe</small></span><div class="tags">Tags: <a class="tag">be-yourself</a> <a class="tag">inspirational</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cTry not to become a man of success. Rather become a man of value.\xe2\x80\x9d</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">adulthood</a> <a class="tag">success</a> <a class="tag">value</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cIt is better to be hated for what you are than to be loved for what you are not.\xe2\x80\x9d</span><span>by <small class="author">Andr\xc3\xa9 Gide</small></span><div class="tags">Tags: <a class="tag">life</a> <a class="tag">love</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cI have not failed. I\'ve just found 10,000 ways that won\'t work.\xe2\x80\x9d</span><span>by <small class="author">Thomas A. 
Edison</small></span><div class="tags">Tags: <a class="tag">edison</a> <a class="tag">failure</a> <a class="tag">inspirational</a> <a class="tag">paraphrased</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cA woman is like a tea bag; you never know how strong it is until it\'s in hot water.\xe2\x80\x9d</span><span>by <small class="author">Eleanor Roosevelt</small></span><div class="tags">Tags: <a class="tag">misattributed-eleanor-roosevelt</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cA day without sunshine is like, you know, night.\xe2\x80\x9d</span><span>by <small class="author">Steve Martin</small></span><div class="tags">Tags: <a class="tag">humor</a> <a class="tag">obvious</a> <a class="tag">simile</a></div></div></div>\n    </div>\n</div>\n<div id="loading" style="background-color: rgb(238, 238, 204); display: none;"><h5>Loading...</h5></div>\n<script src="/static/jquery.js"></script>\n<script>\n    $(function(){\n        var page = 1, tag = null, hasNextPage = true;\n        function appendQuotes(quotes) {\n            var $quotes = $(\'.quotes\');\n            var html = $.map(quotes, function(d){\n                var tags = $.map(d[\'tags\'], function(t) {\n                    return "<a class=\'tag\'>" + t + "</a>";\n                }).join(" ");\n                return "<div class=\'quote\'><span class=\'text\'>" + d[\'text\'] + "</span><span>by <small class=\'author\'>" + d[\'author\'][\'name\'] + "</small></span><div class=\'tags\'>Tags: " + tags + "</div></div>";\n            });\n\n            $quotes.append(html);\n        }\n\n        function updatePage(page) {\n            $(\'#loading\').show(\'fast\');\n            $.get(\'/api/quotes\', {page: page}).done(function(data) {\n                appendQuotes(data.quotes);\n                hasNextPage = data.has_next;\n                $(\'#loading\').hide(\'fast\');\n            });\n        }\n        updatePage(page);\n        $(window).on(\'scroll\', function(){\n 
           var scrollTop = $(window).scrollTop();\n            var heightDiff = $(document).height() - $(window).height();\n            if (hasNextPage && Math.abs(scrollTop - heightDiff) <= 1){\n                page += 1;\n                console.log(\'scrolling to page: \' + page);\n                updatePage(page);\n            }\n        });\n    });\n</script>\n\n    </div>\n    <footer class="footer">\n        <div class="container">\n            <p class="text-muted">\n                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>\n            </p>\n            <p class="copyright">\n                Made with <span class="sh-red">\xe2\x9d\xa4</span> by <a href="https://scrapinghub.com">Scrapinghub</a>\n            </p>\n        </div>\n    </footer>\n\n</body></html>'
2023-04-30 21:04:58 [splash_spdr] DEBUG: Saved file quotes-scroll.html
2023-04-30 21:04:58 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2023-04-30 21:04:58 [scrapy.core.engine] INFO: Closing spider (finished)
2023-04-30 21:04:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 751,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 6414,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 93.09357,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 4, 30, 12, 4, 58, 579719),
 'httpcache/hit': 2,
 'log_count/DEBUG': 8,
 'log_count/INFO': 13,
 'log_count/WARNING': 1,
 'memusage/max': 83218432,
 'memusage/startup': 79970304,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/200': 1,
 'start_time': datetime.datetime(2023, 4, 30, 12, 3, 25, 486149)}
2023-04-30 21:04:58 [scrapy.core.engine] INFO: Spider closed (finished)

The saved file:

<!DOCTYPE html><html lang="en"><head>
    <meta charset="UTF-8">
    <title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    
<div class="row">
    <div class="col-md-8">
        <div class="quotes"><div class="quote"><span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">change</a> <a class="tag">deep-thoughts</a> <a class="tag">thinking</a> <a class="tag">world</a></div></div><div class="quote"><span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span><span>by <small class="author">J.K. Rowling</small></span><div class="tags">Tags: <a class="tag">abilities</a> <a class="tag">choices</a></div></div><div class="quote"><span class="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">inspirational</a> <a class="tag">life</a> <a class="tag">live</a> <a class="tag">miracle</a> <a class="tag">miracles</a></div></div><div class="quote"><span class="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span><span>by <small class="author">Jane Austen</small></span><div class="tags">Tags: <a class="tag">aliteracy</a> <a class="tag">books</a> <a class="tag">classic</a> <a class="tag">humor</a></div></div><div class="quote"><span class="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span><span>by <small class="author">Marilyn Monroe</small></span><div class="tags">Tags: <a class="tag">be-yourself</a> <a class="tag">inspirational</a></div></div><div class="quote"><span class="text">“Try not to become a man of success. 
Rather become a man of value.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">adulthood</a> <a class="tag">success</a> <a class="tag">value</a></div></div><div class="quote"><span class="text">“It is better to be hated for what you are than to be loved for what you are not.”</span><span>by <small class="author">André Gide</small></span><div class="tags">Tags: <a class="tag">life</a> <a class="tag">love</a></div></div><div class="quote"><span class="text">“I have not failed. I've just found 10,000 ways that won't work.”</span><span>by <small class="author">Thomas A. Edison</small></span><div class="tags">Tags: <a class="tag">edison</a> <a class="tag">failure</a> <a class="tag">inspirational</a> <a class="tag">paraphrased</a></div></div><div class="quote"><span class="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span><span>by <small class="author">Eleanor Roosevelt</small></span><div class="tags">Tags: <a class="tag">misattributed-eleanor-roosevelt</a></div></div><div class="quote"><span class="text">“A day without sunshine is like, you know, night.”</span><span>by <small class="author">Steve Martin</small></span><div class="tags">Tags: <a class="tag">humor</a> <a class="tag">obvious</a> <a class="tag">simile</a></div></div></div>
    </div>
</div>
<div id="loading" style="background-color: rgb(238, 238, 204); display: none;"><h5>Loading...</h5></div>
<script src="/static/jquery.js"></script>
<script>
    $(function(){
        var page = 1, tag = null, hasNextPage = true;
        function appendQuotes(quotes) {
            var $quotes = $('.quotes');
            var html = $.map(quotes, function(d){
                var tags = $.map(d['tags'], function(t) {
                    return "<a class='tag'>" + t + "</a>";
                }).join(" ");
                return "<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>";
            });

            $quotes.append(html);
        }

        function updatePage(page) {
            $('#loading').show('fast');
            $.get('/api/quotes', {page: page}).done(function(data) {
                appendQuotes(data.quotes);
                hasNextPage = data.has_next;
                $('#loading').hide('fast');
            });
        }
        updatePage(page);
        $(window).on('scroll', function(){
            var scrollTop = $(window).scrollTop();
            var heightDiff = $(document).height() - $(window).height();
            if (hasNextPage && Math.abs(scrollTop - heightDiff) <= 1){
                page += 1;
                console.log('scrolling to page: ' + page);
                updatePage(page);
            }
        });
    });
</script>

    </div>
    <footer class="footer">
        <div class="container">
            <p class="text-muted">
                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
            </p>
            <p class="copyright">
                Made with <span class="sh-red">❤</span> by <a href="https://scrapinghub.com">Scrapinghub</a>
            </p>
        </div>
    </footer>

</body></html>
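
A quick way to compare the two outputs above is to count the quote blocks each one contains. The following is a minimal sketch, not part of the original spider: `count_quotes` and `QUOTE_DIV` are hypothetical names, and it assumes that every rendered quote is serialized as `<div class="quote">` with double quotes, as in the saved file above. The single-quoted `<div class='quote'>` template inside the page's script is deliberately not matched.

```python
import re
from pathlib import Path

# Hypothetical helper, not part of Scrapy or scrapy-splash.
# Matches only the double-quoted form that appears in the rendered DOM;
# the enclosing <div class="quotes"> container does not match because of the "s".
QUOTE_DIV = re.compile(rb'<div class="quote">')


def count_quotes(html: bytes) -> int:
    """Return the number of rendered quote blocks found in raw HTML bytes."""
    return len(QUOTE_DIV.findall(html))


if __name__ == "__main__":
    # File name taken from the spider's parse() method above.
    saved = Path("quotes-scroll.html")
    if saved.exists():
        print(f"{saved}: {count_quotes(saved.read_bytes())} quote blocks")
    else:
        print(f"{saved} not found; run the spider first")
```

Calling `count_quotes(response.body)` inside `parse` and comparing it with the count for the file on disk would show whether the two actually differ.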