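I'm playing with scrapy-splash to learn how to handle dynamically loaded content, as described in the docs.
When debugging, response.body appears to contain the correct content populated by JavaScript. However, the saved file contains only the raw HTML. Please help me figure out what I'm missing here.
I tried running and debugging the code in VSCode to inspect the value stored in response.body.
My code: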
from pathlib import Path

import scrapy
from scrapy_splash import SplashRequest


class SplashSpider(scrapy.Spider):
    name = "splash_spdr"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/scroll',
        ]
        for url in urls:
            yield SplashRequest(url, self.parse,
                args={
                    # optional; parameters passed to Splash HTTP API
                    'wait': 0.5,
                    # 'url' is prefilled from request url
                    # 'http_method' is set to 'POST' for POST requests
                    # 'body' is set to request body for POST requests
                },
                # endpoint='render.json',  # optional; default is render.html
                # splash_url='<URL>',  # optional; overrides SPLASH_URL
                # slot_policy=SlotPolicy.PER_DOMAIN,  # optional;
            )

    def parse(self, response):
        # Name the output file after the last URL path segment (here: "scroll").
        page = response.url.split("/")[-1]
        filename = f'quotes-{page}.html'
        Path(filename).write_bytes(response.body)
        self.log(f'Saved file {filename}')
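One way to rule out a problem in the write step itself is to read the file back and compare it with the bytes that were passed in. The sketch below uses only the standard library; `save_and_verify` is a hypothetical helper name, not part of Scrapy or scrapy-splash, and in the spider it would be called with `response.body` in place of `Path(filename).write_bytes(response.body)`.

```python
import hashlib
from pathlib import Path


def save_and_verify(filename: str, body: bytes) -> bool:
    """Write body to filename, read it back, and report whether the bytes match.

    Hypothetical helper for debugging; not a Scrapy API.
    """
    path = Path(filename)
    path.write_bytes(body)
    written = path.read_bytes()
    # Compare digests so the check stays cheap to log even for large pages.
    return hashlib.sha256(written).digest() == hashlib.sha256(body).digest()
```

If this returns True, the file on disk is byte-for-byte what `parse` received, so any difference would have to come from how the two are being viewed or from which response reached `parse`.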
The log:
2023-04-30 21:03:25 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: tutorial)
2023-04-30 21:03:25 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0], pyOpenSSL 23.1.1 (OpenSSL 3.1.0 14 Mar 2023), cryptography 40.0.2, Platform Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
2023-04-30 21:03:25 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'BOT_NAME': 'tutorial',
'DOWNLOAD_DELAY': 3,
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'FEED_EXPORT_ENCODING': 'utf-8',
'HTTPCACHE_ENABLED': True,
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'tutorial.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['tutorial.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-04-30 21:03:25 [asyncio] DEBUG: Using selector: EpollSelector
2023-04-30 21:03:25 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-04-30 21:03:25 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2023-04-30 21:03:25 [scrapy.extensions.telnet] INFO: Telnet Password: e7388e1e607a34a1
2023-04-30 21:03:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2023-04-30 21:03:25 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'tutorial.middlewares.TutorialDownloaderMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2023-04-30 21:03:25 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'tutorial.middlewares.TutorialSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-04-30 21:03:25 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-04-30 21:03:25 [scrapy.core.engine] INFO: Spider opened
2023-04-30 21:03:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-04-30 21:03:25 [splash_spdr] INFO: Spider opened: splash_spdr
2023-04-30 21:03:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in /home/kkas/scrapy_tutorial/tutorial/.scrapy/httpcache
2023-04-30 21:03:25 [splash_spdr] INFO: Spider opened: splash_spdr
2023-04-30 21:03:25 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2023-04-30 21:03:25 [py.warnings] WARNING: /home/kkas/scrapy_tutorial/.venv/lib/python3.10/site-packages/scrapy_splash/dupefilter.py:20: ScrapyDeprecationWarning: Call to deprecated function scrapy.utils.request.request_fingerprint().
If you are using this function in a Scrapy component, and you are OK with users of your component changing the fingerprinting algorithm through settings, use crawler.request_fingerprinter.fingerprint() instead in your Scrapy component (you can get the crawler object from the 'from_crawler' class method).
Otherwise, consider using the scrapy.utils.request.fingerprint() function instead.
Either way, the resulting fingerprints will be returned as bytes, not as a string, and they will also be different from those generated by 'request_fingerprint()'. Before you switch, make sure that you understand the consequences of this (e.g. cache invalidation) and are OK with them; otherwise, consider implementing your own function which returns the same fingerprints as the deprecated 'request_fingerprint()' function.
fp = request_fingerprint(request, include_headers=include_headers)
2023-04-30 21:03:25 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None) ['cached']
2023-04-30 21:03:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/scroll via http://localhost:8050/render.html> (referer: None) ['cached']
2023-04-30 21:03:47 [splash_spdr] DEBUG: b'<!DOCTYPE html><html lang="en"><head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n <link rel="stylesheet" href="/static/bootstrap.min.css">\n <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n <div class="container">\n <div class="row header-box">\n <div class="col-md-8">\n <h1>\n <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n </h1>\n </div>\n <div class="col-md-4">\n <p>\n \n <a href="/login">Login</a>\n \n </p>\n </div>\n </div>\n \n<div class="row">\n <div class="col-md-8">\n <div class="quotes"><div class="quote"><span class="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\x9d</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">change</a> <a class="tag">deep-thoughts</a> <a class="tag">thinking</a> <a class="tag">world</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cIt is our choices, Harry, that show what we truly are, far more than our abilities.\xe2\x80\x9d</span><span>by <small class="author">J.K. Rowling</small></span><div class="tags">Tags: <a class="tag">abilities</a> <a class="tag">choices</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cThere are only two ways to live your life. One is as though nothing is a miracle. 
The other is as though everything is a miracle.\xe2\x80\x9d</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">inspirational</a> <a class="tag">life</a> <a class="tag">live</a> <a class="tag">miracle</a> <a class="tag">miracles</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\xe2\x80\x9d</span><span>by <small class="author">Jane Austen</small></span><div class="tags">Tags: <a class="tag">aliteracy</a> <a class="tag">books</a> <a class="tag">classic</a> <a class="tag">humor</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cImperfection is beauty, madness is genius and it\'s better to be absolutely ridiculous than absolutely boring.\xe2\x80\x9d</span><span>by <small class="author">Marilyn Monroe</small></span><div class="tags">Tags: <a class="tag">be-yourself</a> <a class="tag">inspirational</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cTry not to become a man of success. Rather become a man of value.\xe2\x80\x9d</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">adulthood</a> <a class="tag">success</a> <a class="tag">value</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cIt is better to be hated for what you are than to be loved for what you are not.\xe2\x80\x9d</span><span>by <small class="author">Andr\xc3\xa9 Gide</small></span><div class="tags">Tags: <a class="tag">life</a> <a class="tag">love</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cI have not failed. I\'ve just found 10,000 ways that won\'t work.\xe2\x80\x9d</span><span>by <small class="author">Thomas A. 
Edison</small></span><div class="tags">Tags: <a class="tag">edison</a> <a class="tag">failure</a> <a class="tag">inspirational</a> <a class="tag">paraphrased</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cA woman is like a tea bag; you never know how strong it is until it\'s in hot water.\xe2\x80\x9d</span><span>by <small class="author">Eleanor Roosevelt</small></span><div class="tags">Tags: <a class="tag">misattributed-eleanor-roosevelt</a></div></div><div class="quote"><span class="text">\xe2\x80\x9cA day without sunshine is like, you know, night.\xe2\x80\x9d</span><span>by <small class="author">Steve Martin</small></span><div class="tags">Tags: <a class="tag">humor</a> <a class="tag">obvious</a> <a class="tag">simile</a></div></div></div>\n </div>\n</div>\n<div id="loading" style="background-color: rgb(238, 238, 204); display: none;"><h5>Loading...</h5></div>\n<script src="/static/jquery.js"></script>\n<script>\n $(function(){\n var page = 1, tag = null, hasNextPage = true;\n function appendQuotes(quotes) {\n var $quotes = $(\'.quotes\');\n var html = $.map(quotes, function(d){\n var tags = $.map(d[\'tags\'], function(t) {\n return "<a class=\'tag\'>" + t + "</a>";\n }).join(" ");\n return "<div class=\'quote\'><span class=\'text\'>" + d[\'text\'] + "</span><span>by <small class=\'author\'>" + d[\'author\'][\'name\'] + "</small></span><div class=\'tags\'>Tags: " + tags + "</div></div>";\n });\n\n $quotes.append(html);\n }\n\n function updatePage(page) {\n $(\'#loading\').show(\'fast\');\n $.get(\'/api/quotes\', {page: page}).done(function(data) {\n appendQuotes(data.quotes);\n hasNextPage = data.has_next;\n $(\'#loading\').hide(\'fast\');\n });\n }\n updatePage(page);\n $(window).on(\'scroll\', function(){\n var scrollTop = $(window).scrollTop();\n var heightDiff = $(document).height() - $(window).height();\n if (hasNextPage && Math.abs(scrollTop - heightDiff) <= 1){\n page += 1;\n console.log(\'scrolling to page: \' + page);\n 
updatePage(page);\n }\n });\n });\n</script>\n\n </div>\n <footer class="footer">\n <div class="container">\n <p class="text-muted">\n Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>\n </p>\n <p class="copyright">\n Made with <span class="sh-red">\xe2\x9d\xa4</span> by <a href="https://scrapinghub.com">Scrapinghub</a>\n </p>\n </div>\n </footer>\n\n</body></html>'
2023-04-30 21:04:58 [splash_spdr] DEBUG: Saved file quotes-scroll.html
2023-04-30 21:04:58 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2023-04-30 21:04:58 [scrapy.core.engine] INFO: Closing spider (finished)
2023-04-30 21:04:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 751,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 6414,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 93.09357,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 4, 30, 12, 4, 58, 579719),
'httpcache/hit': 2,
'log_count/DEBUG': 8,
'log_count/INFO': 13,
'log_count/WARNING': 1,
'memusage/max': 83218432,
'memusage/startup': 79970304,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2023, 4, 30, 12, 3, 25, 486149)}
2023-04-30 21:04:58 [scrapy.core.engine] INFO: Spider closed (finished)
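Both `Crawled` lines in this log end with `['cached']`, and the stats show `'httpcache/hit': 2`, so with `'HTTPCACHE_ENABLED': True` this run appears to have been served from the HTTP cache rather than from a fresh Splash render. The sketch below is a small standard-library helper for spotting that in a log; `cached_crawls` and `CRAWLED_RE` are hypothetical names, and the regular expression assumes the line layout shown in the log above.

```python
import re

# Assumed layout of a Scrapy engine line, taken from the log above:
#   ... DEBUG: Crawled (200) <GET https://example.com/> (referer: None) ['cached']
CRAWLED_RE = re.compile(r"Crawled \((\d{3})\) <(\w+) (.+?)> \(referer: .*?\)(.*)$")


def cached_crawls(log_text: str) -> list[tuple[int, str]]:
    """Return (status, request description) for every crawl flagged as cached."""
    hits = []
    for line in log_text.splitlines():
        match = CRAWLED_RE.search(line)
        # The flags list, if present, trails the "(referer: ...)" part.
        if match and "'cached'" in match.group(4):
            hits.append((int(match.group(1)), match.group(3)))
    return hits
```

If cached responses turn out to be the concern, setting `HTTPCACHE_ENABLED` to `False` (or clearing the `.scrapy/httpcache` directory named in the log) should force a fresh request on the next run.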
The saved file:
<!DOCTYPE html><html lang="en"><head>
<meta charset="UTF-8">
<title>Quotes to Scrape</title>
<link rel="stylesheet" href="/static/bootstrap.min.css">
<link rel="stylesheet" href="/static/main.css">
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quotes"><div class="quote"><span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">change</a> <a class="tag">deep-thoughts</a> <a class="tag">thinking</a> <a class="tag">world</a></div></div><div class="quote"><span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span><span>by <small class="author">J.K. Rowling</small></span><div class="tags">Tags: <a class="tag">abilities</a> <a class="tag">choices</a></div></div><div class="quote"><span class="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">inspirational</a> <a class="tag">life</a> <a class="tag">live</a> <a class="tag">miracle</a> <a class="tag">miracles</a></div></div><div class="quote"><span class="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span><span>by <small class="author">Jane Austen</small></span><div class="tags">Tags: <a class="tag">aliteracy</a> <a class="tag">books</a> <a class="tag">classic</a> <a class="tag">humor</a></div></div><div class="quote"><span class="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span><span>by <small class="author">Marilyn Monroe</small></span><div class="tags">Tags: <a class="tag">be-yourself</a> <a class="tag">inspirational</a></div></div><div class="quote"><span class="text">“Try not to become a man of success. 
Rather become a man of value.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">adulthood</a> <a class="tag">success</a> <a class="tag">value</a></div></div><div class="quote"><span class="text">“It is better to be hated for what you are than to be loved for what you are not.”</span><span>by <small class="author">André Gide</small></span><div class="tags">Tags: <a class="tag">life</a> <a class="tag">love</a></div></div><div class="quote"><span class="text">“I have not failed. I've just found 10,000 ways that won't work.”</span><span>by <small class="author">Thomas A. Edison</small></span><div class="tags">Tags: <a class="tag">edison</a> <a class="tag">failure</a> <a class="tag">inspirational</a> <a class="tag">paraphrased</a></div></div><div class="quote"><span class="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span><span>by <small class="author">Eleanor Roosevelt</small></span><div class="tags">Tags: <a class="tag">misattributed-eleanor-roosevelt</a></div></div><div class="quote"><span class="text">“A day without sunshine is like, you know, night.”</span><span>by <small class="author">Steve Martin</small></span><div class="tags">Tags: <a class="tag">humor</a> <a class="tag">obvious</a> <a class="tag">simile</a></div></div></div>
</div>
</div>
<div id="loading" style="background-color: rgb(238, 238, 204); display: none;"><h5>Loading...</h5></div>
<script src="/static/jquery.js"></script>
<script>
$(function(){
var page = 1, tag = null, hasNextPage = true;
function appendQuotes(quotes) {
var $quotes = $('.quotes');
var html = $.map(quotes, function(d){
var tags = $.map(d['tags'], function(t) {
return "<a class='tag'>" + t + "</a>";
}).join(" ");
return "<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>";
});
$quotes.append(html);
}
function updatePage(page) {
$('#loading').show('fast');
$.get('/api/quotes', {page: page}).done(function(data) {
appendQuotes(data.quotes);
hasNextPage = data.has_next;
$('#loading').hide('fast');
});
}
updatePage(page);
$(window).on('scroll', function(){
var scrollTop = $(window).scrollTop();
var heightDiff = $(document).height() - $(window).height();
if (hasNextPage && Math.abs(scrollTop - heightDiff) <= 1){
page += 1;
console.log('scrolling to page: ' + page);
updatePage(page);
}
});
});
</script>
</div>
<footer class="footer">
<div class="container">
<p class="text-muted">
Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
Made with <span class="sh-red">❤</span> by <a href="https://scrapinghub.com">Scrapinghub</a>
</p>
</div>
</footer>
</body></html>
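Since the inline script fills `<div class="quotes">` from `/api/quotes`, one way to tell rendered from unrendered HTML is to count the `<div class="quote">` blocks outside the `<script>` element. The sketch below does that with the standard library's `html.parser`; `QuoteCounter` and `count_quotes` are hypothetical names, and the assumption that an unrendered page has an empty `div.quotes` is inferred from the script above rather than checked against the live site.

```python
from html.parser import HTMLParser


class QuoteCounter(HTMLParser):
    """Count <div> start tags whose class list contains "quote"."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Split the class attribute so "quotes" (the container) is not counted.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.count += 1


def count_quotes(body: bytes) -> int:
    """Return the number of rendered quote blocks in a UTF-8 HTML document."""
    parser = QuoteCounter()
    # HTMLParser treats <script> contents as raw text, so the HTML string
    # literals inside the page's inline script are not counted.
    parser.feed(body.decode("utf-8"))
    parser.close()
    return parser.count
```

It can be run on both `response.body` inside `parse` and on `Path('quotes-scroll.html').read_bytes()`. A count of zero would point to unrendered HTML; a nonzero count means the JavaScript-populated quotes made it into the file.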