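I'm trying to follow this tutorial, but when I run my first spider, the item_scraped_count stat is missing from the stats output.
When I run the fetch, response, etc. commands in the Scrapy shell, I get data.
len(books) shows 20, as it should.
Is there any other way to check whether the spider gets the correct number of objects?
Is there something I need to fix in settings.py or in other files?
Messages from the terminal: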
In [3]: fetch('https://books.toscrape.com')
2023-05-06 21:14:45 [asyncio] DEBUG: Using selector: SelectSelector
2023-05-06 21:14:45 [scrapy.core.engine] INFO: Spider opened
2023-05-06 21:14:45 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://books.toscrape.com/robots.txt> (referer: None)
2023-05-06 21:14:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com> (referer: None)
In [4]: response
Out[4]: <200 https://books.toscrape.com>
In [5]: response.css('article.product_pod')
2023-05-06 21:15:31 [py.warnings] WARNING: C:\Users\****\****\****\venv\Lib\site-packages\scrapy\selector\unified.py:83: UserWarning: Selector got both text and root, root is being ignored.
super().__init__(text=text, type=st, root=root, **kwargs)
Out[5]:
[<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>,
<Selector query="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n ...'>]
In [6]: response.css('article.product_pod').get()
Out[6]: '<article class="product_pod">\n \n <div class="image_container">\n \n \n <a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic" class="thumbnail"></a>\n \n \n </div>\n \n\n \n \n <p class="star-rating Three">\n <i class="icon-star"></i>\n <i class="icon-star"></i>\n <i class="icon-star"></i>\n <i class="icon-star"></i>\n <i class="icon-star"></i>\n </p>\n \n \n\n \n <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>\n \n\n \n <div class="product_price">\n \n\n\n\n\n\n\n \n <p class="price_color">£51.77</p>\n \n\n<p class="instock availability">\n <i class="icon-ok"></i>\n \n In stock\n \n</p>\n\n
\n \n\n\n\n\n\n\n \n <form>\n <button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">Add to basket</button>\n </form>\n\n\n \n </div>\n \n </article>'
In [7]: books = response.css('article.product_pod')
In [8]: len(books)
Out[8]: 20
What does this warning mean?
In [5]: response.css('article.product_pod')
2023-05-06 21:15:31 [py.warnings] WARNING: C:\Users\****\****\****\venv\Lib\site-packages\scrapy\selector\unified.py:83: UserWarning: Selector got both text and root, root is being ignored.
super().__init__(text=text, type=st, root=root, **kwargs)
The spider:
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        # Each book on the listing page is an <article class="product_pod">
        books = response.css('article.product_pod')
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href'],
            }
Terminal output when running the spider:
(venv) PS C:\Users\****\****\****\venv\bookscraper> scrapy crawl bookspider
2023-05-06 21:55:31 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: bookscraper)
2023-05-06 21:55:31 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.11.3 (tags/v3.11.3:f3909b8, Apr 4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)], pyOpenSSL 23.1.1 (OpenSSL 3.1.0 14 Mar 2023), cryptography 40.0.2, Platform Windows-10-10.0.22621-SP0
2023-05-06 21:55:31 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'bookscraper',
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'bookscraper.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['bookscraper.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-05-06 21:55:31 [asyncio] DEBUG: Using selector: SelectSelector
2023-05-06 21:55:31 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-05-06 21:55:31 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2023-05-06 21:55:31 [scrapy.extensions.telnet] INFO: Telnet Password: c24ceb1a81dfd3bc
2023-05-06 21:55:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2023-05-06 21:55:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
{'downloader/request_bytes': 446,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 51751,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 0.670267,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 5, 6, 19, 55, 32, 971259),
'log_count/DEBUG': 5,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2023, 5, 6, 19, 55, 32, 300992)}
2023-05-06 21:55:32 [scrapy.core.engine] INFO: Spider closed (finished)
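For reference, Scrapy only records item_scraped_count once at least one item has been scraped, so a stats dict without that key corresponds to zero items. The following is a minimal sketch of that reading; the helper name scraped_item_count is hypothetical, and stats_from_log is a hand-copied subset of the stats printed in the log above.

```python
def scraped_item_count(stats: dict) -> int:
    """Return the number of scraped items recorded in a Scrapy stats dict.

    Hypothetical helper: a missing 'item_scraped_count' key is read as
    zero items, because Scrapy only adds the key after the first item.
    """
    return stats.get("item_scraped_count", 0)


# Subset of the stats from the run in the log (copied by hand)
stats_from_log = {
    "downloader/request_count": 2,
    "downloader/response_status_count/200": 1,
    "downloader/response_status_count/404": 1,
    "finish_reason": "finished",
    "response_received_count": 2,
    "scheduler/dequeued": 1,
}

count = scraped_item_count(stats_from_log)
print(f"items scraped: {count}")  # prints "items scraped: 0"
```

Here the page itself was fetched (one 200 response), yet no item reached the stats, which suggests the problem lies between the response and the yield in parse rather than in the download.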
I tried renaming the variables, looked through settings.py for anything relevant, and checked for anything obviously wrong.
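One other way to check the number of objects, independent of the stats, might be to export the items to a file with something like `scrapy crawl bookspider -O books.json` and count the entries afterwards. The sketch below assumes a JSON feed; the file name books.json and the helper count_feed_items are placeholders, and treating a missing or empty file as zero items is a choice made here for illustration.

```python
import json
from pathlib import Path


def count_feed_items(path: str) -> int:
    """Count the items in a JSON feed file written by a crawl.

    Hypothetical helper: a missing or empty file is treated as zero items.
    """
    feed = Path(path)
    if not feed.exists() or feed.stat().st_size == 0:
        return 0
    with feed.open(encoding="utf-8") as fh:
        items = json.load(fh)  # a JSON feed is a single list of item dicts
    return len(items)


# "books.json" is an assumed output name, not one taken from the log
print(count_feed_items("books.json"))
```

For the first page of books.toscrape.com, a working spider would be expected to give 20 here, matching len(books) in the shell session.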