Scrapy CrawlSpider in Python throws the error "'str' object has no attribute 'iter'"

Question · votes: 0 · answers: 1

I'm getting a web-scraping error that I don't understand. I've been stuck on this code for more than 3 days. Could someone help me figure out this problem?

Here is my error message:

2024-03-15 14:01:18 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: au_SQL)
2024-03-15 14:01:18 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 23.10.0, Python 3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 24.0.0 (OpenSSL 3.0.13 30 Jan 2024), cryptography 42.0.2, Platform Windows-10-10.0.22631-SP0
2024-03-15 14:01:18 [scrapy.addons] INFO: Enabled addons:
[]
2024-03-15 14:01:18 [asyncio] DEBUG: Using selector: SelectSelector
2024-03-15 14:01:18 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-03-15 14:01:18 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-03-15 14:01:18 [scrapy.extensions.telnet] INFO: Telnet Password: a2cf6e5dcb32ce55
2024-03-15 14:01:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-03-15 14:01:18 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'au_SQL',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'au_SQL.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['au_SQL.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-03-15 14:01:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-15 14:01:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-15 14:01:18 [scrapy.middleware] INFO: Enabled item pipelines:
['au_SQL.pipelines.SQLlitePipeline']
2024-03-15 14:01:18 [scrapy.core.engine] INFO: Spider opened
2024-03-15 14:01:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-15 14:01:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-15 14:01:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.auct.co.th/robots.txt> (referer: None)
2024-03-15 14:01:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.auct.co.th/products> (referer: None)
2024-03-15 14:01:19 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.auct.co.th/products> (referer: None)
Traceback (most recent call last):
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\utils\defer.py", line 295, in aiter_errback
    yield await it.__anext__()
          ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\utils\python.py", line 374, in __anext__
    return await self.data.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\utils\python.py", line 355, in _async_chain
    async for o in as_async_generator(it):
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\utils\asyncgen.py", line 14, in as_async_generator
    async for r in it:
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\utils\python.py", line 374, in __anext__
    return await self.data.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\utils\python.py", line 355, in _async_chain
    async for o in as_async_generator(it):
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\utils\asyncgen.py", line 14, in as_async_generator
    async for r in it:
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\core\spidermw.py", line 118, in process_async
    async for r in iterable:
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 31, in process_spider_output_async
    async for r in result or ():
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\core\spidermw.py", line 118, in process_async
    async for r in iterable:
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\spidermiddlewares\referer.py", line 355, in process_spider_output_async
    async for r in result or ():
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\core\spidermw.py", line 118, in process_async
    async for r in iterable:
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 30, in process_spider_output_async
    async for r in result or ():
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\core\spidermw.py", line 118, in process_async
    async for r in iterable:
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\spidermiddlewares\depth.py", line 35, in process_spider_output_async
    async for r in result or ():
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\core\spidermw.py", line 118, in process_async
    async for r in iterable:
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\spiders\crawl.py", line 128, in _parse_response
    for request_or_item in self._requests_to_follow(response):
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\spiders\crawl.py", line 98, in _requests_to_follow
    for lnk in rule.link_extractor.extract_links(response)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 248, in extract_links
    links = self._extract_links(doc, response.url, response.encoding, base_url)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 227, in _extract_links
    return self.link_extractor._extract_links(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 82, in _extract_links
    for el, attr, attr_val in self._iter_links(selector.root):
  File "C:\Users\ASUS\anaconda3\envs\virtual\Lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 70, in _iter_links
    for el in document.iter(etree.Element):
              ^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'iter'
2024-03-15 14:01:19 [scrapy.core.engine] INFO: Closing spider (finished)
2024-03-15 14:01:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 456,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 25030,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 0.408174,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 3, 15, 7, 1, 19, 158718, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 96132,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 5,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/AttributeError': 1,
 'start_time': datetime.datetime(2024, 3, 15, 7, 1, 18, 750544, tzinfo=datetime.timezone.utc)}

One of the lines says "'str' object has no attribute 'iter'". I'm not sure whether it means that the wrong data type is being passed somewhere.

Here is my code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AuctionSpider(CrawlSpider):
    name = "auction"
    allowed_domains = ["auct.co.th"]
    start_urls = ["https://www.auct.co.th/products"]

    rules = (Rule(LinkExtractor(restrict_xpaths="//div[@class='p-2 card']/text()"), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {
            'auction_date': response.xpath("//b[@id ='product_auction_date']/text()").get(),
            'price_start': response.xpath("//b[@id ='product_price_start']/text()").get(),
            'order': response.xpath("//b[@id ='product_order']/b/text()").get(),
            'product_title': response.xpath("//div[@class ='col-md-12']/b/text()").get(),
            'product_regis_id': response.xpath("//div[@class ='col-sm-12 col-md-12 col-xl-12']/b/text()").get(),
            'total_drive': response.xpath("//b[@id='product_total_drive']/text()").get(),
            'product_gear': response.xpath("//b[@id='product_gear']/text()").get(),
            'product_color': response.xpath("//b[@id='product_color']/text()").get(),
            'cc': response.xpath("//b[@id='product_engin_cc']/text()").get(),
            'regis_year': response.xpath("//b[@id='product_regis_year']/text()").get(),
            'build_year': response.xpath("//b[@id='product_build_year']/text()").get(),
            'gas_type': response.xpath("//b[@id='product_gas_type']/text()").get(),
            'vin_no': response.xpath("//b[@id='product_body_number']/text()").get(),
            'engine_no': response.xpath("//b[@id='product_engin_number']/text()").get(),
            'endtax': response.xpath("//b[@id='product_endtax']/text()").get(),
            'stock': response.xpath("//b[@id='product_oderstock']/text()").get(),
            'price': response.xpath("//b[@id='product_price_other']/text()").get(),
            'gadget': response.xpath("//b[@id='product_gadget']/text()").get(),
            'remark': response.xpath("//b[@id='product_remark']/text()").get(),
        #item["name"] = response.xpath('//div[@id="name"]').get()
        #item["description"] = response.xpath('//div[@id="description"]').get(
        }

I would appreciate any help.

Thank you very much.

python python-3.x visual-studio-code web-scraping scrapy
1 Answer

0 votes

This happens because your XPath specifically selects a string (/text()):

restrict_xpaths="//div[@class='p-2 card']/text()"
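
To see concretely why this fails, here is a minimal sketch (with a made-up HTML snippet) using parsel, the selector library Scrapy uses under the hood: a /text() selection yields a Selector whose .root is a plain str, and that str is the object LinkExtractor later tries to call .iter() on.

from parsel import Selector  # parsel is the selector library Scrapy uses internally

html = "<div class='p-2 card'>label <a href='/products/1'>item</a></div>"
sel = Selector(text=html)

element = sel.xpath("//div[@class='p-2 card']")[0]
text_node = sel.xpath("//div[@class='p-2 card']/text()")[0]

print(type(element.root))    # an lxml HTML element -> has .iter()
print(type(text_node.root))  # <class 'str'> -> no .iter(), hence the AttributeError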

You need to replace it with an XPath selector that points to the actual tags containing the links, for example:

rules = (Rule(LinkExtractor(restrict_xpaths="//div[@class='p-2 card']//a"), callback="parse_item", follow=True),)
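
If you want to verify the corrected selector before re-running the whole crawl, a quick check in scrapy shell against the same start URL might look like the sketch below; the actual number of links depends on what the page serves at the time.

# Run: scrapy shell https://www.auct.co.th/products
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_xpaths="//div[@class='p-2 card']//a")
links = le.extract_links(response)  # 'response' is provided by scrapy shell
print(len(links))   # should print a count instead of raising AttributeError
print(links[:3])    # a few Link objects with the extracted URLs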

For some reason there were no products for me, so the output I got was:

{'auction_date': None, 'price_start': None, 'order': None, 'product_title': '-', 'product_regis_id': '-', 'total_drive': None, 'product_gear': None, 'product_color': None, 'cc': None, 'regis_year': None, 'build_year': None, 'gas_type': None, 'vin_no': None, 'engine_no': None, 'endtax': None, 'stock': None, 'price': None, 'gadget': None, 'remark': None}