Scrapy KeyError and next-page URL not working

Question (votes: 0, answers: 1)

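I am trying to scrape using this page as the start URL: https://www.imdb.com/lists/tt0237478?ref_=tt_rls_sm. The page has 3 lists, and one of them has more than 100 items. My code only scrapes 100 items and does not fetch data from the next page. Please check what is wrong with the code.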

import scrapy
from urllib.parse import urljoin
class lisTopSpider(scrapy.Spider):
    name= 'ImdbListsSpider'
    allowed_domains = ['imdb.com']
    start_urls = [
        'https://www.imdb.com/lists/tt0237478'
    ]

    def parse(self, response):
        listsLinks = response.xpath('//div[2]/strong')
        for link in listsLinks:
            list_url = response.urljoin(link.xpath('.//a/@href').get())
            yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})

        next_page_url = response.xpath('//a[@class="flat-button next-page "]/@href').get()
        if next_page_url is not None:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(next_page_url, callback=self.parse)            

    def parse_list(self, response):
        list_url = response.meta['list_url']
        titles = response.xpath('//h3/a/@href').getall()

        next_page_url = response.xpath('//a[@class="flat-button lister-page-next next-page"]/@href').get()
        if next_page_url is not None:
            next_page_url = urljoin('https://www.imdb.com',next_page_url)
            print('here is next page url')
            print(next_page_url)
            yield scrapy.Request(next_page_url, callback=self.parse_list)  

        yield{
            'listurl': list_url,
            'titles': titles,
        }

Here is the error message:

2020-05-06 21:09:29 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.imdb.com/list/ls055923961/?page=2> (referer: https://www.imdb.com/list/ls055923961/)
Traceback (most recent call last):
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\utils\defer.py", line 117, in iter_errback
    yield next(it)
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\utils\python.py", line 345, in __next__   
    return next(self.data)
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\utils\python.py", line 345, in __next__   
    return next(self.data)
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 338, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python projects\scrapy\imdb_project\virenv\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Python Projects\Scrapy\imdb_project\imdb_project\spiders\TopLists.py", line 29, in parse_list
    list_url = response.meta['list_url']
KeyError: 'list_url'
scrapy keyerror
1 answer
3 votes

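In your parse method you use Request.meta to pass list_url to the parse_list method, but you forgot to do the same in the Request call inside parse_list that fetches the next page. The page-2 request therefore carries no list_url in its meta, and response.meta['list_url'] raises the KeyError. Just add meta={'list_url': list_url} to the Request inside parse_list and it should work fine.

So the next-page handling in parse_list should look like this: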

if next_page_url is not None:
    next_page_url = urljoin('https://www.imdb.com', next_page_url)
    yield scrapy.Request(next_page_url, callback=self.parse_list, meta={'list_url': list_url})

By the way, since Scrapy 1.7 the preferred way to pass user information to a callback is Request.cb_kwargs (see the "Caution" note in the official documentation).
