Getting `scrapy` to yield a nested data structure


I am using `scrapy` to crawl this website and scrape data from it.

I want the scraped data to have a nested structure, something like this:

{
   denomination: {
      date: {
        bondNumbers: [...]
      }
   }
}

Here is the spider I wrote:

import scrapy

class Savings(scrapy.Spider):
    name        = 'savings'
    start_urls  = [
        'https://savings.gov.pk/download-draws/',
    ]

    def parse(self, response):

        for option in response.css('select option'):
            denomination = option.css('::text').get()
            url          = option.css('::attr(value)').get()
            yield {
                denomination: response.follow(url, self.parseDrawList)
            }


    def parseDrawList(self, response):

        for a in response.css('select option'):            
            date = a.css('::text').get()
            url  = a.css('::attr(value)').get()
            yield {
                date: response.follow(url, self.parseDraw)
            }


    def parseDraw(self, response):
        yield {
            'bondNumbers': response.selector.re(r'\d{6}'),
        }

Each callback scrapes a different page in the site's page hierarchy (if we can call it that), so each level of the nested data structure would be populated with data from a different level of pages.

This code does not work and gives me an error.

In none of the tutorials or documentation I have seen does anyone use `scrapy` to yield nested data structures.

Is there any way to get nested data out of `scrapy`? I would also like the solution not to sacrifice `scrapy`'s concurrent execution of requests.

python web-scraping scrapy web-crawler generator
1 Answer

You need to take the information from each callback and pass it along to the next one, using either the request meta dict or the cb_kwargs argument of response.follow. Then, in the final callback, you can construct the fully nested structure and yield it as an item.

For example:

import scrapy

class Savings(scrapy.Spider):
    name        = 'savings'
    start_urls  = [
        'https://savings.gov.pk/download-draws/',
    ]

    def parse(self, response):
        # First level: each <option> is a denomination; pass its label
        # forward to the next callback via cb_kwargs.
        for option in response.css('select option'):
            denomination = option.css('::text').get()
            url          = option.css('::attr(value)').get()
            yield response.follow(url, self.parseDrawList, cb_kwargs={'denomination': denomination})

    def parseDrawList(self, response, denomination=None):
        # Second level: the draw list; forward both denomination and date.
        for a in response.css('tr td a'):
            date = a.css('::text').get()
            url  = a.css('::attr(href)').get()
            yield response.follow(url, self.parseDraw, cb_kwargs={'denomination': denomination, "date": date})

    def parseDraw(self, response, denomination=None, date=None):
        # Final level: assemble the fully nested item from the
        # accumulated callback kwargs.
        yield {
            denomination: {
                date: {
                    'bondNumbers': response.selector.re(r'\d{6}')
                }
            }
        }
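
The answer also mentions the request meta dict as an alternative to cb_kwargs. Below is a minimal sketch of the same spider rewritten to use meta; it is functionally equivalent here, though cb_kwargs is generally preferred in current Scrapy since the values arrive as named callback arguments (the class and spider names in the sketch are illustrative):

import scrapy

class SavingsMeta(scrapy.Spider):
    # Same logic as above, but data rides along on request.meta and is
    # read back from response.meta in each callback.
    name       = 'savings_meta'
    start_urls = ['https://savings.gov.pk/download-draws/']

    def parse(self, response):
        for option in response.css('select option'):
            denomination = option.css('::text').get()
            url          = option.css('::attr(value)').get()
            yield response.follow(url, self.parseDrawList, meta={'denomination': denomination})

    def parseDrawList(self, response):
        denomination = response.meta['denomination']
        for a in response.css('tr td a'):
            date = a.css('::text').get()
            url  = a.css('::attr(href)').get()
            yield response.follow(url, self.parseDraw, meta={'denomination': denomination, 'date': date})

    def parseDraw(self, response):
        yield {
            response.meta['denomination']: {
                response.meta['date']: {
                    'bondNumbers': response.selector.re(r'\d{6}')
                }
            }
        }

Either version produces the same items as the sample run below.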

Sample output

2023-08-29 15:40:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk:443/wp-content/uploads/2017/03/16-02-2015Rs.1500.txt>
{'Rs. 1500/- Draws': {'16-02-2015': {'bondNumbers': ['749492', '457346', '692793', '914362', '000535', ...]}}}
2023-08-29 15:40:50 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://savings.gov.pk:443/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt> from <GET http://savings.gov.pk/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt>
2023-08-29 15:40:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://savings.gov.pk:443/wp-content/uploads/10-12-2021-Rs-25000-Premium.txt> (referer: None)
2023-08-29 15:40:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk:443/wp-content/uploads/10-12-2021-Rs-25000-Premium.txt>
{'Rs 25000/- Premium Bonds Draws': {'10-12-2021': {'bondNumbers': ['016253', '067408', '038203', '171265', '551833', '655804', '916353', '001858', '064668', '149237', '220908', '293362', '361338', '447697', '512113', '610773', ... ]}}}
2023-08-29 15:40:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://savings.gov.pk:443/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt> (referer: None)
2023-08-29 15:40:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk:443/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt>
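
If a single fully nested object is wanted rather than one item per draw (as in the structure sketched in the question), the per-draw items can be merged as they arrive, for example in a small item pipeline. This is a hypothetical sketch, not part of the original answer; NestedSavingsPipeline and the draws.json output path are illustrative names. Because the merge happens per item, it does not sacrifice Scrapy's concurrent request execution:

import json

class NestedSavingsPipeline:
    # Hypothetical pipeline: deep-merges the {denomination: {date: {...}}}
    # items emitted by the spider into one nested dict, written out when
    # the spider closes.

    def open_spider(self, spider):
        self.data = {}

    def process_item(self, item, spider):
        for denomination, dates in item.items():
            self.data.setdefault(denomination, {}).update(dates)
        return item

    def close_spider(self, spider):
        with open('draws.json', 'w') as f:
            json.dump(self.data, f, indent=2)

Enable it in settings.py with ITEM_PIPELINES = {'myproject.pipelines.NestedSavingsPipeline': 300} (the module path is illustrative).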