I am using scrapy to crawl this website and scrape data from it.
I want the scraped data to have a nested structure, something like this:
{
    denomination: {
        date: {
            bondNumbers: [...]
        }
    }
}
This is the spider I wrote:
import scrapy

class Savings(scrapy.Spider):
    name = 'savings'
    start_urls = [
        'https://savings.gov.pk/download-draws/',
    ]

    def parse(self, response):
        for option in response.css('select option'):
            denomination = option.css('::text').get()
            url = option.css('::attr(value)').get()
            yield {
                denomination: response.follow(url, self.parseDrawList)
            }

    def parseDrawList(self, response):
        for a in response.css('select option'):
            date = a.css('::text').get()
            url = a.css('::attr(value)').get()
            yield {
                date: response.follow(url, self.parseDraw)
            }

    def parseDraw(self, response):
        yield {
            'bondNumbers': response.selector.re(r'\d{6}'),
        }
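Each function scrapes a different page in the site's page hierarchy (if we can call it that), so each level of the nested data structure is filled with data from a page at a different level.
This code does not work and gives me an error.
In all the tutorials and documentation I have seen, nobody uses scrapy to produce nested data structures.
Is there any way to get nested data out of scrapy? I would also like the solution not to sacrifice scrapy's concurrent execution of requests.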
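You need to collect the information in each callback and pass it on to the next callback, using either the request meta dict or the cb_kwargs argument of response.follow. In the final callback you can then build the fully nested structure and yield it as an item.
For example: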
import scrapy

class Savings(scrapy.Spider):
    name = 'savings'
    start_urls = [
        'https://savings.gov.pk/download-draws/',
    ]

    def parse(self, response):
        for option in response.css('select option'):
            denomination = option.css('::text').get()
            url = option.css('::attr(value)').get()
            yield response.follow(url, self.parseDrawList, cb_kwargs={'denomination': denomination})

    def parseDrawList(self, response, denomination=None):
        for a in response.css('tr td a'):
            date = a.css('::text').get()
            url = a.css('::attr(href)').get()
            yield response.follow(url, self.parseDraw, cb_kwargs={'denomination': denomination, "date": date})

    def parseDraw(self, response, denomination=None, date=None):
        yield {
            denomination: {
                date: {
                    'bondNumbers': response.selector.re(r'\d{6}')
                }
            }
        }
Example output:
2023-08-29 15:40:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk:443/wp-content/uploads/2017/03/16-02-2015Rs.1500.txt>
{'Rs. 1500/- Draws': {'16-02-2015': {'bondNumbers': ['749492', '457346', '692793', '914362', '000535', ...]}}}
2023-08-29 15:40:50 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://savings.gov.pk:443/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt> from <GET http://savings.gov.pk/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt>
2023-08-29 15:40:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://savings.gov.pk:443/wp-content/uploads/10-12-2021-Rs-25000-Premium.txt> (referer: None)
2023-08-29 15:40:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk:443/wp-content/uploads/10-12-2021-Rs-25000-Premium.txt>
{'Rs 25000/- Premium Bonds Draws': {'10-12-2021': {'bondNumbers': ['016253', '067408', '038203', '171265', '551833', '655804', '916353', '001858', '064668', '149237', '220908', '293362', '361338', '447697', '512113', '610773', ... ]}}}
2023-08-29 15:40:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://savings.gov.pk:443/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt> (referer: None)
2023-08-29 15:40:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk:443/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt>
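Each scraped item in the log above holds a single denomination and a single date. If one combined structure is wanted, the items can be merged after the crawl, for instance once they have been loaded back from a feed export. A minimal sketch in plain Python follows; merge_items is a hypothetical helper, not part of scrapy, and the sample items are shortened stand-ins shaped like the output above (the second date and its number are invented).

```python
from collections import defaultdict


def merge_items(items):
    """Merge per-draw items into one {denomination: {date: {...}}} dict."""
    merged = defaultdict(dict)
    for item in items:
        for denomination, dates in item.items():
            # Each item carries one date, but update() also copes with several.
            merged[denomination].update(dates)
    return dict(merged)


# Illustrative items; the '01-01-2000' entry is made up for the example.
items = [
    {'Rs. 1500/- Draws': {'16-02-2015': {'bondNumbers': ['749492', '457346']}}},
    {'Rs. 1500/- Draws': {'01-01-2000': {'bondNumbers': ['111111']}}},
    {'Rs 25000/- Premium Bonds Draws': {'10-12-2021': {'bondNumbers': ['016253']}}},
]
combined = merge_items(items)
```

Merging after the crawl, rather than inside the callbacks, keeps the spider stateless, so the order in which concurrent responses arrive does not matter.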