I'm learning Scrapy. Take the site http://quotes.toscrape.com as an example. I'm creating a simple spider (scrapy genspider quotes). I want to parse each quote, and also follow the link to the author's page and parse their date of birth. I'm trying to do it like this, but nothing works.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        item = {}
        for quote in quotes:
            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            response.follow(url, self.parse_additional_page, item)
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)

    def parse_additional_page(self, response, item):
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item
The code without the birth date (which works correctly):
import scrapy


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'name': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//small[@class="author"]/text()').get(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall(),
            }
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)
Question: how do I go to each quote's author page and parse the date of birth?
Use scrapy.Request:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        item = {}
        for quote in quotes:
            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            # HERE
            yield scrapy.Request(response.urljoin(url), self.parse_additional_page, meta={'item': item})
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)

    def parse_additional_page(self, response):  # HERE
        item = response.meta['item']  # HERE
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item
Output:
[{'name': '“A day without sunshine is like, you know, night.”',
'author': 'Steve Martin',
'tags': ['humor', 'obvious', 'simile'],
'additional_data': 'November 22, 1869'},
{'name': '“A day without sunshine is like, you know, night.”',
'author': 'Steve Martin',
'tags': ['humor', 'obvious', 'simile'],
'additional_data': 'June 01, 1926'},
{'name': '“A day without sunshine is like, you know, night.”',
'author': 'Steve Martin',
'tags': ['humor', 'obvious', 'simile'],
'additional_data': 'October 11, 1884'}]
You were actually very close to getting it right. You are only missing a couple of things, and one thing needs to be moved.
response.follow returns a Request object, so unless you yield that Request object, it will never be dispatched by the scrapy engine.
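The yield requirement can be illustrated with a minimal plain-Python sketch (no scrapy needed): a spider callback is a generator, and the engine only ever receives what the generator yields. The make_request helper below is hypothetical and merely stands in for response.follow, which returns a Request object.

```python
def make_request(url):
    """Stands in for response.follow(), which returns a Request object."""
    return {"url": url}

def parse_dropping_requests():
    # Request is created but never yielded, so it is silently discarded,
    # just like the author-page request in the original spider.
    make_request("/author/Steve-Martin")
    # The pagination request IS yielded, so only it reaches the "engine".
    yield make_request("/page/2/")

def parse_yielding_requests():
    # Both requests are yielded, so both reach the "engine".
    yield make_request("/author/Steve-Martin")
    yield make_request("/page/2/")

print([r["url"] for r in parse_dropping_requests()])
# ['/page/2/']
print([r["url"] for r in parse_yielding_requests()])
# ['/author/Steve-Martin', '/page/2/']
```

This mirrors the original bug exactly: pagination kept working (its request was yielded), while the author pages were never fetched.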
When passing objects from one callback to another, you should use the cb_kwargs parameter. Using the meta dictionary works too, but scrapy officially prefers cb_kwargs. Simply passing the object as a positional argument, however, will not work.
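As a rough sketch of how cb_kwargs reaches the callback (plain Python, no scrapy; the "fake-response" string and the hard-coded date are placeholders): the engine effectively unpacks the cb_kwargs dict into keyword arguments when it invokes the callback with the response, which is why the callback declares item as a keyword parameter.

```python
def parse_additional_page(response, item=None):
    # In the real spider this value would come from an XPath query.
    item["additional_data"] = "November 22, 1869"
    return item

cb_kwargs = {"item": {"author": "Steve Martin"}}
# The engine effectively does: callback(response, **request.cb_kwargs)
result = parse_additional_page("fake-response", **cb_kwargs)
print(result)
# {'author': 'Steve Martin', 'additional_data': 'November 22, 1869'}
```

This also shows why a positional argument fails: scrapy only ever calls the callback as callback(response, **cb_kwargs), so extra positional parameters are never supplied.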
A dict is mutable, and that includes when it is used as a scrapy item. So when you construct scrapy items, each one should be a unique object. Otherwise, when you update the item later, you may end up mutating items you have already yielded.
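The mutation problem can be demonstrated in plain Python, assuming a simplified two-author loop in place of the real XPath extraction: reusing one dict means every "item" is the same object, so each update overwrites all previously collected entries.

```python
# One dict created OUTSIDE the loop, as in the original spider:
shared = {}
items = []
for author in ["Steve Martin", "Marilyn Monroe"]:
    shared["author"] = author
    items.append(shared)          # every entry is the SAME dict object
print(items)
# [{'author': 'Marilyn Monroe'}, {'author': 'Marilyn Monroe'}]

# A fresh dict created INSIDE the loop, as in the fixed spider:
unique_items = []
for author in ["Steve Martin", "Marilyn Monroe"]:
    item = {"author": author}     # unique object per iteration
    unique_items.append(item)
print(unique_items)
# [{'author': 'Steve Martin'}, {'author': 'Marilyn Monroe'}]
```

This is exactly why the first answer's output repeats the same quote with different birth dates: all the yielded items were one shared dict, overwritten on every loop iteration.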
Here is an example that uses your code but implements the three points above.
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            # moving the item constructor inside the loop
            # means it will be unique for each item
            item = {}
            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            # you have to yield the request returned by response.follow
            yield response.follow(url, self.parse_additional_page, cb_kwargs={"item": item})
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page)

    def parse_additional_page(self, response, item=None):
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item
Partial output:
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Martin-Luther-King-Jr/>
{'name': '“Only in the darkness can you see the stars.”', 'author': 'Martin Luther King Jr.', 'tags': ['hope', 'inspirational'], 'additional_data': 'January 15, 1929'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/C-S-Lewis/>
{'name': '“You can never get a cup of tea large enough or a book long enough to suit me.”', 'author': 'C.S. Lewis', 'tags': ['books', 'inspirational', 'reading', 'tea'], 'additional_data': 'November 29, 1898'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/George-R-R-Martin/>
{'name': '“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”', 'author': 'George R.R. Martin', 'tags': ['read', 'readers', 'reading', 'reading-books'], 'additional_data': 'September 20, 1948'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/James-Baldwin/>
{'name': '“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”', 'author': 'James Baldwin', 'tags': ['love'], 'additional_data': 'August 02, 1924'}
See Passing additional data to callback functions and Response.follow in the scrapy docs for more information.