I found this answer for my problem: "TypeError: set_user_agent() takes 2 positional arguments but 3 were given", but I don't understand how to apply it to my code.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['imdb.com']

    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..'

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.imdb.com/search/title/?groups=top_250&sort=user_rating',
            headers={'User-Agent': self.user_agent},
        )

    rules = (
        # Follow each movie link and parse the detail page
        Rule(LinkExtractor(restrict_xpaths="//h3[@class='lister-item-header']/a"),
             callback='parse_item', follow=True, process_request='set_user_agent'),
        # Follow the "next page" link
        Rule(LinkExtractor(restrict_xpaths="(//a[@class='lister-page-next next-page'])[2]"),
             process_request='set_user_agent'),
    )

    def set_user_agent(self, request):
        request.headers['User-Agent'] = self.user_agent
        return request

    def parse_item(self, response):
        yield {
            'title': response.xpath("//div[@class='title_wrapper']/h1/text()").get(),
            'year': response.xpath("//span[@id='titleYear']/a/text()").get(),
            'duration': response.xpath("normalize-space((//time)[1]/text())").get(),
            'genre': response.xpath("//div[@class='subtext']/a[1]/text()").get(),
            'rating': response.xpath("//span[@itemprop='ratingValue']/text()").get(),
            'movie_url': response.url,
            # Header name is 'User-Agent' (hyphen), matching the key set above
            'user_agent': response.request.headers['User-Agent'],
        }
Error: set_user_agent() takes 2 positional arguments but 3 were given
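
One likely cause, assuming Scrapy 2.0 or later: the callable named in a Rule's process_request is called with both the request and the response it was extracted from, so a method that accepts only `request` receives one argument too many. Below is a minimal sketch of the adjusted method. `UserAgentMixin` and the `SimpleNamespace` stand-ins are hypothetical and exist only so the snippet runs without Scrapy installed; in the spider itself, only the signature of `set_user_agent` would change.

```python
from types import SimpleNamespace


class UserAgentMixin:
    # Hypothetical holder class; in the real spider this lives on BestMoviesSpider
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..'

    # Accept the extra `response` argument. Defaulting it to None keeps the
    # method callable with a single argument as well.
    def set_user_agent(self, request, response=None):
        request.headers['User-Agent'] = self.user_agent
        return request


# Stand-ins for Scrapy's Request/Response objects (assumption: only .headers is needed here)
fake_request = SimpleNamespace(headers={})
fake_response = SimpleNamespace(url='https://www.imdb.com/')

spider = UserAgentMixin()
two_args = spider.set_user_agent(fake_request, fake_response)  # called with request and response
one_arg = spider.set_user_agent(SimpleNamespace(headers={}))   # called with request only
```

With this signature the `rules` tuple can stay as it is, since it refers to the method by name.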