The script below collects weather data about 90% of the time. In rare cases, however, it fails for some reason even though the returned HTML is consistent with the other requests. Sometimes the code is identical and the request is identical, yet it still fails.
import scrapy


class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    # start_urls = ['http://nflweather.com/']

    def __init__(self, Week='', Year='', Game='', **kwargs):
        # Build the start URL from the command-line arguments
        self.start_urls = [f'https://nflweather.com/{Week}/{Year}/{Game}']
        self.Year = Year
        self.Game = Game
        super().__init__(**kwargs)
        print(self.start_urls)

    def parse(self, response):
        self.log(self.start_urls)
        # Extract the content using CSS selectors
        game_boxes = response.css('div.game-box')
        for game_box in game_boxes:
            # Date and time information
            Datetimes = game_box.css('.col-12 .fw-bold::text').get()
            # Team information
            team_game_boxes = game_box.css('.team-game-box')
            awayTeams = team_game_boxes.css('.fw-bold::text').get()
            homeTeams = team_game_boxes.css('.fw-bold.ms-1::text').get()
            # Temperature and probability information
            TempProbs = game_box.css('.col-md-4 .mx-2 span::text').get()
            # Wind speed and direction information
            windspeeds = game_box.css('.icon-weather + span::text').get()
            winddirection = game_box.css('.md-18 ::text').get()
            # Collect the scraped info in a dictionary
            scraped_info = {
                'Year': self.Year,
                'Game': self.Game,
                'Datetime': Datetimes.strip(),
                'awayTeam': awayTeams,
                'homeTeam': homeTeams,
                'TempProb': TempProbs,
                'windspeeds': windspeeds.strip(),
                'winddirection': winddirection.strip()
            }
            # Yield the scraped info to Scrapy
            yield scraped_info
These are the scrapy commands used to run the spider:
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-6 -o NFLWeather_2012_week_6.json
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-7 -o NFLWeather_2012_week_7.json
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-8 -o NFLWeather_2012_week_8.json
The week-6 crawl works perfectly, without any issues.
The week-7 crawl returns nothing:
ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-7> (referer: None)
Traceback (most recent call last):
File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
yield next(it)
The week-8 crawl retrieves 2 rows and then errors out on the rest:
ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-8> (referer: None)
Traceback (most recent call last):
File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
yield next(it)
Any idea why these runs fail while the others work without issue?
The error is in the windspeeds variable: sometimes the weather data is missing, so windspeeds ends up as None, and the windspeeds.strip() call made while building the dictionary then raises an AttributeError. Because parse() is a generator, that first exception also aborts the remaining iterations, which is why the week-8 crawl stops after two rows.
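A minimal reproduction of the failure, assuming a game box whose weather node is missing:

    >>> windspeeds = None  # what .get() returns when no node matches
    >>> windspeeds.strip()
    Traceback (most recent call last):
      ...
    AttributeError: 'NoneType' object has no attribute 'strip'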
You can fix this with a simple None check when creating the dictionary, or do the check earlier, whichever best fits your needs. Here is a working example:
scraped_info = {
    'Year': self.Year,
    'Game': self.Game,
    'Datetime': Datetimes.strip(),
    'awayTeam': awayTeams,
    'homeTeam': homeTeams,
    'TempProb': TempProbs,
    'windspeeds': windspeeds.strip() if windspeeds is not None else "TBD",
    'winddirection': winddirection.strip() if winddirection is not None else "TBD"
}
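As an alternative, Scrapy's .get() accepts a default argument, so the missing node can be handled at extraction time instead of when the dictionary is built. A minimal sketch, reusing the "TBD" placeholder from the example above:

    # Supply a default at extraction time so .strip() always receives a
    # string ("TBD".strip() is a no-op on the placeholder).
    windspeeds = game_box.css('.icon-weather + span::text').get(default='TBD')
    winddirection = game_box.css('.md-18 ::text').get(default='TBD')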
You will also notice that the "working" example you gave for week-6 will now contain more results than it did before.