A Python Scrapy function that works every time

Problem description (votes: 0, answers: 1)

The script below collects weather data successfully about 90% of the time. In rare cases, however, it fails for some reason, even though the HTML is the same as for other requests. Sometimes the code is identical and the request is identical, yet it still fails.

class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    # start_urls = ['http://nflweather.com/']

    def __init__(self, Week='', Year='', Game='', **kwargs):
        self.start_urls = [f'https://nflweather.com/{Week}/{Year}/{Game}']  # py36
        self.Year = Year
        self.Game = Game
        super().__init__(**kwargs)
        print(self.start_urls)  # python3

    def parse(self, response):
        self.log(self.start_urls)
        #self.log(self.Year)
        # Extracting the content using css selectors
        game_boxes = response.css('div.game-box')

        for game_box in game_boxes:
            # Extracting date and time information
            Datetimes = game_box.css('.col-12 .fw-bold::text').get()

            # Extracting team information
            team_game_boxes = game_box.css('.team-game-box')
            awayTeams = team_game_boxes.css('.fw-bold::text').get()
            homeTeams = team_game_boxes.css('.fw-bold.ms-1::text').get()

            # Extracting temperature and probability information
            TempProbs = game_box.css('.col-md-4 .mx-2 span::text').get()

            # Extracting wind speed information
            windspeeds = game_box.css('.icon-weather + span::text').get()
            winddirection = game_box.css('.md-18 ::text').get()

            # Create a dictionary to store the scraped info
            scraped_info = {
                'Year': self.Year,
                'Game': self.Game,
                'Datetime': Datetimes.strip(),
                'awayTeam': awayTeams,
                'homeTeam': homeTeams,
                'TempProb': TempProbs,
                'windspeeds': windspeeds.strip(),
                'winddirection': winddirection.strip()
            }

            # Yield or give the scraped info to Scrapy
            yield scraped_info

These are the scrapy commands used to run the spider:

scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-6 -o NFLWeather_2012_week_6.json   
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-7 -o NFLWeather_2012_week_7.json
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-8 -o NFLWeather_2012_week_8.json
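For a longer run of weeks, the per-week invocations above could be generated from a short script instead of typed by hand. This is just a sketch; the `crawl_week` helper is illustrative, and the commented-out `subprocess.run` call assumes it is executed from inside the Scrapy project directory:

```python
import subprocess

def crawl_week(year: int, game: str) -> list:
    """Build the scrapy command line for one week, matching the arguments above."""
    return [
        "scrapy", "crawl", "NFLWeatherData",
        "-a", "Week=week",
        "-a", f"Year={year}",
        "-a", f"Game={game}",
        "-o", f"NFLWeather_{year}_{game.replace('-', '_')}.json",
    ]

# Run weeks 6 through 8 of the 2012 season back to back
for week in range(6, 9):
    cmd = crawl_week(2012, f"week-{week}")
    # subprocess.run(cmd, check=True)  # uncomment inside the Scrapy project dir
    print(" ".join(cmd))
```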

The week-6 crawl works perfectly with no issues.

The week-7 crawl returns nothing:

ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-7> (referer: None)
Traceback (most recent call last):
  File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
    yield next(it)

The week-8 crawl retrieves 2 rows and raises errors for the rest:

ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-8> (referer: None)
Traceback (most recent call last):
  File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
    yield next(it)

Any idea why these runs fail while the others work without issue?

python scrapy

1 Answer (2 votes)

The error is in the windspeeds variable: sometimes the weather data is missing, so windspeeds will be None, and when you then build the dictionary, the call to windspeeds.strip() throws an exception.

You can solve this with a simple None check while creating the dictionary, or you can do the check earlier, whichever best suits your needs. Here is a working example:

scraped_info = {
    'Year': self.Year,
    'Game': self.Game,
    'Datetime': Datetimes.strip(),
    'awayTeam': awayTeams,
    'homeTeam': homeTeams,
    'TempProb': TempProbs,
    'windspeeds': windspeeds.strip() if windspeeds is not None else "TBD",
    'winddirection': winddirection.strip() if winddirection is not None else "TBD"
}

You will also notice that the "working" example you gave for week-6 will now contain more results than before.
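The same None check can be factored into a small helper so that every scraped field (including Datetimes, which has the same weakness) gets identical treatment. This is only a sketch; the safe_strip name and the "TBD" default are illustrative, not part of the original answer:

```python
def safe_strip(value, default="TBD"):
    """Strip a scraped string, falling back to a default when the selector returned None."""
    return value.strip() if value is not None else default

print(safe_strip("  5 mph  "))  # -> "5 mph"
print(safe_strip(None))         # -> "TBD"
```

Alternatively, Scrapy's own `.get()` accepts a `default=` argument (e.g. `game_box.css('...').get(default='TBD')`), which avoids the None entirely, though you would still need `.strip()` guarded or applied afterwards.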
