Scrapy conditional HTML values

Question

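The code below finds most of the elements I am looking for. However, the tags that hold the temperature and wind speed change depending on the severity of the weather. How can I make the code below consistently get the correct TempProb and wind speed values from the page?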

import scrapy

class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    start_urls = ['http://nflweather.com/']

    def parse(self, response):
        # Extract the content using XPath selectors (page-wide lists)
        Datetimes = response.xpath(
            '//div[@class="fw-bold text-wrap"]/text()').extract()
        awayTeams = response.xpath('//span[@class="fw-bold"]/text()').extract()
        homeTeams = response.xpath(
            '//span[@class="fw-bold ms-1"]/text()').extract()
        TempProbs = response.xpath(
            '//div[@class="mx-2"]/span/text()').extract()
        windspeeds = response.xpath(
            '//div[@class="text-break col-md-4 mb-1 px-1 flex-centered"]/span/text()').extract()
        # winddirection =

        # Combine the extracted lists row by row
        for item in zip(Datetimes, awayTeams, homeTeams, TempProbs, windspeeds):
            # Create a dictionary to store the scraped info
            scraped_info = {
                'Datetime': item[0],
                'awayTeam': item[1],
                'homeTeam': item[2],
                'TempProb': item[3],
                'windspeeds': item[4]
            }

            # Yield the scraped info to Scrapy
            yield scraped_info

Tags: python, scrapy

1 Answer

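Below is a modified version of the Scrapy code. I made some changes so that the temperature, probability, and wind speed are extracted more consistently, and I added comments explaining each part of the code: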

import scrapy

class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    start_urls = ['http://nflweather.com/']

    def parse(self, response):
        # Select one container per game using CSS selectors
        game_boxes = response.css('div.game-box')

        for game_box in game_boxes:
            # Extracting date and time information
            Datetimes = game_box.css('.col-12 .fw-bold::text').get()

            # Extracting team information
            team_game_boxes = game_box.css('.team-game-box')
            awayTeams = team_game_boxes.css('.fw-bold::text').get()
            homeTeams = team_game_boxes.css('.fw-bold.ms-1::text').get()

            # Extracting temperature and probability information
            TempProbs = game_box.css('.col-md-4 .mx-2 span::text').get()

            # Extracting wind speed information
            windspeeds = game_box.css('.col-md-4.mb-1 .text-danger::text').get()

            # Create a dictionary to store the scraped info
            scraped_info = {
                'Datetime': Datetimes,
                'awayTeam': awayTeams,
                'homeTeam': homeTeams,
                'TempProb': TempProbs,
                'windspeeds': windspeeds
            }

            # Yield the scraped info to Scrapy
            yield scraped_info
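
One reason to iterate over game boxes rather than zipping page-wide lists: zip() pairs values by position and stops at the shortest list, so a single game whose wind speed did not match the selector shifts every later value onto the wrong game. A minimal sketch with made-up team names and values (none of this data comes from nflweather.com):

```python
# Hypothetical page-wide results: three games, but only two wind-speed
# values matched because the middle game used a different tag.
teams = ["Bills", "Jets", "Bears"]
wind_speeds = ["5 mph", "22 mph"]  # the value for "Jets" did not match

# zip() pairs by position and stops at the shortest list, so "Jets"
# receives the value that belongs to "Bears", and "Bears" is dropped.
zipped_rows = list(zip(teams, wind_speeds))
print(zipped_rows)

# Hypothetical per-game results: each game carries its own values, so a
# missing wind speed only affects the game it belongs to.
games = [
    {"team": "Bills", "wind": "5 mph"},
    {"team": "Jets"},  # no wind value matched for this game
    {"team": "Bears", "wind": "22 mph"},
]
per_game_rows = [(game["team"], game.get("wind")) for game in games]
print(per_game_rows)
```

In the per-game form, a missing value shows up as None on the affected game only, which is easier to detect than a silently shifted column.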

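I made the team selectors more specific. Instead of matching team names across the whole page, the selectors are scoped to the "team-game-box" elements inside each game box: "fw-bold" for the away team and "fw-bold ms-1" for the home team.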

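For temperature and probability, I kept the selector essentially as it was (the span inside the "mx-2" div), assuming it still works with your updated HTML snippet. If the page structure changes, you may need to adjust this selector.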

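For wind speed, I changed the selector to target the element with the "text-danger" class inside the relevant div. This should make the extraction more consistent.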
