Web scraping elements: output problem

Question (votes: 0, answers: 1)

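I am trying to scrape the number of episodes, the number of seasons, the runtime, the country of origin and the language from the IMDb website. This is the code I am using.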

import requests, json
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.imdb.com/chart/toptv/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}

soup = BeautifulSoup(requests.get(url,headers=headers).content,"html.parser")

def extract_additional_info(link):
    additional_info = {}
    soup = BeautifulSoup(requests.get(link, headers=headers).content, "html.parser")
    
    # Number episodes
    episode_element = soup.select_one('[data-testid="episodes-header"]')
    if episode_element:
        additional_info['Numero episodi'] = episode_element.get_text().split('Episodes')[1]
        print(additional_info['Numero episodi'])
    else:
        additional_info['Numero episodi'] = None
            
    # Number seasons
    season_select = soup.find('select', id='browse-episodes-season')
    if season_select:
      aria_label = season_select['aria-label']
      additional_info['Numero stagioni'] = aria_label.split()[0] if aria_label else None
    # 1 season
    season_element = soup.select_one('[data-testid="episodes-browse-episodes"]')
    if season_element:
      stagione = season_element.find('span', class_='ipc-btn__text').get_text()
      additional_info['Numero stagioni'] = stagione.split()[0] if stagione else None
    print(additional_info['Numero stagioni'])

    # Duration episode
    additional_info['Durata per episodio'] = soup.select_one('[data-testid="title-techspec_runtime"]').get_text().split('Runtime')[1]
    additional_info['Durata per episodio']= additional_info['Durata per episodio'].split('minutes')[0]
    print(additional_info['Durata per episodio'])
    
    # Country of origin
    additional_info['Paese origine'] = soup.select_one('[data-testid="title-details-origin"]').get_text().split('of origin')[1]
    print(additional_info['Paese origine'])

    # Original language
    additional_info['Lingua originale'] = soup.select_one('[data-testid="title-details-languages"]').get_text()
    print(additional_info['Lingua originale'])

    return additional_info

data = []

for link in soup.select('a[href^="/title"]:has(h3)')[:25]:
    serie_info = json.loads(
        BeautifulSoup(
            requests.get(f'https://www.imdb.com{link.get("href")}', headers=headers).content,
            "html.parser"
        ).find("script", {"type": "application/ld+json"}).text
    )
    
    additional_info = extract_additional_info(f'https://www.imdb.com{link.get("href")}')
    if additional_info is not None:
        serie_info.update(additional_info)
    
    data.append(serie_info)

dati = pd.json_normalize(data)

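I have a problem scraping the number of seasons, because when a TV show has only one season this information sits at a different path. So I tried to use an "if", but it does not work: instead of the text "1 Season" it finds "TopTop-rated".

The HTML for this information is: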

<div class="sc-e1ed1839-0 fRlQPg episodes-browse-episodes" data-testid="episodes-browse-episodes">
  <div class="sc-e1ed1839-1 ihQgyV">Browse episodes</div>
  <div class="sc-e1ed1839-4 jngMUB">
    <a class="ipc-btn ipc-btn--single-padding ipc-btn--center-align-content ipc-btn--default-height ipc-btn--core-base ipc-btn--theme-base ipc-btn--on-accent2 ipc-text-button" role="button" tabindex="0" aria-disabled="false" href="/title/tt5491994/episodes/?topRated=DESC&amp;ref_=tt_eps_rhs_sm">
      <span class="ipc-btn__text"><span class="sc-e1ed1839-3 hWiAfx">Start</span><span class="sc-e1ed1839-2 cloUKk">Most voted</span></span>
    </a>
    <a class="ipc-btn ipc-btn--single-padding ipc-btn--center-align-content ipc-btn--default-height ipc-btn--core-base ipc-btn--theme-base ipc-btn--on-accent2 ipc-text-button" role="button" tabindex="0" aria-disabled="false" href="/title/tt5491994/episodes?season=1&amp;ref_=tt_eps_sn_1">
      <span class="ipc-btn__text">1 Season</span>
    </a>
    <span class="ipc-simple-select__container">
      <span class="ipc-simple-select ipc-simple-select--base ipc-simple-select--on-accent2">
        <label for="browse-episodes-year" class="ipc-simple-select__label">2 years</label>
        <svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" class="ipc-icon ipc-icon--arrow-drop-down ipc-simple-select__icon--post" viewBox="0 0 24 24" fill="currentColor" role="presentation"><path fill="none" d="M0 0h24v24H0V0z"></path><path d="M8.71 11.71l2.59 2.59c.39.39 1.02.39 1.41 0l2.59-2.59c.63-.63.18-1.71-.71-1.71H9.41c-.89 0-1.33 1.08-.7 1.71z"></path></svg>
        <select id="browse-episodes-year" aria-label="2 years" class="ipc-simple-select__input">
          <option selected="" value=""></option>
          <option value="2017">2017</option>
          <option value="2016">2016</option>
          <option value="SEE_ALL">View all</option>
        </select>
      </span>
    </span>
  </div>
</div>
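
A likely reason for the "TopTop-rated" result: inside the `episodes-browse-episodes` block, `find('span', class_='ipc-btn__text')` returns the first matching span, and that one belongs to the top-rated link rather than the season link. The block also seems to be present on multi-season pages, so the second `if` overwrites whatever the `<select>` branch found. The sketch below is one possible way around this, not a verified fix: it selects the link whose `href` contains `season=`. The helper name `count_seasons`, the trimmed sample markup, and the assumption that the multi-season `aria-label` looks like "5 seasons" are all illustrative and may not match the live page.

```python
import re
from bs4 import BeautifulSoup


def count_seasons(soup):
    """Return the number of seasons as an int, or None if it cannot be found."""
    # Multi-season pages: assume the <select> has an aria-label like "5 seasons".
    select = soup.find("select", id="browse-episodes-season")
    if select is not None and select.get("aria-label"):
        match = re.match(r"\d+", select["aria-label"])
        if match:
            return int(match.group())

    # Single-season pages: take the link that points at a season,
    # not the first button in the block (which is the top-rated link).
    link = soup.select_one(
        '[data-testid="episodes-browse-episodes"] a[href*="season="]'
    )
    if link is not None:
        match = re.search(r"\d+", link.get_text())
        if match:
            return int(match.group())
    return None


# Trimmed, hypothetical markup modelled on the snippet in the question.
single_season_html = """
<div data-testid="episodes-browse-episodes">
  <a href="/title/tt5491994/episodes/?topRated=DESC">
    <span class="ipc-btn__text"><span>Top</span><span>Top-rated</span></span>
  </a>
  <a href="/title/tt5491994/episodes?season=1">
    <span class="ipc-btn__text">1 Season</span>
  </a>
</div>
"""
multi_season_html = '<select id="browse-episodes-season" aria-label="5 seasons"></select>'

print(count_seasons(BeautifulSoup(single_season_html, "html.parser")))
print(count_seasons(BeautifulSoup(multi_season_html, "html.parser")))
```

Returning `None` when nothing matches also avoids the `KeyError` that `print(additional_info['Numero stagioni'])` would raise if neither branch set the key.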

The next problem is that the output looks like this:

62
TopTop-rated
45 
United States
LanguagesEnglishSpanish
6
TopTop-rated
50 
United KingdomGermanyFranceChinaUnited States
LanguagesEnglishFrench
11
TopTop-rated
49 
United KingdomCanadaUnited StatesJapan
LanguageEnglish
10
TopTop-rated
59 
United KingdomUnited States
LanguagesEnglishDutchFrenchGermanLithuanian
5
TopTop-rated
41 
United StatesUnited Kingdom
LanguagesEnglishRussianUkrainian
60
TopTop-rated
1 hour
United States
LanguagesEnglishGreekMandarinSpanish

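So when there is more than one element, the countries and languages are not separated, and I don't know how to use "split" in this case. I would also like to remove the "Languages" label that precedes the languages.

Finally, should I convert the runtime from "1 hour" to "60 minutes" so that my analysis is easier? If so, how can I do that? Thanks.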
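
On the runtime point, keeping every value in one unit (minutes, as an integer) generally makes numeric analysis in pandas simpler. A minimal sketch of such a conversion follows. The helper name `runtime_to_minutes` is hypothetical, and it assumes the scraped strings look like the ones in the output above ("45 ", "1 hour") or like "1 hour 30 minutes"; other formats would need extra handling.

```python
import re


def runtime_to_minutes(text):
    """Convert strings such as '1 hour', '45 ' or '1 hour 30 minutes' to minutes."""
    hours = re.search(r"(\d+)\s*h", text)    # "1 hour", "2 hours", "1h"
    minutes = re.search(r"(\d+)\s*m", text)  # "30 minutes", "30m"

    if hours is None and minutes is None:
        # A bare number (what is left after splitting on 'minutes') is taken as minutes.
        bare = re.fullmatch(r"\s*(\d+)\s*", text)
        return int(bare.group(1)) if bare else None

    total = 0
    if hours:
        total += 60 * int(hours.group(1))
    if minutes:
        total += int(minutes.group(1))
    return total


for sample in ["45 ", "1 hour", "1 hour 30 minutes"]:
    print(repr(sample), "->", runtime_to_minutes(sample))
```

The function could be applied to the value stored under `'Durata per episodio'` before it is added to the dictionary.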

python web-scraping beautifulsoup
1 Answer
0 votes

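I don't understand what your first question is, but here is the answer to your second one. The languages and countries are concatenated because you call `get_text` on the whole element, so it glues all of the element's text pieces together. To get a single country, use the following line: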

    additional_info['Paese origine'] = soup.select_one('[data-testid="title-details-origin"]').find('a').get_text()

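That is, navigate to the first `a` element inside it and take its text. To get all the languages, find all the `a` elements inside it and collect their text: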

    additional_info['Lingua originale'] = [e.get_text() for e in soup.select_one('[data-testid="title-details-languages"]').find_all('a')]
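
The same idea can be wrapped in a small helper that works for both countries and languages and returns an empty list when the element is missing, instead of raising `AttributeError`. This is only a sketch: the name `link_texts` and the simplified sample markup are made up for illustration, and it assumes the "Languages" / "Countries of origin" label is not itself an `a` element on the real page.

```python
from bs4 import BeautifulSoup


def link_texts(soup, testid):
    """Return the text of every <a> inside the element with the given data-testid."""
    element = soup.select_one(f'[data-testid="{testid}"]')
    if element is None:
        return []
    return [a.get_text(strip=True) for a in element.find_all("a")]


# Simplified, hypothetical markup; the real IMDb page has more attributes.
sample_html = """
<ul>
  <li data-testid="title-details-origin">
    <span>Countries of origin</span>
    <ul><li><a>United Kingdom</a></li><li><a>United States</a></li></ul>
  </li>
  <li data-testid="title-details-languages">
    <span>Languages</span>
    <ul><li><a>English</a></li><li><a>French</a></li></ul>
  </li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

countries = link_texts(sample_soup, "title-details-origin")
languages = link_texts(sample_soup, "title-details-languages")
print(countries)
print(", ".join(languages))  # a single string is often easier to store in a DataFrame cell
```

Because only the link texts are collected, the "Languages" label is dropped without any string splitting.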
