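I am trying to scrape the number of episodes, number of seasons, runtime, country of origin and language from the IMDb website. This is the code I am using.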
import requests, json
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.imdb.com/chart/toptv/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}
soup = BeautifulSoup(requests.get(url,headers=headers).content,"html.parser")
def extract_additional_info(link):
    additional_info = {}
    soup = BeautifulSoup(requests.get(link, headers=headers).content, "html.parser")
    # Number of episodes
    episode_element = soup.select_one('[data-testid="episodes-header"]')
    if episode_element:
        additional_info['Numero episodi'] = episode_element.get_text().split('Episodes')[1]
        print(additional_info['Numero episodi'])
    else:
        additional_info['Numero episodi'] = None
    # Number of seasons
    season_select = soup.find('select', id='browse-episodes-season')
    if season_select:
        aria_label = season_select['aria-label']
        additional_info['Numero stagioni'] = aria_label.split()[0] if aria_label else None
    # 1 season
    season_element = soup.select_one('[data-testid="episodes-browse-episodes"]')
    if season_element:
        stagione = season_element.find('span', class_='ipc-btn__text').get_text()
        additional_info['Numero stagioni'] = stagione.split()[0] if stagione else None
    print(additional_info['Numero stagioni'])
    # Episode duration
    additional_info['Durata per episodio'] = soup.select_one('[data-testid="title-techspec_runtime"]').get_text().split('Runtime')[1]
    additional_info['Durata per episodio'] = additional_info['Durata per episodio'].split('minutes')[0]
    print(additional_info['Durata per episodio'])
    # Country of origin
    additional_info['Paese origine'] = soup.select_one('[data-testid="title-details-origin"]').get_text().split('of origin')[1]
    print(additional_info['Paese origine'])
    # Original language
    additional_info['Lingua originale'] = soup.select_one('[data-testid="title-details-languages"]').get_text()
    print(additional_info['Lingua originale'])
    return additional_info
data = []
for link in soup.select('a[href^="/title"]:has(h3)')[:25]:
    serie_info = json.loads(
        BeautifulSoup(
            requests.get(f'https://www.imdb.com{link.get("href")}', headers=headers).content,
            "html.parser"
        ).find("script", {"type": "application/ld+json"}).text
    )
    additional_info = extract_additional_info(f'https://www.imdb.com{link.get("href")}')
    if additional_info is not None:
        serie_info.update(additional_info)
    data.append(serie_info)
dati = pd.json_normalize(data)
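I am having trouble scraping the number of seasons: when a TV show has only one season, this information sits at a different path. I tried to handle that with an "if", but it does not work; instead of the text "1 Season" it finds "TopTop-rated".
The HTML for this information is: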
<div class="sc-e1ed1839-0 fRlQPg episodes-browse-episodes" data-testid="episodes-browse-episodes">
  <div class="sc-e1ed1839-1 ihQgyV"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Browse episodes</font></font></div>
  <div class="sc-e1ed1839-4 jngMUB">
    <a class="ipc-btn ipc-btn--single-padding ipc-btn--center-align-content ipc-btn--default-height ipc-btn--core-base ipc-btn--theme-base ipc-btn--on-accent2 ipc-text-button" role="button" tabindex="0" aria-disabled="false" href="/title/tt5491994/episodes/?topRated=DESC&ref_=tt_eps_rhs_sm">
      <span class="ipc-btn__text">
        <span class="sc-e1ed1839-3 hWiAfx"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Start</font></font></span>
        <span class="sc-e1ed1839-2 cloUKk"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Most voted</font></font></span>
      </span>
    </a>
    <a class="ipc-btn ipc-btn--single-padding ipc-btn--center-align-content ipc-btn--default-height ipc-btn--core-base ipc-btn--theme-base ipc-btn--on-accent2 ipc-text-button" role="button" tabindex="0" aria-disabled="false" href="/title/tt5491994/episodes?season=1&ref_=tt_eps_sn_1">
      <span class="ipc-btn__text"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">1 Season</font></font></span>
    </a>
    <span class="ipc-simple-select__container">
      <span class="ipc-simple-select ipc-simple-select--base ipc-simple-select--on-accent2">
        <label for="browse-episodes-year" class="ipc-simple-select__label"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">2 years</font></font></label>
        <svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" class="ipc-icon ipc-icon--arrow-drop-down ipc-simple-select__icon--post" viewBox="0 0 24 24" fill="currentColor" role="presentation"><path fill="none" d="M0 0h24v24H0V0z"></path><path d="M8.71 11.71l2.59 2.59c.39.39 1.02.39 1.41 0l2.59-2.59c.63-.63.18-1.71-.71-1.71H9.41c-.89 0-1.33 1.08-.7 1.71z"></path></svg>
        <select id="browse-episodes-year" aria-label="2 years" class="ipc-simple-select__input">
          <option selected="" value=""></option>
          <option value="2017">2017</option>
          <option value="2016"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">2016</font></font></option>
          <option value="SEE_ALL"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">View all</font></font></option>
        </select>
      </span>
    </span>
  </div>
</div>
The next problem is that the output looks like this:
62
TopTop-rated
45
United States
LanguagesEnglishSpanish
6
TopTop-rated
50
United KingdomGermanyFranceChinaUnited States
LanguagesEnglishFrench
11
TopTop-rated
49
United KingdomCanadaUnited StatesJapan
LanguageEnglish
10
TopTop-rated
59
United KingdomUnited States
LanguagesEnglishDutchFrenchGermanLithuanian
5
TopTop-rated
41
United StatesUnited Kingdom
LanguagesEnglishRussianUkrainian
60
TopTop-rated
1 hour
United States
LanguagesEnglishGreekMandarinSpanish
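So when there is more than one item, the countries and languages are not separated, and I do not know how to use "split" in this case. I would also like to remove the "Languages" label that precedes the languages.
Finally, should I convert the runtime from "1 hour" to "60 minutes" to make my analysis easier? If so, how can I do that? Thanks.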
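I do not understand what your first question is, but here is the answer to your second. The languages and countries are concatenated because you call get_text on the whole element, so it glues all of the text nodes together. To get a single country (the first one listed), use the following line: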
additional_info['Paese origine'] = soup.select_one('[data-testid="title-details-origin"]').find('a').get_text()
That is, navigate to the first a element inside it and get its text.
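To get all languages, find all of the a elements inside it and get their text (the same find_all pattern should work for the countries too):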
additional_info['Lingua originale'] = [e.get_text() for e in soup.select_one('[data-testid="title-details-languages"]').find_all('a')]
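The two points not covered above, the single-season count and the runtime conversion, could be handled along these lines. This is only a sketch under assumptions: parse_season_count and runtime_to_minutes are hypothetical helper names, the markup is taken to be the one pasted in the question, and the runtime text is assumed to look like "45 minutes", "1 hour" or "1 hour 5 minutes". In the question's code, find('span', class_='ipc-btn__text') returns the first matching span, which belongs to the "Top-rated" button, so the idea is to look at every button label and keep the one that reads like a season count.

```python
import re


def parse_season_count(labels):
    """Return the season count found in labels such as '1 Season', or None.

    Assumed hookup with BeautifulSoup (based on the markup in the question):
        labels = [s.get_text(strip=True)
                  for s in season_element.select('span.ipc-btn__text')]
    """
    for label in labels:
        # Accept only labels that consist of a number followed by "season(s)".
        match = re.fullmatch(r"\s*(\d+)\s+seasons?\s*", label, flags=re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None


def runtime_to_minutes(text):
    """Convert runtime text such as '1 hour' or '1 hour 5 minutes' to minutes.

    Returns None when neither an hour nor a minute figure is found.
    """
    hours = re.search(r"(\d+)\s*h", text, flags=re.IGNORECASE)
    minutes = re.search(r"(\d+)\s*m", text, flags=re.IGNORECASE)
    if not hours and not minutes:
        return None
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


if __name__ == "__main__":
    print(parse_season_count(["TopTop-rated", "1 Season"]))  # 1
    print(runtime_to_minutes("1 hour"))                      # 60
    print(runtime_to_minutes("45 minutes"))                  # 45
```

The single-season lookup would probably best run only when the select element with id browse-episodes-season is missing, so that it cannot overwrite a count that was already found. runtime_to_minutes expects the text as it is before the split('minutes') call, because that call removes the unit that separates hours from minutes.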