I want to save some award information from the IMDB website, but I can't access the JavaScript text that contains it.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.imdb.com/event/ev0000003/2000',
    'https://www.imdb.com/event/ev0000003/2001',
]

for url in urls:
    # Fetch the page for this year (the original used an undefined name, url_test)
    response = requests.get(url).content
    soup = BeautifulSoup(response, 'html.parser')
    # The award data sits inside one of these <script> tags
    scripts = soup.find_all('script', {'type': 'text/javascript'})
Now, how can I access only the category information:
"categories":[{"categoryName":"Best Actor in a Leading Role","nominations":[{"primaryNominees":[{"name":"Kevin Spacey","note":null,"imageUrl":.....
Since I will have to collect different awards and years, my idea is to save them in a JSON file:
{"award": "oscars",
"year": "2000",
"data": [{"categoryName":"Best Actor in a Leading Role","nominations":[{"primaryNominees":[{"name":"Kevin Spacey","note":null,"imageUrl":.....
}
You can pull the data out of the page with the re and json modules. For example:
import re
import json
import requests

urls = [
    'https://www.imdb.com/event/ev0000003/2000',
    'https://www.imdb.com/event/ev0000003/2001',
]

for url in urls:
    response = requests.get(url).text
    # Capture the JSON object passed to IMDbReactWidgets.NomineesWidget.push(...)
    data = json.loads(re.findall(r'IMDbReactWidgets\.NomineesWidget\.push.*?(\{.*\})', response)[0])
    # print(json.dumps(data, indent=4))  # <-- uncomment this to print all data
    for award in data['nomineesWidgetModel']['eventEditionSummary']['awards']:
        # Skip everything that is not the Oscar award itself
        if award['awardName'] != 'Oscar':
            continue
        for category in award['categories']:
            print(category['categoryName'])
    print('-' * 80)
This prints:
Best Actor in a Leading Role
Best Actor in a Supporting Role
Best Actress in a Leading Role
Best Actress in a Supporting Role
Best Art Direction-Set Decoration
Best Cinematography
Best Costume Design
Best Director
Best Documentary, Features
Best Documentary, Short Subjects
Best Effects, Sound Effects Editing
Best Effects, Visual Effects
Best Film Editing
Best Foreign Language Film
...and so on.
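To end up with the JSON layout from the question (award, year, data), the same parsing step could be wrapped in small helpers that build one record per year and write it to a file. This is only a sketch: the helper names (extract_widget_data, build_record, save_record), the 'oscars' label, and the SAMPLE_HTML string are assumptions made for illustration, and it presumes the page still embeds its data in an IMDbReactWidgets.NomineesWidget.push(...) call on a single line.

```python
import json
import re
from pathlib import Path

# Same pattern as in the answer: capture the JSON object passed to push(...)
WIDGET_RE = re.compile(r'IMDbReactWidgets\.NomineesWidget\.push.*?(\{.*\})')

# Made-up, heavily shortened stand-in for a real page, used only for offline checks
SAMPLE_HTML = (
    '<script>IMDbReactWidgets.NomineesWidget.push(["center-3-react",'
    '{"nomineesWidgetModel":{"eventEditionSummary":{"awards":['
    '{"awardName":"Oscar","categories":['
    '{"categoryName":"Best Actor in a Leading Role","nominations":[]}]},'
    '{"awardName":"Other","categories":['
    '{"categoryName":"Some Other Category","nominations":[]}]}'
    ']}}}]);</script>'
)


def extract_widget_data(html):
    """Return the embedded widget JSON as a dict, or None if it is not found."""
    match = WIDGET_RE.search(html)
    return json.loads(match.group(1)) if match else None


def build_record(data, year, award_name='Oscar', label='oscars'):
    """Collect the categories of one award into the layout from the question."""
    awards = data['nomineesWidgetModel']['eventEditionSummary']['awards']
    categories = []
    for award in awards:
        if award['awardName'] == award_name:
            categories.extend(award['categories'])
    return {'award': label, 'year': str(year), 'data': categories}


def save_record(record, path):
    """Write one record to a JSON file."""
    Path(path).write_text(json.dumps(record, indent=4), encoding='utf-8')
```

Inside the loop over urls, the year can be taken from the end of the URL with url.rsplit('/', 1)[-1], and each record saved under its own name, for example save_record(build_record(extract_widget_data(response), year), f'oscars_{year}.json').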