Accessing JavaScript text with Beautiful Soup

Question (votes: 1, answers: 1)

I want to save some award information from the IMDb website, but I can't get at the JavaScript text I need.

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.imdb.com/event/ev0000003/2000',
    'https://www.imdb.com/event/ev0000003/2001',
]

for url in urls:
    response = requests.get(url).content
    soup = BeautifulSoup(response, 'html.parser')
    soup.find_all('script', {'type':'text/javascript'})


Now, how do I access only the category information:

"categories":[{"categoryName":"Best Actor in a Leading Role","nominations":[{"primaryNominees":[{"name":"Kevin Spacey","note":null,"imageUrl":.....  

Since I will need to fetch several awards and years, my idea is to save them in JSON files:

{"award": "oscars",  
 "year": "2000",  
 "data": [{"categoryName":"Best Actor in a Leading Role","nominations":[{"primaryNominees":[{"name":"Kevin Spacey","note":null,"imageUrl":.....  
}
python json web-scraping beautifulsoup
1 answer
1 vote
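The data is stored in the page's JavaScript, so you can extract it with, for example, a regular expression. To parse the extracted text, you can use the json module.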

For example:

import re
import json
import requests

urls = [
    'https://www.imdb.com/event/ev0000003/2000',
    'https://www.imdb.com/event/ev0000003/2001',
]

for url in urls:
    response = requests.get(url).text

    # The widget data is a JSON object passed to NomineesWidget.push(...)
    data = json.loads(
        re.findall(r'IMDbReactWidgets\.NomineesWidget\.push.*?(\{.*\})', response)[0]
    )

    # print(json.dumps(data, indent=4))  # <-- uncomment this to print all data

    for award in data['nomineesWidgetModel']['eventEditionSummary']['awards']:
        # Skip awards other than the Oscar itself
        if award['awardName'] != 'Oscar':
            continue
        for category in award['categories']:
            print(category['categoryName'])

    # Separator between the output for each URL
    print('-' * 80)

Prints:

Best Actor in a Leading Role
Best Actor in a Supporting Role
Best Actress in a Leading Role
Best Actress in a Supporting Role
Best Art Direction-Set Decoration
Best Cinematography
Best Costume Design
Best Director
Best Documentary, Features
Best Documentary, Short Subjects
Best Effects, Sound Effects Editing
Best Effects, Visual Effects
Best Film Editing
Best Foreign Language Film

...and so on.
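
To get from the category list to the {"award", "year", "data"} layout described in the question, something along these lines might work. This is only a sketch: SAMPLE_HTML is a made-up stand-in for a downloaded page (the real IMDb markup may differ), and extract_widget_data, build_record and save_record are hypothetical helper names. Instead of the greedy regex, it uses json.JSONDecoder.raw_decode to parse a single JSON object starting at the first "{" after the push call, which should be a little less sensitive to whatever follows the object in the script.

```python
import json
import re

# Made-up stand-in for a downloaded page; the real IMDb markup may differ.
SAMPLE_HTML = """
<script type="text/javascript">
IMDbReactWidgets.NomineesWidget.push(['center-3-react', {"nomineesWidgetModel": {"eventEditionSummary": {"awards": [{"awardName": "Oscar", "categories": [{"categoryName": "Best Actor in a Leading Role", "nominations": []}]}]}}}]);
</script>
"""


def extract_widget_data(html):
    """Return the JSON object passed to NomineesWidget.push, or None."""
    match = re.search(r'IMDbReactWidgets\.NomineesWidget\.push\(', html)
    if match is None:
        return None
    start = html.find('{', match.end())
    if start == -1:
        return None
    # raw_decode parses one JSON value at `start` and ignores what follows it.
    data, _end = json.JSONDecoder().raw_decode(html, start)
    return data


def build_record(data, award_name, award_label, year):
    """Collect the categories of one award into the layout from the question."""
    awards = data['nomineesWidgetModel']['eventEditionSummary']['awards']
    categories = []
    for award in awards:
        if award['awardName'] == award_name:
            categories.extend(award['categories'])
    return {'award': award_label, 'year': year, 'data': categories}


def save_record(record, path):
    """Write one record to a JSON file (hypothetical helper, not called here)."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(record, f, ensure_ascii=False, indent=2)


record = build_record(extract_widget_data(SAMPLE_HTML), 'Oscar', 'oscars', '2000')
print(json.dumps(record, indent=2))
```

In the loop over urls you would pass requests.get(url).text in place of SAMPLE_HTML, take the year from the last path segment of the URL, and call save_record once per year.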
