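I am trying to scrape the news headlines from this page. The headlines appear to be contained in a JSON object named App inside a pair of script tags. In case you are reading this in the future, you can assume it looks like this: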
App = {
"page":{"lang":"en","error":{"state":false,"type":null}},
"system":{"referrer":null,"cookie":[],"params":{"get":[],"post":[]}},
"components":{
"search-fast-links":[{"name":"FY 2022 preliminary financial results","link":"\/en\/investors-and-media\/news\/press-releases\/08-02-2023\/","detail":""},{"name":"Re-domiciliation Q&A","link":"\/en\/investors-and-media\/shareholder-centre\/current-qa\/","detail":""}],
"press-release":{
"items":[
{
"name":"Q4 and FY 2023 production results","date":1706648400,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/31-01-2024\/?","theme":["Production results"],"files":[[{"name":"2024_01_31_Q4_Production_results_eng","type":"pdf","size":"402.15 Kb","link":"\/upload\/ib\/1\/24-01-31\/2024_01_31_Q4_Production_results_eng.pdf"}]]
},{
"name":"Notice regarding a change of a major shareholder","date":1706475600,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/29-01-2024\/?","theme":["Regulatory disclosures","Shareholder information"],"files":[[{"name":"2024_01_29_Notice regarding_a_change_of_a_major_shareholder_eng","type":"pdf","size":"279.34 Kb","link":"\/upload\/ib\/1\/24-01-29\/2024_01_29_Notice regarding_a_change_of_a_major_shareholder_eng.pdf"}]]
},{
"name":"Nominated brokers for the purpose of the Exchange Offer","date":1705525200,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/18-01-2024\/?","theme":["Shareholder information"],"files":[[{"name":"2024_01_18_Nominated_brokers_eng","type":"pdf","size":"202.73 Kb","link":"\/upload\/ib\/1\/24-01-18\/2024_01_18_Nominated_brokers_eng.pdf"}]]
},{
"name":"Total Voting Rights as at 29 December 2023 ","date":1703797200,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/29-12-2023\/?","theme":["Regulatory disclosures"],"files":[[{"name":"2023_12_29_TVR_eng","type":"pdf","size":"114.49 Kb","link":"\/upload\/ib\/1\/23-12-29\/2023_12_29_TVR_eng.pdf"}]]
},{
...
}
],
...
},
...
}
};
My question is: what is the best way to get this into Python and have it recognized as JSON? I tried stripping the leading App = and the trailing ; and putting the remaining JSON into a variable named string, which I then passed to json.loads. But I get errors like this:
>>> json.loads(string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.9/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 2 column 10378 (char 10378)
I don't know whether you are using Selenium or BeautifulSoup, but try:
import json

import requests
from bs4 import BeautifulSoup

url = 'https://polymetalinternational.com/en/investors-and-media/news/press-releases/'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # Match only the <script> tag whose contents assign the 'App' object
    def filter_scripts(text):
        if text:
            return "App = " in text
        return False

    script_tag = soup.find('script', string=filter_scripts)

    # Keep everything that follows 'App = '
    app_content = script_tag.text.split('App = ')[1].strip()

    # Cut at the first '};' (the end of the object literal), restore the
    # closing brace, and load the result as JSON
    app_data = json.loads(app_content.replace("\n", "").replace("\r", "").split("};")[0] + "}")

    headlines = app_data['components']['press-release']['items']
    for headline in headlines:
        print(f"Name: {headline['name']}")
        print(f"Date: {headline['date']}")
        print(f"Link: {headline['link']}")
        print("------")
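Cutting the script at the first "};" works only as long as that sequence never appears inside one of the JSON strings. A possibly more forgiving variant is to let the standard library find the end of the object: json.JSONDecoder().raw_decode parses a single JSON value starting at a given index and ignores whatever follows it. The sketch below is only an illustration under assumptions: extract_app is a hypothetical helper name, and sample is a made-up stand-in for the script contents, shaped like the snippet in the question rather than copied from the live page.

```python
import json
import re


def extract_app(script_text, name="App"):
    """Return the object assigned to `name` in a JavaScript snippet, parsed as JSON.

    Hypothetical helper: assumes the right-hand side of the assignment is valid JSON.
    """
    # Locate the assignment, tolerating optional whitespace around '='.
    match = re.search(r"\b" + re.escape(name) + r"\s*=\s*", script_text)
    if match is None:
        raise ValueError(f"no assignment to {name!r} found")
    # raw_decode parses one JSON value starting at the given index and also
    # returns the index where it stopped, so trailing JavaScript is ignored.
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj


# Made-up stand-in for the script contents (an assumption, not the real page).
sample = (
    'var x = 1;\n'
    'App = {"components": {"press-release": {"items": ['
    '{"name": "Q4 and FY 2023 production results", "date": 1706648400, '
    '"link": "\\/en\\/investors-and-media\\/news\\/press-releases\\/31-01-2024\\/?"}'
    ']}}};\n'
    'console.log("a string containing }; does not matter here");\n'
)

items = extract_app(sample)["components"]["press-release"]["items"]
for item in items:
    print(item["name"], item["date"], item["link"])
```

With the real page, the string passed to extract_app would be the contents of the script tag found by BeautifulSoup, in place of sample.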