How do I parse JSON from a website into Python?


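I am trying to scrape news headlines from this page. The headlines appear to be contained in a JSON object named `App` inside a pair of script tags. If you are reading this in the future, you can assume it looks like this: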

    App = {
        "page":{"lang":"en","error":{"state":false,"type":null}},
        "system":{"referrer":null,"cookie":[],"params":{"get":[],"post":[]}},
        "components":{
            "search-fast-links":[{"name":"FY 2022 preliminary financial results","link":"\/en\/investors-and-media\/news\/press-releases\/08-02-2023\/","detail":""},{"name":"Re-domiciliation Q&A","link":"\/en\/investors-and-media\/shareholder-centre\/current-qa\/","detail":""}],
            "press-release":{
                "items":[
                    {
                        "name":"Q4 and FY 2023 production results","date":1706648400,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/31-01-2024\/?","theme":["Production results"],"files":[[{"name":"2024_01_31_Q4_Production_results_eng","type":"pdf","size":"402.15 Kb","link":"\/upload\/ib\/1\/24-01-31\/2024_01_31_Q4_Production_results_eng.pdf"}]]
                    },{
                        "name":"Notice regarding a change of a major shareholder","date":1706475600,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/29-01-2024\/?","theme":["Regulatory disclosures","Shareholder information"],"files":[[{"name":"2024_01_29_Notice regarding_a_change_of_a_major_shareholder_eng","type":"pdf","size":"279.34 Kb","link":"\/upload\/ib\/1\/24-01-29\/2024_01_29_Notice regarding_a_change_of_a_major_shareholder_eng.pdf"}]]
                    },{
                        "name":"Nominated brokers for the purpose of the Exchange Offer","date":1705525200,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/18-01-2024\/?","theme":["Shareholder information"],"files":[[{"name":"2024_01_18_Nominated_brokers_eng","type":"pdf","size":"202.73 Kb","link":"\/upload\/ib\/1\/24-01-18\/2024_01_18_Nominated_brokers_eng.pdf"}]]
                    },{
                        "name":"Total Voting Rights as at 29 December 2023 ","date":1703797200,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/29-12-2023\/?","theme":["Regulatory disclosures"],"files":[[{"name":"2023_12_29_TVR_eng","type":"pdf","size":"114.49 Kb","link":"\/upload\/ib\/1\/23-12-29\/2023_12_29_TVR_eng.pdf"}]]
                    },{
                        ...
                    }
                ],
                ...
            },
            ...
        }
    };

My question is: what is the best way to get this into Python and have its contents recognized as JSON?

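  1. I looked at js2py, but could not find anything in it that does what I want.
  2. I also tried string.replace. After replacing all the JavaScript booleans and nulls with their Python equivalents, I was able to get it through json.load. My concern is that simply replacing every substring 'false' with 'False' and 'null' with 'None' is fragile: the data may change in the future so that "false" or "null" appears in the middle of some non-boolean string, and the replacement would then alter the content in unpredictable ways.
  3. I also looked at this question, which at first glance seems to be the same problem, but the answers given are specific to the JSON data the OP provided. An answer that is independent of the actual content and works for any JSON would be preferable.
  4. I tried removing `App =` and `;`, putting the JSON into a variable named `string`, and passing it to json.loads. But I got errors: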
>>> json.loads(string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.9/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 2 column 10378 (char 10378)
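
For reference, the concerns in items 2 and 4 can be checked against a small sample. The sketch below is only an illustration: `script_text` is a hypothetical, shortened stand-in for the script contents, not the real page. It relies on two documented behaviours of the standard `json` module: the JSON literals `false` and `null` are mapped to `False` and `None` by the decoder itself, so no string replacement is needed, and `JSONDecoder.raw_decode` parses a single JSON value from a given offset and reports where it stopped, so the trailing `;` does not have to be trimmed by hand.

```python
import json

# Hypothetical, shortened stand-in for the contents of the <script> tag.
script_text = r'''
App = {
    "page": {"lang": "en", "error": {"state": false, "type": null}},
    "components": {"press-release": {"items": [
        {"name": "Q4 and FY 2023 production results", "date": 1706648400,
         "link": "\/en\/investors-and-media\/news\/press-releases\/31-01-2024\/?"}
    ]}}
};
'''

# Find where the object literal begins, just after "App =".
start = script_text.index("{", script_text.index("App ="))

# raw_decode parses one JSON value starting at `start` and returns it
# together with the index at which parsing stopped.
app, end = json.JSONDecoder().raw_decode(script_text, start)

print(app["page"]["error"])           # {'state': False, 'type': None}
print(app["components"]["press-release"]["items"][0]["link"])
print(repr(script_text[end:].strip()))  # ';'
```

The escaped slashes (`\/`) are valid JSON escapes and decode to plain `/`. Whether the real page contains anything that is not valid JSON inside the object is not something this sample can show.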
1 Answer

I don't know whether you are using Selenium or BeautifulSoup, but try:

    import json

    import requests
    from bs4 import BeautifulSoup

    url = 'https://polymetalinternational.com/en/investors-and-media/news/press-releases/'
    response = requests.get(url)
    response.raise_for_status()  # stop early on an HTTP error

    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the script tag whose text contains the 'App' object
    def filter_scripts(text):
        return bool(text) and "App = " in text

    script_tag = soup.find('script', string=filter_scripts)
    if script_tag is None:
        raise ValueError("No script tag containing 'App = ' was found")

    # Everything after 'App = ' starts with the JSON object
    app_content = script_tag.string.split('App = ', 1)[1].lstrip()

    # raw_decode parses the first JSON value and ignores whatever follows it
    # (the trailing ';' and any other JavaScript), so nothing is trimmed by hand
    app_data, _ = json.JSONDecoder().raw_decode(app_content)

    headlines = app_data['components']['press-release']['items']

    for headline in headlines:
        print(f"Name: {headline['name']}")
        print(f"Date: {headline['date']}")
        print(f"Link: {headline['link']}")
        print("------")
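
The `date` and `link` fields printed above are raw values. As a possible follow-up, the sketch below turns them into something more readable. It rests on an assumption the question does not confirm: that `date` is a count of seconds since the Unix epoch. The item is copied from the sample in the question with some fields trimmed, and `BASE_URL` is simply the page URL used above.

```python
from datetime import datetime, timezone
from urllib.parse import urljoin

BASE_URL = "https://polymetalinternational.com/en/investors-and-media/news/press-releases/"

# One item copied from the sample in the question (fields trimmed).
item = {
    "name": "Q4 and FY 2023 production results",
    "date": 1706648400,
    "link": "/en/investors-and-media/news/press-releases/31-01-2024/?",
}

# Assumption: "date" is seconds since the Unix epoch.
published = datetime.fromtimestamp(item["date"], tz=timezone.utc)

# urljoin resolves the site-relative link against the page URL.
full_link = urljoin(BASE_URL, item["link"])

print(published.isoformat())  # 2024-01-30T21:00:00+00:00
print(full_link)
```

Under that assumption the timestamp falls on 30 January 2024 at 21:00 UTC, which is midnight on 31 January in a UTC+3 time zone and so lines up with the `31-01-2024` in the link.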
