How do I scrape the headline from a BBC article using Python and Beautiful Soup?


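I previously built a BBC scraper that, among other things, scrapes the headline from a given article, such as this one. However, the BBC recently changed its website, so I need to modify my scraper, and that is proving difficult. For example, suppose I want to scrape the headline from the article mentioned above. Inspecting the HTML with Firefox, I found the corresponding HTML attribute, namely data-component="headline-block" (see the line highlighted in blue in the image).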

To extract the corresponding tag, I would do this:

import requests

from bs4 import BeautifulSoup

url = 'https://www.bbc.com/news/world-africa-68504329'

# extract html
html = requests.get(url).text

# parse html
soup = BeautifulSoup(html, 'html.parser')

# extract headline from soup
head = soup.find(attrs = {'data-component': 'headline-block'})

But when I print the value of head, it returns None, which means Beautiful Soup cannot find the tag. What am I missing, and how can I fix it?
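One quick way to narrow down a None result like this is to check whether the attribute appears at all in the raw HTML that requests receives, as opposed to the DOM that the browser inspector shows after scripts have run. The following is only a minimal sketch: the helper name diagnose_html and both sample strings are hypothetical, made up for illustration, and are not taken from the BBC page.

```python
def diagnose_html(html: str, marker: str) -> dict:
    """Report whether `marker` occurs in the raw HTML text.

    Hypothetical helper for illustration; it does plain string checks
    and does not parse the document.
    """
    return {
        "marker_present": marker in html,
        "script_tags": html.count("<script"),
        "length": len(html),
    }


MARKER = 'data-component="headline-block"'

# Hypothetical sample: the headline is part of the served HTML.
server_rendered = '<h1 data-component="headline-block">Example headline</h1>'

# Hypothetical sample: the headline only exists as data inside a script tag.
client_rendered = (
    '<div id="root"></div>'
    '<script type="application/json">{"headline": "Example headline"}</script>'
)

for name, html in [("server", server_rendered), ("client", client_rendered)]:
    print(name, diagnose_html(html, MARKER))
```

In real use, the html argument would be the text returned by requests.get(url).text. If the marker is absent there but visible in the browser inspector, the element is most likely created by JavaScript after the page loads, and the data has to be found elsewhere in the response.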

python web-scraping beautifulsoup
1 Answer

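The data you see on the page is stored inside the page as JSON, so it is not present as ordinary HTML tags that Beautiful Soup can find. To get the headline and the article text, you can use the following example: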

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.bbc.com/news/world-africa-68504329"

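# Download the page and parse it
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# The article data is embedded as JSON in the element with id "__NEXT_DATA__"
data = json.loads(soup.select_one("#__NEXT_DATA__").text)

# Uncomment to inspect the full JSON structure:
# print(json.dumps(data, indent=4))

# Pick the first entry whose key starts with "@"; it holds the article contents
page = next(
    v for k, v in data["props"]["pageProps"]["page"].items() if k.startswith("@")
)
# Note: match/case requires Python 3.10 or newer
for c in page["contents"]:
    match c["type"]:
        case "headline":
            # Print the headline followed by an empty line
            print(c["model"]["blocks"][0]["model"]["text"])
            print()
        case "text":
            # Print the first block of each text item on one line
            print(c["model"]["blocks"][0]["model"]["text"], end=" ")

print()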

Prints:

Kuriga kidnap: More than 280 Nigerian pupils abducted

More than 280 Nigerian school pupils have been abducted in the north-western town of Kuriga, officials say.  The pupils were in the assembly ground around 08:30 (07:30 GMT) when dozens of gunmen on motorcycles rode through the school, one witness said. The students, between the ages of eight and 15, were taken away, along with a teacher, they added. Kidnap gangs, known as bandits, have seized thousands of people in recent years, especially the north-west. However, there had been a reduction in the mass abduction of children over the past year until this week. Those kidnapped are usually freed after a ransom is paid. The mass abduction was

...
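The nested lookups above raise KeyError or StopIteration as soon as the site changes its JSON layout again. A more defensive variant might look like the sketch below. It assumes the same nesting as the answer's code (props, pageProps, page, a key starting with "@", then contents, model, blocks); the function name extract_article and the sample dictionary are hypothetical and hand-made for illustration, not real BBC data. It uses if/elif instead of match, so it also runs on Python versions older than 3.10, and it joins every block of an item rather than only the first.

```python
def extract_article(data):
    """Return (headline, paragraphs) from a dict shaped like the page JSON.

    Hypothetical helper: missing keys yield (None, []) instead of raising.
    """
    pages = data.get("props", {}).get("pageProps", {}).get("page", {})
    # Pick the first entry whose key starts with "@", or None if absent
    page = next((v for k, v in pages.items() if k.startswith("@")), None)
    if page is None:
        return None, []

    headline = None
    paragraphs = []
    for item in page.get("contents", []):
        blocks = item.get("model", {}).get("blocks", [])
        texts = [b.get("model", {}).get("text") for b in blocks]
        # Join all non-empty block texts of this item
        text = " ".join(t for t in texts if t)
        if not text:
            continue
        if item.get("type") == "headline" and headline is None:
            headline = text
        elif item.get("type") == "text":
            paragraphs.append(text)
    return headline, paragraphs


# Hand-made sample with the assumed structure (not real BBC data)
sample = {
    "props": {"pageProps": {"page": {"@example-key": {"contents": [
        {"type": "headline",
         "model": {"blocks": [{"model": {"text": "Example headline"}}]}},
        {"type": "image", "model": {"blocks": []}},
        {"type": "text",
         "model": {"blocks": [{"model": {"text": "First paragraph."}},
                              {"model": {"text": "Second paragraph."}}]}},
    ]}}}}
}

headline, paragraphs = extract_article(sample)
print(headline)
print(" ".join(paragraphs))
```

With the real page, data would be the result of json.loads on the __NEXT_DATA__ element, as in the answer. Returning None for a missing headline makes a layout change visible to the caller without crashing the scraper.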