I need to scrape text from a specific div

Question (votes: 0, answers: 2)

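I have to scrape the text from the div that contains all of the article text, but the div's class names are not unique, so I tried a CSS selector. It returns an empty list.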

import requests
from bs4 import BeautifulSoup


def get_page_links(url):
    r = requests.get(url)
    sp = BeautifulSoup(r.text, 'lxml')
    links = sp.select('div.tdb-block-inner td-fix-index')
    print(links)


get_page_links(
    'https://insights.blackcoffer.com/ai-in-healthcare-to-improve-patient-outcomes/')
python html web-scraping beautifulsoup
2 Answers

1 vote

Try this CSS selector: .tdb_single_content .tdb-block-inner.td-fix-index

For example:

import requests
from bs4 import BeautifulSoup


headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48",
}

url = "https://insights.blackcoffer.com/ai-in-healthcare-to-improve-patient-outcomes/"
soup = (
    BeautifulSoup(requests.get(url, headers=headers).content, "lxml")
    .select_one(".tdb_single_content .tdb-block-inner.td-fix-index")
)
print(soup.select_one("p:nth-of-type(2)").get_text(strip=True))

Output:

“If anything kills over 10 million people in the next few decades, it will be a highly infectious virus rather than a war. Not missiles but microbes.” Bill Gates’s remarks at a TED conference in 2014, right after the world had avoided the Ebola outbreak. When the new, unprecedented, invisible virus hit us, it met an overwhelmed and unprepared healthcare system and oblivious population. This public health emergency demonstrated our lack of scientific consideration and underlined the alarming need for robust innovations in our health and medical facilities. For the past few years, artificial intelligence has proven to be of tangible potential in the healthcare sectors, clinical practices, translational medical and biomedical research.
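As a side note on why the selector in the question returned an empty list: a space inside a CSS selector is the descendant combinator, so 'div.tdb-block-inner td-fix-index' looks for a td-fix-index tag nested inside the div, not for a div carrying both classes. The sketch below illustrates this on a small hypothetical HTML fragment (SAMPLE_HTML is made up for illustration and only imitates the class names mentioned in this thread; it uses the built-in html.parser so lxml is not required).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the class names discussed in this thread.
SAMPLE_HTML = """
<div class="tdb-block-inner td-fix-index"><p>sidebar</p></div>
<div class="tdb_single_content">
  <div class="tdb-block-inner td-fix-index"><p>first</p><p>second</p></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Space = descendant combinator: this searches for a <td-fix-index> tag
# inside div.tdb-block-inner, and no such tag exists.
wrong = soup.select("div.tdb-block-inner td-fix-index")

# Classes chained with dots match elements that carry both classes.
both_classes = soup.select("div.tdb-block-inner.td-fix-index")

# Scoping by an ancestor class narrows the match to the article container.
article = soup.select(".tdb_single_content .tdb-block-inner.td-fix-index")

print(len(wrong), len(both_classes), len(article))  # prints: 0 2 1
```

On the real page the counts may differ, since the live markup is not reproduced here.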

0 votes

It is better to select the td-post-content class rather than tdb-block-inner, because elements with the tdb-block-inner class may be missing on other pages, for example on https://insights.blackcoffer.com/what-is-the-future-of-mobile-apps/.

The article body can be extracted like this:

import requests
from bs4 import BeautifulSoup


def get_article_text(url):
    r = requests.get(url)
    sp = BeautifulSoup(r.text, "lxml")
    # select() returns a list; exactly one article container is expected.
    blocks = sp.select(".td-post-content")
    assert len(blocks) == 1
    return blocks[0].get_text()


print(
    get_article_text(
        "https://insights.blackcoffer.com/ai-in-healthcare-to-improve-patient-outcomes/"
    )
)
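If a failed assertion is not the desired behaviour when a page lacks the container, one possible variation is to return None instead. The following is only a sketch under assumptions: extract_article_text is a hypothetical helper name, it takes an HTML string rather than a URL so it can be tried without network access, and it uses html.parser in place of lxml.

```python
from bs4 import BeautifulSoup


def extract_article_text(html, selector=".td-post-content"):
    """Return the text of the single element matching selector, else None.

    Hypothetical helper for illustration; not part of the answer above.
    """
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select(selector)
    if len(matches) != 1:
        # Zero or several candidates: let the caller decide what to do.
        return None
    # Strip each text fragment and join the non-empty ones with newlines.
    return matches[0].get_text(separator="\n", strip=True)


# Made-up sample input imitating the td-post-content container.
sample = (
    '<div class="td-post-content">'
    "<p>First paragraph.</p>\n<p>Second paragraph.</p>"
    "</div>"
)
print(extract_article_text(sample))
```

To use it on a live page, the HTML could be fetched first, for example with requests.get(url, headers=headers).text as in the earlier answer.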