Unable to extract text from span tags using BeautifulSoup and Requests

Question (votes: 0, answers: 1)

I am trying to scrape the posts on this online forum: https://csn.cancer.org/categories/prostate. All of the posts appear to be inside span tags.

I used the code below to scrape the posts.

import requests
from bs4 import BeautifulSoup as bs

url = "https://csn.cancer.org/categories/prostate"
response = requests.get(url)

soup = bs(response.text, 'html.parser')
ps = soup.find_all('span', attrs={'class': "css-yjh1h7-TruncatedText-styles-truncated"})
for p in ps:
    print(p.text)

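I get nothing back. I cannot extract anything from the spans with class "css-yjh1h7-TruncatedText-styles-truncated".

What confuses me, however, is that if I copy part of the HTML from the site and do the following: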

html ='''  <div class="css-e9znss-ListItem-styles-item"><div class="css-1flthw6-ListItem-styles-iconContainer"><div class="css-n9xrp8-ListItem-styles-icon"><div class="css-7yur4m-DiscussionList-classes-userIcon"><a aria-current="false" href="https://csn.cancer.org/profile/336163/PhoenixM" tabindex="0" data-link-type="legacy"><div class="css-1eztffh-userPhotoStyles-medium css-11wwpgq-userPhotoStyles-root isOpen"><img title="PhoenixM" alt="User: &quot;PhoenixM&quot;" height="200" width="200" src="https://w6.vanillicon.com/v2/62fd0812499add3e38f8e90eee3af967.svg" class="css-10y567c-userPhotoStyles-photo" loading="lazy"></div></a></div></div></div><div class="css-2guvez-ListItem-styles-contentContainer"><div class="css-1kxjkhx-ListItem-styles-titleContainer"><h3 class="css-glebqx-ListItem-styles-title"><a aria-current="false" href="https://csn.cancer.org/discussion/327814/brachytherapy" class="css-ojxxy9-ListItem-styles-titleLink-DiscussionList-classes-title" tabindex="0" data-link-type="legacy"><span class="css-yjh1h7-TruncatedText-styles-truncated">Brachytherapy</span></a></h3></div><div class="css-1y6ygw7-ListItem-styles-metaWrapContainer"><div class="css-5swiwf-ListItem-styles-metaDescriptionContainer"><p class="css-1ggegep-ListItem-styles-description"><span class="css-yjh1h7-TruncatedText-styles-truncated">I have recently been diagnosed with locally advanced prostate cancer. Gleason 9 stage 4a. My cancer has spread outside my prostate to a very enlarged lymph node in my pelvic region. 
I’m currently taki…</span></p><div class="css-1uyxq88-Metas-styles-root css-h3lbxm-ListItem-styles-metasContainer"><div class="css-1a607mt-Metas-styles-meta">135 views</div><div class="css-1a607mt-Metas-styles-meta">6 comments</div><div class="css-1a607mt-Metas-styles-meta">0 point</div><div class="css-1a607mt-Metas-styles-meta">Started by <a aria-current="false" href="https://csn.cancer.org/profile/336163/PhoenixM" class="css-1unw87s-Metas-styles-metaLink" tabindex="0" data-link-type="legacy">PhoenixM</a></div><div class="css-1a607mt-Metas-styles-meta">Most recent by <a aria-current="false" href="https://csn.cancer.org/profile/285710/Steve1961" class="css-1unw87s-Metas-styles-metaLink" tabindex="0" data-link-type="legacy">Steve1961</a></div><div class="css-1a607mt-Metas-styles-meta"><time datetime="2024-03-22T01:59:58+00:00" title="Thursday, March 21, 2024 at 9:59 PM">Mar 21, 2024</time></div><div class="css-1a607mt-Metas-styles-meta"><a aria-current="false" href="https://csn.cancer.org/categories/prostate" class="css-1unw87s-Metas-styles-metaLink" tabindex="0" data-link-type="legacy"> Prostate Cancer </a></div></div></div></div></div><div class="css-1pv9k2p-ListItem-styles-actionsContainer"></div></div> '''

from bs4 import BeautifulSoup as bs

# Parse the HTML with BeautifulSoup
soup = bs(html, 'html.parser')

ps = soup.find_all('span', attrs={'class': "css-yjh1h7-TruncatedText-styles-truncated"})

for p in ps:
    print(p.text)

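I am able to extract the post content. Can anyone help me understand why I cannot extract anything when I fetch the site by URL? What am I doing wrong? Thanks in advance.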

python web-scraping beautifulsoup python-requests
1 Answer
0
votes

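The problem is that the site you are trying to scrape uses JavaScript to load its content. requests only downloads the initial HTML that the server sends, and BeautifulSoup only parses that text; neither of them executes JavaScript. As a result, the spans you are looking for are not present in the HTML that BeautifulSoup is parsing. The snippet you copied works because you took it from the browser after the page's scripts had already run.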
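One way to confirm this before reaching for a browser is to check whether the class name appears in the raw markup at all. The sketch below is only an illustration: `count_class_occurrences`, `initial_html`, and `rendered_html` are hypothetical names and stand-in strings, not the real responses from the forum.

```python
import re


def count_class_occurrences(raw_html: str, class_name: str) -> int:
    """Count class attributes in raw_html that include class_name as a token."""
    count = 0
    # Look at every class="..." (or class='...') attribute in the raw markup.
    for match in re.finditer(r'class\s*=\s*(["\'])(.*?)\1', raw_html, flags=re.S):
        if class_name in match.group(2).split():
            count += 1
    return count


TARGET = "css-yjh1h7-TruncatedText-styles-truncated"

# Hypothetical stand-ins (assumptions, not captured from the site): what a
# server might send before any script runs, and what the browser's DOM
# might look like after the scripts have run.
initial_html = '<div id="app"></div><script src="/bundle.js"></script>'
rendered_html = (
    '<div id="app"><h3><a href="#">'
    f'<span class="{TARGET}">Brachytherapy</span></a></h3></div>'
)

print(count_class_occurrences(initial_html, TARGET))   # 0
print(count_class_occurrences(rendered_html, TARGET))  # 1
```

For the real page, you could pass `requests.get(url).text` as `raw_html`. A count of 0 there, while the browser's inspector shows the spans, would be consistent with the content being added by JavaScript.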

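To get around this, you can drive a real browser with a tool such as Selenium, which executes the page's JavaScript before you read the HTML. Note that Scrapy on its own does not execute JavaScript; it needs a rendering add-on such as scrapy-playwright or Splash for pages like this.

Here is an example of how to scrape the site's content with Selenium: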

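from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

SPAN_CLASS = "css-yjh1h7-TruncatedText-styles-truncated"

driver = webdriver.Chrome()
try:
    driver.get("https://csn.cancer.org/categories/prostate")

    # implicitly_wait() only affects find_element calls, so wait explicitly
    # (up to 10 seconds) until a JavaScript-rendered span is present.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, SPAN_CLASS))
    )

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    posts = soup.find_all('span', attrs={'class': SPAN_CLASS})

    for post in posts:
        print(post.text)
finally:
    # Close the browser even if the wait times out.
    driver.quit()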