I am using this code to parse an RSS feed:
import requests
from bs4 import BeautifulSoup

resp = requests.get(url)
soup = BeautifulSoup(resp.content, features="xml")
soup.prettify()
items = soup.find_all('item')
news_items = []
for item in items:
    news_item = {}
    news_item['title'] = item.title.text
    news_item['description'] = item.description.text
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    news_items.append(news_item)
The description tag contains a div that holds an img src:
<description>
<![CDATA[ <div><img src="https://library.sportingnews.com/styles/twitter_card_120x120/s3/2023-11/nba-plain--358f0d81-148e-4590-ba34-3164ea0c87eb.png?itok=fG5f5Dwa" style="width: 100%;" /><div>Now back from his foot injury and ready to continue his Golden Boot charge, Erling Haaland looks to return in full as Man City visit Brentford in a Monday Premier League matinee.</div></div> ]]>
</description>
Is there any way I can retrieve everything in the description tag except the image div? Thanks.
Yes, you can. See the following code:
from bs4 import BeautifulSoup
import requests

url = "your_rss_feed_url_here"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, features="xml")
items = soup.find_all('item')
news_items = []
for item in items:
    news_item = {}
    news_item['title'] = item.title.text
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    # The CDATA payload is HTML, so parse it a second time as HTML
    description_content = BeautifulSoup(item.description.text, "html.parser")
    # Remove the img tag
    img_tag = description_content.find('img')
    if img_tag:
        img_tag.decompose()
    # Assuming you want to keep the rest of the content as HTML
    news_item['description'] = str(description_content)
    # If you want to convert the HTML to text, use .text instead
    # news_item['description'] = description_content.text
    news_items.append(news_item)
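
If you only need the plain text of the description, here is a rough alternative that swaps BeautifulSoup for the standard library's html.parser.HTMLParser, so it can be tried without a feed URL. It relies on the fact that an img tag carries no text, so collecting text nodes skips it automatically. The names TextOnly and description_text and the shortened sample string are made up for illustration; they are not from the original code or feed.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes; tags such as <img> carry no text and are skipped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; ignore pure whitespace.
        if data.strip():
            self.chunks.append(data.strip())


def description_text(html):
    """Return the text content of an HTML fragment, joined with spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end
    return " ".join(parser.chunks)


# Shortened, hypothetical stand-in for the CDATA payload in the question.
sample = (
    ' <div><img src="https://example.com/a.png" style="width: 100%;" />'
    "<div>Erling Haaland looks to return in full.</div></div> "
)
print(description_text(sample))
```

In the loop above this would be used as news_item['description'] = description_text(item.description.text). If you need to keep the remaining HTML markup rather than just the text, the decompose() approach is the better fit.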