Avoid the image div when parsing the description tag

Question (votes: 0, answers: 1)

I am parsing an RSS feed with this code:

import requests
from bs4 import BeautifulSoup

resp = requests.get(url)  # url holds the RSS feed address
soup = BeautifulSoup(resp.content, features="xml")
items = soup.find_all('item')

news_items = []
for item in items:
    news_item={}
    news_item['title']=item.title.text
    news_item['description']=item.description.text
    news_item['link']=item.link.text
    news_item['pubDate']=item.pubDate.text
    news_items.append(news_item)

Inside the description tag there is a div that wraps an img src:

<description>
<![CDATA[ <div><img src="https://library.sportingnews.com/styles/twitter_card_120x120/s3/2023-11/nba-plain--358f0d81-148e-4590-ba34-3164ea0c87eb.png?itok=fG5f5Dwa" style="width: 100%;" /><div>Now back from his foot injury and ready to continue his Golden Boot charge, Erling Haaland looks to return in full as Man City visit Brentford in a Monday Premier League matinee.</div></div> ]]>
</description>

Is there any way to retrieve everything in the description tag except the image div? Thanks.

python parsing tags rss rss-reader
1 answer

0 votes

Yes, you can. Take a look at this code:

from bs4 import BeautifulSoup
import requests

url = "your_rss_feed_url_here"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, features="xml")
items = soup.find_all('item')

news_items = []
for item in items:
    news_item = {}
    news_item['title'] = item.title.text
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    

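    # item.description.text is the raw CDATA string, so parse it again as HTML
    description_content = BeautifulSoup(item.description.text, "html.parser")
    # Remove the img tag (find() returns only the first match, or None)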
    img_tag = description_content.find('img')
    if img_tag:
        img_tag.decompose()
    
    # Assuming you want to keep the rest of the content as HTML
    news_item['description'] = str(description_content)
    
    # If you want to convert the HTML to text, use .text instead
    # news_item['description'] = description_content.text
    
    news_items.append(news_item)
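
The same idea can be tried without fetching a feed. The following is a minimal sketch, not a definitive implementation: `strip_images` is a hypothetical helper name, `SAMPLE_DESCRIPTION` is a shortened stand-in for the CDATA payload in the question, and the `example.com` image URL is a placeholder. It uses `find_all` rather than `find`, on the assumption that a description may contain more than one image.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the CDATA payload from the question (placeholder URL).
SAMPLE_DESCRIPTION = (
    ' <div><img src="https://example.com/card.png" style="width: 100%;" />'
    "<div>Erling Haaland looks to return in full as Man City visit Brentford.</div></div> "
)


def strip_images(description_html):
    """Return (html_without_images, plain_text) for one description payload."""
    fragment = BeautifulSoup(description_html, "html.parser")
    # find_all covers descriptions that embed more than one image.
    for img in fragment.find_all("img"):
        img.decompose()
    html_without_images = str(fragment).strip()
    # Join text nodes with a space and trim surrounding whitespace.
    plain_text = fragment.get_text(" ", strip=True)
    return html_without_images, plain_text


cleaned_html, cleaned_text = strip_images(SAMPLE_DESCRIPTION)
print(cleaned_html)
print(cleaned_text)
```

If only the plain text is needed, the `decompose()` step is optional, because an `img` element has no text of its own. It matters when the remaining HTML is kept.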