Beautiful Soup parsing in Django

Question

I need to get data (a string) from an XML file that contains a description tag. I have:

<description><img src="https://www.somepicture.jpeg" align="left" hspace="8" width="400" height="200" /> DESCRIPTION TEXT I WANT TO PARSE </description>

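I'm using BeautifulSoup4 and Django. Earlier in the code I create a soup and parse each item from it. If I try "item.description.text", I get the img tag as well. How can I strip it out and get just the description text?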

Edit: I need to save this parsed text in the database. Like this:

for item in items: 
  tagA = item.tagA.text
  tagB = item.tagB.text
  description = item.description.text  # <--- here's the parsed text that I need, without the img tag
  model = MyModel.objects.create(tag_a_field=tagA, tag_b_field=tagB, description_field=description)
  model.save()

Thanks

Answer 1

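The problem here is that the img part is text. It is escaped in the source, so it is part of the description's content, which is why BeautifulSoup does not parse it as an HTML tag.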
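A quick way to confirm this is to look for an img element in the parsed tree. This is only a minimal sketch, assuming the feed escapes the image tag as `&lt;img ... /&gt;` as in the sample used in the answers; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

xml = ('<description>&lt;img src="https://www.somepicture.jpeg" /&gt; '
       'DESCRIPTION TEXT I WANT TO PARSE </description>')
soup = BeautifulSoup(xml, 'html.parser')

# The markup was escaped in the source, so the tree holds no <img> element.
img_tag = soup.description.find('img')

# The unescaped "<img ...>" shows up only as literal characters in the text.
description_text = soup.description.get_text()

print(img_tag is None)
print('<img' in description_text)
```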

A naive way to solve the problem is to parse that text a second time:

from bs4 import BeautifulSoup

html = '<description>&lt;img src="https://www.somepicture.jpeg" align="left" hspace="8" width="400" height="200" /&gt; DESCRIPTION TEXT I WANT TO PARSE </description>'
soup = BeautifulSoup(html, 'html.parser')
# The escaped <img> is plain text at this point, so parse the text once more.
description_soup = BeautifulSoup(soup.description.text, 'html.parser')
description_soup.text
# ' DESCRIPTION TEXT I WANT TO PARSE '

In your case (going by the little information provided) you could write something like this:

for item in items:
    tagA = item.tagA.text
    tagB = item.tagB.text
    description_soup = BeautifulSoup(item.description.text, 'html.parser')
    description = description_soup.text
    MyModel.objects.create(tag_a_field=tagA, tag_b_field=tagB, description_field=description)
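If the same cleanup is needed in several places, the second parse can be wrapped in a small helper. The sketch below is an illustration only: `clean_description` is a hypothetical name, and trimming the surrounding whitespace and returning an empty string for a missing value are assumptions about what should be stored.

```python
from bs4 import BeautifulSoup


def clean_description(raw_text):
    """Return raw_text with embedded HTML markup removed and whitespace trimmed."""
    if not raw_text:
        # Assumption: an empty or missing description is stored as ''.
        return ''
    # Parse the text a second time so the embedded <img> becomes a real tag,
    # then keep only the text content.
    return BeautifulSoup(raw_text, 'html.parser').get_text().strip()


raw = '<img src="https://www.somepicture.jpeg" /> DESCRIPTION TEXT I WANT TO PARSE '
print(clean_description(raw))  # DESCRIPTION TEXT I WANT TO PARSE
```

Inside the loop this would be used as `description = clean_description(item.description.text)` before calling `MyModel.objects.create(...)`.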

Answer 2

You can try this:

from bs4 import BeautifulSoup
html_doc = '<description>&lt;img src="https://www.somepicture.jpeg" align="left" hspace="8" width="400" height="200" /&gt; DESCRIPTION TEXT I WANT TO PARSE </description>'
soup = BeautifulSoup(html_doc, 'html.parser')
inner_soup = BeautifulSoup(soup.description.text, 'html.parser')
print(inner_soup.img.next_sibling)
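Note that `inner_soup.img` is `None` when a description contains no image, so `inner_soup.img.next_sibling` would raise an `AttributeError` for such items. A guarded variant might look like the sketch below; `text_after_img` is a hypothetical helper, and falling back to the full text when there is no img tag is an assumption about the desired behaviour.

```python
from bs4 import BeautifulSoup, NavigableString


def text_after_img(raw_text):
    """Return the text that directly follows the first <img>, or all text if there is none."""
    inner_soup = BeautifulSoup(raw_text, 'html.parser')
    if inner_soup.img is None:
        # Assumption: with no image present, the whole text is the description.
        return inner_soup.get_text().strip()
    sibling = inner_soup.img.next_sibling
    # Only a text node counts; another tag (or nothing) after the image yields ''.
    if isinstance(sibling, NavigableString):
        return str(sibling).strip()
    return ''


print(text_after_img('<img src="https://www.somepicture.jpeg" /> DESCRIPTION TEXT I WANT TO PARSE '))
```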
