Retrieving text from an XML-formatted string in Python

Question

Answer 1

Use the xml package. It is part of the standard library and easy to use, and its documentation includes a good tutorial.

import xml.etree.ElementTree as ET
text_1 = '<abstract lang="en" source="my_source" format="org"><p id="A-0001" num="none">My text is here </p><img file="Uxx.md" /></abstract>'
root = ET.fromstring(text_1)

You can then access the data:

print(root.tag, root.attrib)
for child in root:
    # Print each child's own attributes, not the root's.
    print(child.tag, child.attrib)

abstract {'lang': 'en', 'source': 'my_source', 'format': 'org'}
p {'id': 'A-0001', 'num': 'none'}
img {'file': 'Uxx.md'}

Edit: To see the text of the <p> element:

root[0].text

'My text is here '
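
Indexing with root[0] depends on <p> being the first child. As a minimal sketch (the variable names p_text and all_text are illustrative, not part of the original answer), the element can instead be looked up by tag with find(), and itertext() can collect every text node under the root:

```python
import xml.etree.ElementTree as ET

text_1 = '<abstract lang="en" source="my_source" format="org"><p id="A-0001" num="none">My text is here </p><img file="Uxx.md" /></abstract>'
root = ET.fromstring(text_1)

# Look the element up by tag instead of by position.
p = root.find("p")
p_text = p.text if p is not None else None

# Concatenate every text node under root, in document order.
all_text = "".join(root.itertext())

print(repr(p_text))    # 'My text is here '
print(repr(all_text))  # 'My text is here '
```

For this input both values are the same, because <p> is the only element that carries text.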

You can also use help() to get information about the members of root and child (both are Element objects).

help(root)

class Element(builtins.object)
 |  Methods defined here:
 |
 |  __copy__(self, /)
 |
 |  __deepcopy__(self, memo, /)
 |
 ...
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  attrib
 |      A dictionary containing the element's attributes
 |
 |  tag
 |      A string identifying what kind of data this element represents
 |
 |  tail
 |      A string of text directly after the end tag, or None
 |
 |  text
 |      A string of text directly after the start tag, or None
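
The difference between text and tail in the help output is easiest to see with mixed content. The sketch below uses a hypothetical sample string that is not from the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content sample, not taken from the question.
sample = "<p>Hello <b>bold</b> world</p>"
p = ET.fromstring(sample)
b = p[0]

print(repr(p.text))  # 'Hello '  (text right after the <p> start tag)
print(repr(b.text))  # 'bold'    (text right after the <b> start tag)
print(repr(b.tail))  # ' world'  (text right after the </b> end tag)

# Reassemble the full sentence from the three pieces.
full = p.text + b.text + b.tail
print(full)  # Hello bold world
```

So p.text alone would miss " world", which is stored as the tail of the nested <b> element.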

Answer 2

Your expected output is not clear, but in any case you probably want findtext with ElementTree:

import xml.etree.ElementTree as ET

# text_1 and text_2 are the XML-formatted strings to parse (text_2 is not shown in this thread).
xmls = [text_1, text_2]

texts = [ET.fromstring(x).findtext("p").strip() for x in xmls]

Alternatively, using BeautifulSoup:

# pip install beautifulsoup4 lxml  (the "lxml" parser requires the lxml package)
from bs4 import BeautifulSoup

texts = [BeautifulSoup(x, "lxml").text.strip() for x in xmls]

Output:

print(texts) # ['My text is here', 'Another text.']
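
Note that findtext("p") returns None when an element has no <p> child, so the .strip() call above would raise AttributeError for such input. A minimal sketch of one way to guard against that, using the default parameter of findtext with hypothetical input strings:

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs: the second string has no <p> child.
xmls = [
    '<abstract><p>My text is here </p></abstract>',
    '<abstract><img file="Uxx.md" /></abstract>',
]

# findtext returns `default` when no element matches the path,
# so .strip() always receives a string.
texts = [ET.fromstring(x).findtext("p", default="").strip() for x in xmls]
print(texts)  # ['My text is here', '']
```

Whether an empty string is the right placeholder depends on what the caller expects; filtering out the missing entries is an equally reasonable choice.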

Answer 3

You can use the xmltodict module:

pip install xmltodict

Then use it to convert an XML-formatted string into a dictionary:

import xmltodict
xmltodict.parse(xml_string)  # xml_string: a single XML-formatted str
