You can use the xml package. It is part of the standard library and easy to use, and its documentation includes a good tutorial.
import xml.etree.ElementTree as ET
text_1 = '<abstract lang="en" source="my_source" format="org"><p id="A-0001" num="none">My text is here </p><img file="Uxx.md" /></abstract>'
root = ET.fromstring(text_1)
You can then access the data:
print(root.tag, root.attrib)
for child in root:
    print(child.tag, child.attrib)
abstract {'lang': 'en', 'source': 'my_source', 'format': 'org'}
p {'id': 'A-0001', 'num': 'none'}
img {'file': 'Uxx.md'}
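If you only need one specific element or attribute rather than a loop over every child, `find`, `get` and `iter` may be more convenient. A minimal sketch reusing the sample string from above; the variable names `p_id`, `missing` and `tags` are only illustrative:

```python
import xml.etree.ElementTree as ET

text_1 = '<abstract lang="en" source="my_source" format="org"><p id="A-0001" num="none">My text is here </p><img file="Uxx.md" /></abstract>'
root = ET.fromstring(text_1)

# find() returns the first direct child with the given tag, or None if absent
p = root.find("p")
p_id = p.get("id")

# get() accepts a fallback value for attributes that are not present
missing = p.get("lang", "n/a")

# iter() walks the whole tree in document order, starting with the root itself
tags = [el.tag for el in root.iter()]

print(p_id, missing, tags)
```

Note that `find` only looks at direct children unless you pass a path such as `".//p"`.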
Edit: to view the text of the <p> element:
root[0].text
'My text is here '
You can also get information about the members of root and child (both are Element instances) with help().
help(root)
class Element(builtins.object)
| Methods defined here:
|
| __copy__(self, /)
|
| __deepcopy__(self, memo, /)
|
...
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| attrib
| A dictionary containing the element's attributes
|
| tag
| A string identifying what kind of data this element represents
|
| tail
| A string of text directly after the end tag, or None
|
| text
| A string of text directly after the start tag, or None
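The difference between `text` and `tail` in the help output is easy to miss, so here is a small sketch. The XML snippet is made up for illustration and is not from the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: "after" follows </b>, so it is stored as b's tail
elem = ET.fromstring("<a>before<b>inside</b>after</a>")
b = elem[0]

a_text = elem.text   # text directly after <a>
b_text = b.text      # text directly after <b>
b_tail = b.tail      # text directly after </b>
a_tail = elem.tail   # nothing follows </a>, so this is None

# itertext() yields every text and tail fragment inside the element, in order
full_text = "".join(elem.itertext())

print(repr(a_text), repr(b_text), repr(b_tail), a_tail, repr(full_text))
```

So if an element mixes text and child elements, `text` alone will not give you everything; `itertext()` is one way to collect the rest.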
Using findtext with ElementTree:
import xml.etree.ElementTree as ET
# text_2 is assumed to be a second XML string with the same structure as text_1
xmls = [text_1, text_2]
texts = [ET.fromstring(x).findtext("p").strip() for x in xmls]
Alternatively, with BeautifulSoup:
# pip install beautifulsoup4 lxml  (the "lxml" parser used below needs the lxml package)
from bs4 import BeautifulSoup
texts = [BeautifulSoup(x, "lxml").text.strip() for x in xmls]
Output:
print(texts) # ['My text is here', 'Another text.']
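For a version of the findtext approach that runs on its own, here is a rough sketch. `text_2` and `text_3` are hypothetical strings made up for this example, and `first_paragraph` is just an illustrative helper name. It also passes a default to `findtext`, which would otherwise return None for a document without a <p> element and make `.strip()` fail:

```python
import xml.etree.ElementTree as ET

text_1 = '<abstract lang="en" source="my_source" format="org"><p id="A-0001" num="none">My text is here </p><img file="Uxx.md" /></abstract>'
# Hypothetical second document with the same structure
text_2 = '<abstract lang="en"><p id="A-0002" num="none"> Another text. </p></abstract>'
# Hypothetical document with no <p> element at all
text_3 = '<abstract lang="en"><img file="Uxx.md" /></abstract>'


def first_paragraph(xml_string, default=""):
    """Return the stripped text of the first <p> child, or default if there is none."""
    root = ET.fromstring(xml_string)
    # The second positional argument is the value returned when no match is found
    return root.findtext("p", default).strip()


texts = [first_paragraph(x) for x in (text_1, text_2, text_3)]
print(texts)
```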
You can use the xmltodict module:
pip install xmltodict
Then use it to convert an XML string into a dictionary:
import xmltodict
xmltodict.parse(xml_strings)