I have an XML document (saved on my drive):
xml="""
<?xml version="1.0">
<front>
<z id="37">some text sitting here</z>
<label></label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z>
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z></front>
"""
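I want to extract all the text elements and store them in a dataframe similar to this:
id | text |
---|---|
37 | some text sitting here |
38 | Another sentence to read. |
... | ... |
40 | This document is called 101...help me! Stcks guide to xml text |
I used the following to generate a dataframe, but it misses the text that sits inside the nested tags: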
import xml.etree.ElementTree as ET
file = 'my_file_location.xml'  # placeholder path
tree = ET.parse(file)
root = tree.getroot()
xmltext = []
for z in root.iter('z'):
    txt = z.text
    xmltext.append(txt)
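I can obviously get the "some text sitting here" and "Another sentence to read." elements, but I cannot get any text from the elements nested inside the z tags, i.e. std-id, std-ref, etc.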
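The simplest way in this case is to use itertext() to get the text. It yields the element's own text plus the text of all its descendants, in document order.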
import xml.etree.ElementTree as ET
from pprint import pprint
xml = """<front>
<z id="37">some text sitting here</z>
<label>foo</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z>
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
root = ET.fromstring(xml)
xmltext = []
for z in root.iter('z'):
    txt = "".join(z.itertext())
    xmltext.append(txt)
pprint(xmltext)
This prints:
['some text sitting here',
'Another sentence to read.',
'The contents of a document.',
'This document is called 101...help me!, Stcks guide to xml text. ']
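To see why z.text alone falls short, here is a small sketch on a shortened, hypothetical one-element sample (the `snippet` string is made up for illustration, modelled on the z id="40" element). In ElementTree, .text holds only the text before the first child, and the text after a child's closing tag is stored on that child as .tail.

```python
import xml.etree.ElementTree as ET

# Hypothetical shortened sample, modelled on the z id="40" element.
snippet = '<z id="40">This is <std><std-ref>help me!</std-ref></std>, <italic>a guide</italic>. </z>'
z = ET.fromstring(snippet)

# .text stops at the first child element.
direct_text = z.text

# Text that follows a child's closing tag lives on that child as .tail.
tails = [child.tail for child in z]

# itertext() walks every text and tail in document order.
full_text = "".join(z.itertext())

print(repr(direct_text))
print(tails)
print(repr(full_text))
```

So the original loop was only ever reading the first fragment of each element, which is why the simple z elements worked and the one with nested tags did not.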
You can use iterparse() together with itertext(), as described by @Daniel Haley, and add pandas if needed.
import pandas as pd
import xml.etree.ElementTree as ET
from io import StringIO
xml = """<front>
<z id="37">some text sitting here</z>
<label>foo</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z>
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
f = StringIO(xml)
data = []
columns = ['id', 'TEXT']
for event, elem in ET.iterparse(f, events=('start', 'end', 'comment', 'pi')):
    #print(event, elem.tag, elem.attrib, elem.text, elem.tail)
    if elem.get('id') is not None and event == 'end':
        if elem.get('id').isnumeric() and elem.text:
            # I gave @Daniel a vote
            txt = "".join(elem.itertext())
            print(elem.get('id'), txt)
            row = elem.get('id'), txt
            data.append(row)
print()
df = pd.DataFrame(data, columns=columns)
print(df.to_string(index=False))
Output:
37 some text sitting here
38 Another sentence to read.
39 The contents of a document.
40 This document is called 101...help me!, Stcks guide to xml text.
id TEXT
37 some text sitting here
38 Another sentence to read.
39 The contents of a document.
40 This document is called 101...help me!, Stcks guide to xml text.
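If the document fits in memory, a plain root.iter('z') loop may be enough to collect the same (id, text) pairs. The sketch below is one possible variant, not the method from either answer: `collect_rows` is a hypothetical helper name, the sample XML is trimmed, and the whitespace normalization (which drops the trailing space seen in the id 40 row) is an added assumption about what you want in the dataframe.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the sample document, for illustration only.
xml = """<front>
<z id="37">some text sitting here</z>
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id>101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""


def collect_rows(root, tag="z"):
    """Return (id, text) tuples for every `tag` element under root."""
    rows = []
    for elem in root.iter(tag):
        # Join all nested text, then collapse whitespace runs and trim the ends.
        text = " ".join("".join(elem.itertext()).split())
        rows.append((elem.get("id"), text))
    return rows


rows = collect_rows(ET.fromstring(xml))
for row in rows:
    print(row)
```

The resulting list can be passed to pd.DataFrame(rows, columns=['id', 'TEXT']) in the same way as the data list in the iterparse() version.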