Get text from an XML tag and its optional child elements using ElementTree in Python

Question

I have an XML document (saved on my drive):

xml="""
<?xml version="1.0"?>
<front>
<z id="37">some text sitting here</z>
<label>&#26;</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z> 
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""

I want to extract all of the text so that I can store it in a DataFrame like this:

ID	TEXT
37	some text sitting here
38	Another sentence to read.
...	...
40	This document is called 101...help me! Stcks guide to xml text

I used the following to generate the DataFrame, but it misses the text located inside the nested tags:

import xml.etree.ElementTree as ET

file = ('[my_file_location.xml')
tree = ET.parse(file)
root = tree.getroot()

xmltext = []

for z in root.iter('z'):
    txt = z.text
    xmltext.append(txt)

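I can obviously get the "some text sitting here" and "Another sentence to read." elements, but I can't get any text from the elements nested inside the p tag, i.e. std-a, std-b, etc.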

Answer 1

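The simplest approach in this case is to use itertext() to get the text. It yields the text of the element itself and of all its descendants, in document order.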

import xml.etree.ElementTree as ET
from pprint import pprint

xml = """<front>
<z id="37">some text sitting here</z>
<label>foo</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z> 
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
root = ET.fromstring(xml)

xmltext = []

for z in root.iter('z'):
    txt = "".join(z.itertext())
    xmltext.append(txt)

pprint(xmltext)

Printed output:

['some text sitting here',
 'Another sentence to read.',
 'The contents of a document.',
 'This document is called 101...help me!, Stcks guide to xml text. ']
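For context on why z.text alone misses the nested text: ElementTree stores only the text before the first child in .text, and the text following each child's closing tag in that child's .tail. The sketch below illustrates this on a shortened, hypothetical copy of the id="40" element (the snippet string and variable names are made up for the example, not taken from the original file).

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative copy of the <z id="40"> element from the question.
snippet = (
    '<z id="40">This document is called '
    '<std><std-ref>help me!</std-ref></std>, '
    '<italic>a guide</italic>. </z>'
)
z = ET.fromstring(snippet)

# .text only covers the text before the first child element.
print(repr(z.text))

# Text that follows a child's closing tag is stored on that child as .tail.
std, italic = list(z)
print(repr(std.tail))
print(repr(italic.tail))

# itertext() walks the text and tail of every descendant in document order,
# so joining its pieces gives the full visible text of the element.
full_text = "".join(z.itertext())
print(repr(full_text))
```

This is why the loop in the question returns only the leading fragment for elements that contain child tags, while joining itertext() returns everything.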

Answer 2

You can combine iterparse() with itertext(), as described by @Daniel Haley, and load the result into a pandas DataFrame if needed.


import pandas as pd
import xml.etree.ElementTree as ET
from io import StringIO

xml = """<front>
<z id="37">some text sitting here</z>
<label>foo</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z> 
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
f = StringIO(xml)

data = []
columns = ['id', 'TEXT']

# Only 'end' events are needed: an element's text and children are complete by then.
for event, elem in ET.iterparse(f, events=('end',)):
    elem_id = elem.get('id')
    # Keep only elements with a numeric id (this skips <sec id="sec-introduction">)
    if elem_id is not None and elem_id.isnumeric() and elem.text:
        # itertext() also yields the text of nested child elements
        txt = "".join(elem.itertext())
        print(elem_id, txt)
        data.append((elem_id, txt))

print()           
df = pd.DataFrame(data, columns=columns)
print(df.to_string(index=False))

Output:

37 some text sitting here
38 Another sentence to read.
39 The contents of a document.
40 This document is called 101...help me!, Stcks guide to xml text. 

id                                                              TEXT
37                                            some text sitting here
38                                         Another sentence to read.
39                                       The contents of a document.
40 This document is called 101...help me!, Stcks guide to xml text.
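If the streaming parser is not needed, the same id/text rows can probably be built with a plain root.iter('z') loop. The following is a minimal sketch using only the standard library; the shortened XML string, the rows name, and the whitespace normalization step are assumptions added for illustration and are not part of either answer.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document from the question.
xml = """<front>
<z id="37">some text sitting here</z>
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id>101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
root = ET.fromstring(xml)

rows = []
for z in root.iter('z'):
    # Join the text of the element and all of its descendants,
    # then collapse runs of whitespace and trim both ends.
    text = " ".join("".join(z.itertext()).split())
    rows.append({'id': z.get('id'), 'TEXT': text})

for row in rows:
    print(row['id'], row['TEXT'])
```

A list of dictionaries like rows can be passed directly to pd.DataFrame(rows) to get the id and TEXT columns. Selecting on the tag name z rather than on a numeric id is a design choice here; it avoids the isnumeric() check but assumes every element of interest is a z element.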
