Extracting text from an XML document using Python ElementTree

Question

I have an XML document in the following format:

<samples>
   <sample count="10" intentref="none">
      Remember to
      <annotation conceptref="cf1">
         <annotation conceptref="cf2">record</annotation>
      </annotation>
      the
      <annotation conceptref="cf3">movie</annotation>
      <annotation conceptref="cf4">Taxi driver</annotation>
   </sample>
</samples>

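I would like to extract all of the text, both the text that is not wrapped in annotation tags and the text inside them, in order to reconstruct the original phrase. So my output would be -> Remember to record the movie Taxi driver

The problem is that I apparently cannot get the token 'the'. Here is a snippet of my code: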

import xml.etree.ElementTree as ET

samples = ET.fromstring("""
 <samples>
 <sample count="10" intentref="none">Remember to<annotation conceptref="cf1"><annotation conceptref="cf2">record</annotation></annotation>the<annotation conceptref="cf3">movie</annotation><annotation conceptref="cf4">Taxi driver</annotation></sample>
 </samples>
""")

for sample in samples.iter("sample"):
    print('***' + sample.text + '***' + sample.tail)
    for annotation in sample.iter('annotation'):
        print(annotation.text)
        # Element.getchildren() was removed in Python 3.9; iterate the element directly
        for nested_annotation in annotation:
            print(nested_annotation.text)

I thought the nested annotation loop would do the trick... but no, this is the result:

***Remember to***

None
record
record
movie
Taxi driver

Answer 1

You are close. I would do it like this:

import xml.etree.ElementTree as ET


samples = ET.fromstring("""<samples>
   <sample count="10" intentref="none">
      Remember to
      <annotation conceptref="cf1">
         <annotation conceptref="cf2">record</annotation>
      </annotation>
      the
      <annotation conceptref="cf3">movie</annotation>
      <annotation conceptref="cf4">Taxi driver</annotation>
   </sample>
</samples>
""")


# './/*' selects every descendant element of the root, in document order
for element in samples.findall('.//*'):
    text = element.text if element.text else ''
    tail = element.tail if element.tail else ''
    print(text + tail)

This will give you:


      Remember to




      the

record

movie

Taxi driver

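You may notice that the word order is not what you want: each element's tail is printed right after its own text, before the text of its children. You can fix this by remembering the elements that have both text and a tail, and inserting the tail only after that element's children have been handled. I am not sure whether this is the right approach, though.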
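
One way to get the tail in the right place is to walk the tree recursively: emit an element's text, then its children, then its tail. The following is only a minimal sketch of that idea. The helper name collect_text and its include_tail parameter are made up for illustration and are not part of ElementTree.

```python
import xml.etree.ElementTree as ET


def collect_text(element, include_tail=False):
    """Return the text fragments under `element` in document order.

    Hypothetical helper: the tail of an element is appended only after
    all of its children have been visited.
    """
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        # A child's tail belongs to the parent's content, so include it here.
        parts.extend(collect_text(child, include_tail=True))
    if include_tail and element.tail:
        parts.append(element.tail)
    return parts


root = ET.fromstring("""<samples>
   <sample count="10" intentref="none">
      Remember to
      <annotation conceptref="cf1">
         <annotation conceptref="cf2">record</annotation>
      </annotation>
      the
      <annotation conceptref="cf3">movie</annotation>
      <annotation conceptref="cf4">Taxi driver</annotation>
   </sample>
</samples>
""")

sample = root.find("sample")
# Join the fragments, then collapse runs of whitespace into single spaces.
phrase = " ".join("".join(collect_text(sample)).split())
print(phrase)
```

Here the token 'the' comes from the tail of the outer annotation element (cf1), which is why reading only .text never finds it.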

Answer 2

I think you are looking for the itertext method:

itertext

Full code:

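# Iterate over every <sample> element (`samples` is the root parsed with ET.fromstring above)
for sample in samples.iter('sample'):
    # itertext() yields each text and tail string in document order
    print(''.join(sample.itertext()))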
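
For completeness, here is a self-contained sketch of the itertext approach using only the standard library. It assumes the pretty-printed document from the question, so the joined text contains newlines and indentation that need to be collapsed. The variable names are just illustrative.

```python
import xml.etree.ElementTree as ET

XML = """<samples>
   <sample count="10" intentref="none">
      Remember to
      <annotation conceptref="cf1">
         <annotation conceptref="cf2">record</annotation>
      </annotation>
      the
      <annotation conceptref="cf3">movie</annotation>
      <annotation conceptref="cf4">Taxi driver</annotation>
   </sample>
</samples>
"""

root = ET.fromstring(XML)

phrases = []
for sample in root.iter("sample"):
    # itertext() walks text and tail strings of all descendants in document order
    raw = "".join(sample.itertext())
    # str.split() with no arguments drops all whitespace runs, including newlines
    phrases.append(" ".join(raw.split()))

print(phrases)
```

Note that with the single-line version of the XML from the question there is no whitespace between the tags and the surrounding words, so the fragments would be joined without spaces. In that case joining the stripped fragments with " ".join(...) may be a better fit.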
