lxml: getting the text of elements that have no tag of their own


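I am using the lxml library with Python to parse a simple XML file. In this example I want to print the text of the elements that follow each HD element, as in the XML below.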

<BOOK>
   <HD>The Best Book Ever</HD>
   <HD>Table of Contents</HD>
   <EXTRACT>
      <TC>I. Introduction</TC>
      <TC>II. Summary</TC>
      <TC>III. Topic 1</TC>
      <TC>IV. Topic 2</TC>
   </EXTRACT>
   <HD>I. Introduction</HD>
   <p>
      Lorem Ipsum is simply dummy text of the printing and typesetting industry.
      <FTN>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget</FTN>
   </p>
   <p>has been the industry standard dummy text ever since the 1500s</p>
   <HD>II. Summary</HD>
   <p>
       <FT>data 1</FT>
       data 2
      <FT>data 3</FT>
   </p>
    <p>
       <FT>data 4</FT>
       data 5
      <FT>data 6</FT>
   </p>
   <p>has been the industry standard dummy text ever since the 1500s</p>
   <HD>III. Topic 1</HD>
   <p>
      something
      <p>something else</p>
   </p>
   <HD>IV. Topic 2</HD>
   <p>
      something1
      <p>something else 1</p>
   </p>
   <p>
      something 2
      <p>something else 2</p>
   </p>
   <HD>V. Topic 3</HD>
   <p>
      something not to show up
      <p>because not in EXTRACT as TC</p>
   </p>
</BOOK>

My Python code is below. It should print everything that follows each HD tag.

import os
from lxml import etree

file_name = 'demofile2.xml'
full_file_name = os.path.abspath(os.path.join('', file_name))

def load_local_file(filename):
    dom = etree.parse(filename)

    #get all content of elements after HD tag
    TOCsHD = dom.getroot().findall('HD')
    for hd in TOCsHD:
        text = hd.text
        print(text)
        for x in hd.getnext().iter():
            print(x.text)
            print(x.tail)
        print("------------------------------")


load_local_file(full_file_name)

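My output is below. As you can see, for "II. Summary", for example, it does not print data 4, data 5, or data 6. Can someone help me fix this? Thank you very much!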

The Best Book Ever
Table of Contents

   
------------------------------
Table of Contents

      

   
I. Introduction

      
II. Summary

      
III. Topic 1

      
IV. Topic 2

   
------------------------------
I. Introduction

      Lorem Ipsum is simply dummy text of the printing and typesetting industry.
      

   
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget

   
------------------------------
II. Summary

       

    
data 1

       data 2
      
data 3

   
------------------------------
III. Topic 1

      something
      

   
something else

   
------------------------------
IV. Topic 2

      something1
      

   
something else 1

   
------------------------------
V. Topic 3

      something not to show up
      


because not in EXTRACT as TC

   
------------------------------
python xml-parsing lxml
1 Answer

I'm not sure, but my guess is that what you need is itersiblings:

import os
from lxml import etree

file_name = 'demofile2.xml'
full_file_name = os.path.abspath(os.path.join('', file_name))

def load_local_file(filename):
    dom = etree.parse(filename)

    # print every following sibling of each HD tag
    TOCsHD = dom.getroot().findall('HD')
    for hd in TOCsHD:
        print("Siblings of: " + hd.text)
        theIter = hd.itersiblings()
        for x in theIter:
            print(x.tag, "".join(x.itertext()).strip().replace("\n", ""), sep=": ")
        print("------------------------------")

load_local_file(full_file_name)

I'm not sure whether this is the result you are looking for, but if you are interested in a tag's siblings, this function will work.

Output

Siblings of: The Best Book Ever
HD: Table of Contents
EXTRACT: I. Introduction      II. Summary      III. Topic 1      IV. Topic 2
HD: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry.      Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1       data 2      data 3
p: data 4       data 5      data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something      something else
HD: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: Table of Contents
EXTRACT: I. Introduction      II. Summary      III. Topic 1      IV. Topic 2
HD: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry.      Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1       data 2      data 3
p: data 4       data 5      data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something      something else
HD: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry.      Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1       data 2      data 3
p: data 4       data 5      data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something      something else
HD: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: II. Summary
p: data 1       data 2      data 3
p: data 4       data 5      data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something      something else
HD: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: III. Topic 1
p: something      something else
HD: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
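
One possible reading of the question is that each heading should be paired only with the content up to the next HD. getnext() returns just the single next sibling, which is why the second p under "II. Summary" never showed up in the question's output, while itersiblings() without a stop condition runs to the end of the document. The sketch below stops at the next HD. Note that sections_by_heading is a hypothetical helper name (not part of lxml), and the XML is a trimmed copy of the question's sample, inlined so the snippet runs on its own.

```python
from lxml import etree

# Trimmed copy of the question's XML, inlined so the sketch is self-contained.
XML = b"""<BOOK>
   <HD>I. Introduction</HD>
   <p>intro text</p>
   <HD>II. Summary</HD>
   <p>
       <FT>data 1</FT>
       data 2
      <FT>data 3</FT>
   </p>
   <p>
       <FT>data 4</FT>
       data 5
      <FT>data 6</FT>
   </p>
</BOOK>"""


def sections_by_heading(root):
    """Map each HD heading to the text of its siblings up to the next HD.

    Hypothetical helper for illustration; it assumes heading texts are unique.
    """
    sections = {}
    for hd in root.findall("HD"):
        paragraphs = []
        for sibling in hd.itersiblings():
            if sibling.tag == "HD":
                break  # the next heading starts a new section
            # itertext() yields the text of the element and all descendants;
            # split() + join collapses newlines and runs of spaces.
            paragraphs.append(" ".join("".join(sibling.itertext()).split()))
        sections[hd.text] = paragraphs
    return sections


sections = sections_by_heading(etree.fromstring(XML))
for heading, paragraphs in sections.items():
    print(heading)
    for text in paragraphs:
        print("   ", text)
```

If the headings should also be limited to those listed as TC entries inside EXTRACT, as the sample data seems to hint, the dictionary could be filtered by those texts afterwards. That is an assumption about the intent, not something the question states outright.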

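Note that you also need to use itertext to get all of the text inside all of the tags. For example, some p tags have inner tags. If you want the text value of those p tags, you need to apply itertext to get the inner text as well. You can see how this works in the line containing "".join(x.itertext()).strip().replace("\n", "").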
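
As a small illustration of why .text alone is not enough, the sketch below builds a single p element modeled on the question's "II. Summary" paragraph (the one-line XML string is made up for this example). Text that sits between child elements, such as "data 2", is stored on the preceding child as its .tail, not on the parent.

```python
from lxml import etree

# A one-line paragraph shaped like the ones in the question (made up for this demo).
p = etree.fromstring("<p><FT>data 1</FT> data 2 <FT>data 3</FT></p>")

# .text only covers the text before the first child; here there is none.
direct_text = p.text

# Text after a child's closing tag belongs to that child's .tail.
first_child_tail = p[0].tail

# itertext() yields every text and tail piece of the subtree in document order.
all_text = "".join(p.itertext())

print(repr(direct_text))
print(repr(first_child_tail))
print(repr(all_text))
```

This is why the question's loop needed both x.text and x.tail to see "data 2", and why joining itertext() is the shorter way to get everything at once.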
