Parsing XML with Python ElementTree to extract specific data


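I have an XML file that I need to parse. My understanding of Python and XML is fairly shaky. I'm using ElementTree to parse the document, and almost every reference I've found online has put me on the right track, but I'm coming up short on one particular aspect of my use case. Say my XML looks like this: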

<root>
    <blah>
       <blah/>
    </blah>
    <blah>
       <blah/>
    </blah>
    <CheckList>
       <CheckListChild>
           <category1>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open"/>
               <blah RuleTitle = "description" status = "Closed"/>
           </category1>
           <category2>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
           </category2>
           <category3>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open"/>
           </category3>
       </CheckListChild>
    </CheckList>
</root>

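I want to write a for loop that iterates over the children in the tree, but I only care about the category1 children that have status = "Open". Every reference I've read (including the Python docs) leads me to pull every element whose status is "Open", which also returns the category2 and category3 entries that I don't want. Is there a way to isolate this to the category1 children? Ultimately I want to append the output to a list, but I can figure that part out once the output is isolated.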

I tried something like this, but it doesn't work:

import xml.etree.ElementTree as ET 

report = xmldocument.xml
tree = ET.parse(report)
root = tree.getroot()

cat1findings = []
for item in root.iter('category1'):
    for blah in root.iter(blah):
      if blah.attrib == 'Status="Open"':
        cat1findings.append(output)

TIA!

python xml elementtree
1 Answer

I would use lxml and xpath to find it.

text = '''<root>
    <blah>
       <blah/>
    </blah>
    <blah>
       <blah/>
    </blah>
    <CheckList>
       <CheckListChild>
           <category1>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open">Hello World!</blah>
               <blah RuleTitle = "description" status = "Closed"/>
           </category1>
           <category2>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
           </category2>
           <category3>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open"/>
           </category3>
       </CheckListChild>
    </CheckList>
</root>
'''

import lxml.html

soup = lxml.html.fromstring(text)

data = soup.xpath('.//category1/blah[@status="Open"]')

for item in data:
    print('Parent:', item.getparent().tag)
    print('Status:', item.attrib['status'])
    print('RuleTitle:', item.attrib['ruletitle'])  # it needs lower case `ruletitle` instead of `RuleTitle`
    print('Text:', item.text)
    print('XML:', lxml.html.tostring(item))

Result:

Parent: category1
Status: Open
RuleTitle: description
Text: Hello World!
XML: b'<blah ruletitle="description" status="Open">Hello World!</blah>\n               '
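
Since the question asks about ElementTree specifically, here is a minimal standard-library sketch of the same idea. ElementTree's findall() accepts a limited XPath subset that includes [@attrib="value"] predicates, so the expression used with lxml above carries over unchanged. The sample document, the rule titles "rule A" to "rule C", and the helper name open_findings are made up for illustration. Unlike lxml.html, ElementTree requires well-formed XML and keeps attribute names case-sensitive, so the key here is RuleTitle. The attempt in the question fails for three reasons: the inner loop calls root.iter(...) instead of item.iter(...), so it walks the whole tree again; blah.attrib is a dict and never equals a string; and the attribute is named status, not Status.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the document in the question.
SAMPLE = """<root>
    <CheckList>
        <CheckListChild>
            <category1>
                <blah RuleTitle="rule A" status="Closed"/>
                <blah RuleTitle="rule B" status="Open"/>
            </category1>
            <category2>
                <blah RuleTitle="rule C" status="Open"/>
            </category2>
        </CheckListChild>
    </CheckList>
</root>"""


def open_findings(root, category="category1"):
    """Return the RuleTitle of every Open <blah> directly under `category`."""
    # The predicate filters on the attribute; the path restricts the parent.
    path = f'.//{category}/blah[@status="Open"]'
    return [blah.get("RuleTitle") for blah in root.findall(path)]


# For a file on disk, use ET.parse("xmldocument.xml").getroot() instead.
root = ET.fromstring(SAMPLE)
cat1findings = open_findings(root)
print(cat1findings)  # ['rule B']
```

If an explicit loop is preferred, iterating with "for item in root.iter('category1')" and then "for blah in item.iter('blah')", and testing blah.get('status') == 'Open', should give the same list.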