Parsing XML with Python ElementTree to extract specific data

Question

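I have an xml file that I need to parse. My understanding of python and xml is shaky. I am using ElementTree to parse the document, and nearly every reference I have found online has put me on the right track, but I am falling short on one specific aspect of my use case. If my xml looks like: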

<root>
    <blah>
       <blah/>
    </blah>
    <blah>
       <blah/>
    </blah>
    <CheckList>
       <CheckListChild>
           <category1>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open"/>
               <blah RuleTitle = "description" status = "Closed"/>
           </category1>
           <category2>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
           </category2>
           <category3>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open"/>
           </category3>
       </CheckListChild>
    </CheckList>
</root>

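I want to write a for loop that iterates over the children in the tree, but I only care about the category1 children that have status = "Open". Every reference I have read (including the python docs) leads me to pull every element with status "Open", which also pulls in the category2 and category3 items that I do not want. Is there a way to isolate this to the category1 children? I eventually want to append the output to a list, but I can figure that out once the output is isolated.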

I have tried something like this, but it does not work:

import xml.etree.ElementTree as ET 

report = xmldocument.xml
tree = ET.parse(report)
root = tree.getroot()

cat1findings = []
for item in root.iter('category1'):
    for blah in root.iter(blah):
      if blah.attrib == 'Status="Open"':
        cat1findings.append(output)

TIA!

Answer 1

I would use lxml and xpath to find it.

text = '''<root/>
    <blah>
       <blah/>
    <blah/>
    <blah>
       <blah/>
    <blah/>
    <CheckList>
       <CheckListChild>
           <category1>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open">Hello World!</blah>
               <blah RuleTitle = "description" status = "Closed"/>
            <category1/>
            <category2>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
            <category2/>
            <category3>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Closed"/>
               <blah RuleTitle = "description" status = "Open"/>
            <category3/>
'''

import lxml.html

soup = lxml.html.fromstring(text)

data = soup.xpath('.//category1/blah[@status="Open"]')

for item in data:
    print('Parent:', item.getparent().tag)
    print('Status:', item.attrib['status'])
    print('RuleTitle:', item.attrib['ruletitle'])  # it needs lower case `ruletitle` instead of `RuleTitle`
    print('Text:', item.text)
    print('XML:', lxml.html.tostring(item))

Result:

Parent: category1
Status: Open
RuleTitle: description
Text: Hello World!
XML: b'<blah ruletitle="description" status="Open">Hello World!</blah>\n               '
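Since the question asks about ElementTree specifically, the same filter can also be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: it assumes the document is well-formed XML (ElementTree, unlike lxml.html, rejects unclosed tags), and the SAMPLE string and its RuleTitle values are made up for illustration. ElementTree keeps attribute names case-sensitive, so `RuleTitle` is read as written.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the question's document.
SAMPLE = """<root>
    <CheckList>
        <CheckListChild>
            <category1>
                <blah RuleTitle="first rule" status="Closed"/>
                <blah RuleTitle="second rule" status="Open"/>
            </category1>
            <category2>
                <blah RuleTitle="third rule" status="Open"/>
            </category2>
        </CheckListChild>
    </CheckList>
</root>"""

root = ET.fromstring(SAMPLE)

# Option 1: let the path expression do the filtering. Only <blah> elements
# that are direct children of <category1> and have status="Open" match.
cat1findings = [
    item.get("RuleTitle")
    for item in root.findall(".//category1/blah[@status='Open']")
]

# Option 2: explicit loops, closer to the attempt in the question.
# The inner loop iterates over the current <category1> element, not root,
# and compares a single attribute value rather than the whole attrib dict.
loop_findings = []
for category in root.iter("category1"):
    for blah in category.iter("blah"):
        if blah.get("status") == "Open":
            loop_findings.append(blah.get("RuleTitle"))

print(cat1findings)
print(loop_findings)
```

Both variants collect only the category1 entry. To read a file instead of a string, `ET.parse(path).getroot()` can replace `ET.fromstring(SAMPLE)`.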
