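I have an XML file that I need to parse. My understanding of Python and XML is fairly shaky. I am using ElementTree to parse the document, and almost every reference I have found online gets me started on the right path, but I am falling short on one specific aspect of my use case. If my XML looks like: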
<root/>
<blah>
<blah/>
<blah/>
<blah>
<blah/>
<blah/>
<CheckList>
<CheckListChild>
<category1>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Open"/>
<blah RuleTitle = "description" status = "Closed"/>
<category1/>
<category2>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Open"/>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Closed"/>
<category2/>
<category3>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Open"/>
<category3/>
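I want to write a for loop that iterates over the children in the tree, but I only care about the category1 children that have status = "Open". Every reference I have read (including the Python docs) leads me to pull every element whose status is "Open", which also pulls in the category2 and category3 children that I do not want. Is there a way to isolate this to the category1 children? I eventually want to append the output to a list, but I can figure that out once I have isolated this output.
I tried something like this, but it does not work: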
import xml.etree.ElementTree as ET
report = xmldocument.xml
tree = ET.parse(report)
root = tree.getroot()
cat1findings = []
for item in root.iter('category1'):
    for blah in root.iter(blah):
        if blah.attrib == 'Status="Open"':
            cat1findings.append(output)
TIA!
I would use lxml and XPath to find it.
text = '''<root/>
<blah>
<blah/>
<blah/>
<blah>
<blah/>
<blah/>
<CheckList>
<CheckListChild>
<category1>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Open">Hello World!</blah>
<blah RuleTitle = "description" status = "Closed"/>
<category1/>
<category2>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Open"/>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Closed"/>
<category2/>
<category3>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Closed"/>
<blah RuleTitle = "description" status = "Open"/>
<category3/>
'''
import lxml.html
soup = lxml.html.fromstring(text)
data = soup.xpath('.//category1/blah[@status="Open"]')
for item in data:
    print('Parent:', item.getparent().tag)
    print('Status:', item.attrib['status'])
    print('RuleTitle:', item.attrib['ruletitle'])  # it needs lower case `ruletitle` instead of `RuleTitle`
    print('Text:', item.text)
    print('XML:', lxml.html.tostring(item))
Result:
Parent: category1
Status: Open
RuleTitle: description
Text: Hello World!
XML: b'<blah ruletitle="description" status="Open">Hello World!</blah>\n '
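
If you would rather stay with the standard library, the same idea can probably be expressed with ElementTree's limited XPath support. The sketch below assumes the real file is well-formed XML (the sample in the question is not, since it closes tags with `<category1/>`), so it uses a small hypothetical document with made-up RuleTitle values. It also sidesteps two likely problems in the original attempt: `blah.attrib` is a dict, so comparing it to a string is never true, and the inner `root.iter(...)` starts again from the root, so it visits every blah element regardless of category. The helper name `open_category1_findings` is only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the sample in the question.
# The RuleTitle values are made up for illustration.
XML_TEXT = """
<root>
  <CheckList>
    <CheckListChild>
      <category1>
        <blah RuleTitle="rule A" status="Closed"/>
        <blah RuleTitle="rule B" status="Open">Hello World!</blah>
      </category1>
      <category2>
        <blah RuleTitle="rule C" status="Open"/>
      </category2>
      <category3>
        <blah RuleTitle="rule D" status="Open"/>
      </category3>
    </CheckListChild>
  </CheckList>
</root>
"""


def open_category1_findings(xml_text):
    """Return the open findings that sit directly under a category1 element."""
    root = ET.fromstring(xml_text)
    findings = []
    # Restrict the search to blah children of category1 and filter on the
    # status attribute inside the path expression itself.
    for item in root.iterfind(".//category1/blah[@status='Open']"):
        findings.append({
            # A real XML parser keeps attribute names case-sensitive,
            # so `RuleTitle` is used here, not `ruletitle`.
            "RuleTitle": item.get("RuleTitle"),
            "status": item.get("status"),
            "text": item.text,
        })
    return findings


cat1findings = open_category1_findings(XML_TEXT)
print(cat1findings)
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(xml_text)`. If the file really is malformed like the sample, the lenient `lxml.html` parser shown above is the safer choice.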