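I have an XML document that I am iterating over. I need to find the node that precedes a given node (one with the tag "text" and a "bbox" attribute). The problem is that when the preceding tag has no "bbox" attribute, I want to ignore it and take the element before it instead, but I don't know how to do that. Here is the code: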
import lxml.etree as etree
from lxml.builder import E

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('fe3.xml', parser)
root = tree.getroot()
for x in tree.xpath('//text'):
    bb = x.attrib.get('bbox')
    if bb is not None:
        bb = bb.split(',')
        print('This: ', bb)
        xPrev = x.getprevious()
        bb = None
        if xPrev is not None:
            bb = xPrev.attrib.get('bbox')
            if bb is not None:
                bb = bb.split(',')
        if bb is not None:
            print(' Previous: ', bb)
        else:
            # bb is None in this branch, so calling bb.getprevious() raises
            # AttributeError; step back from xPrev instead
            xx = xPrev.getprevious() if xPrev is not None else None
            print(xx, ' No previous bbox')
For clarity, my XML is structured as follows (the real file is much longer):
<?xml version="1.0" encoding="utf-8"?>
<pages>
  <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="179.739,592.028,261.007,604.510">
      <textline bbox="179.739,592.028,261.007,604.510">
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
        <text font="NUMPTY+ImprintMTnum-it" bbox="192.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
        <text font="NUMPTY+ImprintMTnum-it" bbox="193.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
        <text font="NUMPTY+ImprintMTnum-it" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
        <text></text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
        <text></text>
      </textline>
    </textbox>
  </page>
</pages>
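I'm not 100% clear on what you are trying to achieve. That said:
As you iterate over the "text" nodes, you can simply add a variable and store the bbox of the previous node in it. Because that variable is only updated for nodes that have a bbox, elements without one are skipped automatically.
Here is the code I would use, assuming I have understood your goal correctly: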
x_prev = None
for x in tree.xpath('//text'):
    bb = x.attrib.get('bbox')
    if bb is not None:
        bb = bb.split(',')
        print('This: ', bb)
        if x_prev is not None:
            print(' Previous: ', x_prev)
        else:
            print(' No previous bbox')
        # Store this bounding box for the next loop (to be used as x_prev)
        x_prev = bb
To be clear, this code replaces your entire loop.
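To try the idea without the original file, here is a minimal self-contained sketch, not a definitive implementation. It swaps lxml for the standard library's `xml.etree.ElementTree` so it runs without extra dependencies, and it resets the remembered bbox at each `textline` so that "previous" never crosses a line boundary. The function name `bbox_pairs` and the inline sample data are assumptions made up for illustration; they are not part of the original code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modelled on the structure shown in the question.
SAMPLE = """<pages>
  <page id="1">
    <textbox id="0">
      <textline>
        <text bbox="1,1,2,2">C</text>
        <text></text>
        <text bbox="3,1,4,2">A</text>
      </textline>
      <textline>
        <text bbox="5,1,6,2">P</text>
      </textline>
    </textbox>
  </page>
</pages>"""


def bbox_pairs(root):
    """Yield (current_bbox, previous_bbox) for every <text> that has a bbox.

    previous_bbox is the bbox of the nearest earlier <text> in the same
    <textline> that has one, or None if there is no such element.
    """
    for line in root.iter('textline'):
        prev = None  # reset so "previous" never crosses a textline boundary
        for text in line.findall('text'):
            raw = text.get('bbox')
            if raw is None:
                continue  # no bbox: skip it and keep the remembered bbox
            bbox = raw.split(',')
            yield bbox, prev
            prev = bbox


root = ET.fromstring(SAMPLE)
for current, previous in bbox_pairs(root):
    print('This: ', current)
    if previous is not None:
        print(' Previous: ', previous)
    else:
        print(' No previous bbox')
```

If "previous" should also look back across `textline` boundaries, as the loop in the answer does, drop the per-line reset and iterate over `root.iter('text')` instead.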