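I have an XML file like the one below, and I want to insert a "newline" tag every time there is a certain distance in the coordinates (this is just one example; the files all differ):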
<?xml version="1.0" encoding="utf-8"?>
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="179.739,592.028,261.007,604.510">
<textline bbox="179.739,592.028,261.007,604.510">
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
<text></text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text></text>
</textline>
</textbox>
</page>
</pages>
However, my code does not work: the tree I print contains no newline elements. It should wrap each text tag up to the next one, for example:
<newline><text></text></newline><newline><text></text></newline>
and so on.
The code is:
import xml.etree.ElementTree as ET
import lxml.etree as etree
tree = ET.parse("fe2.xml")
root = tree.getroot()
node = ET.Element('newline')
for child in root.iter():
    if child.tag == 'text':
        #print(child.tag, child.attrib.items())
        for name, value in child.attrib.items():
            if name == 'bbox':
                value = tuple(value.split(","))
                x1 = float(value[0])
                x2 = float(value[2])
                distance = x2 - x1
                if distance > 10:
                    root.insert(3, node)
xml_str = ET.tostring(root, encoding='unicode')
print(xml_str)
How can I make this work?
This task is easier with lxml than with ElementTree, so I used the following imports:
import lxml.etree as etree
from lxml.builder import E
The second import provides a factory for new elements.
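As a small illustration of that factory (a sketch, not part of the original answer): attribute access on E returns a callable that builds an element with that tag name, so E.newline() creates an empty newline element. The variable name marker is only for illustration.

```python
import lxml.etree as etree
from lxml.builder import E

# E.<tag>() builds a new element whose tag is the attribute name.
marker = E.newline()
print(etree.tostring(marker, encoding='unicode'))  # <newline/>
```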
So that pretty printing can add line breaks and indentation, I read the source file as follows:
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('input.xml', parser)
root = tree.getroot()
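The effect of remove_blank_text can be seen in a minimal sketch. The SAMPLE string and the pretty helper are hypothetical stand-ins for the real input file. When whitespace-only text nodes are kept, the serializer leaves the existing layout alone; once they are dropped, pretty_print is free to re-indent.

```python
import lxml.etree as etree

SAMPLE = "<a>\n<b/>\n</a>"  # hypothetical stand-in for the input file

def pretty(xml_text, strip_blanks):
    """Parse xml_text and serialize it again with pretty_print=True."""
    parser = etree.XMLParser(remove_blank_text=strip_blanks)
    root = etree.fromstring(xml_text, parser)
    return etree.tostring(root, encoding='unicode', pretty_print=True)

kept = pretty(SAMPLE, strip_blanks=False)      # original whitespace preserved
stripped = pretty(SAMPLE, strip_blanks=True)   # re-indented by the serializer
print(kept)
print(stripped)
```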
Then I run the following main loop:
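for elem in root.iter('text'):
    bbox = elem.attrib.get('bbox')
    if bbox is not None:
        tbl = bbox.split(',')
        x1 = float(tbl[0])
        x2 = float(tbl[2])
        distance = x2 - x1  # horizontal extent of the box
        # Note: the question tests distance > 10; every bbox in the sample
        # is 7.594 wide, so only a "< 10" test matches there.
        if distance < 10:
            par = elem.getparent()
            # Insert an empty <newline/> right after the current <text>.
            par.insert(par.index(elem) + 1, E.newline())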
Note that the getparent method is available only in lxml.
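If you need to stay with the standard library, one common workaround is to build a child-to-parent map up front. The following is only a sketch under that assumption: the inline XML string and the parent_of name are illustrative, and the width test mirrors the one used in the loop above.

```python
import xml.etree.ElementTree as ET

XML = ('<textline>'
       '<text bbox="191.745,592.218,199.339,603.578">C</text>'
       '<text></text>'
       '</textline>')
root = ET.fromstring(XML)

# ElementTree has no getparent(), so map every child to its parent first.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Iterate over a snapshot, because the loop inserts into the tree.
for elem in list(root.iter('text')):
    bbox = elem.get('bbox')
    if bbox is None:
        continue
    x1, _, x2, _ = (float(v) for v in bbox.split(','))
    if x2 - x1 < 10:
        par = parent_of[elem]
        par.insert(list(par).index(elem) + 1, ET.Element('newline'))

result = ET.tostring(root, encoding='unicode')
print(result)
```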
Finally, you can, for example, pretty-print the resulting XML tree:
print(etree.tostring(root, encoding='unicode', pretty_print=True))
This gives the correct result, with the necessary newline elements inserted.
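The loop above inserts an empty newline element after each matching text element. If you instead want each matching text element wrapped inside a newline element, as the question's example suggests, a possible variant looks like this. It is a sketch under assumptions: the inline XML is a shortened stand-in for the real file, and each match is wrapped individually.

```python
import lxml.etree as etree
from lxml.builder import E

XML = ('<textline>'
       '<text bbox="191.745,592.218,199.339,603.578">C</text>'
       '<text></text>'
       '</textline>')
root = etree.fromstring(XML)

# Snapshot the matches first, because the loop restructures the tree.
for elem in list(root.iter('text')):
    bbox = elem.get('bbox')
    if bbox is None:
        continue
    x1, _, x2, _ = (float(v) for v in bbox.split(','))
    if x2 - x1 < 10:
        wrapper = E.newline()
        elem.getparent().replace(elem, wrapper)  # put <newline> where <text> was
        wrapper.append(elem)                     # then move <text> inside it

wrapped = etree.tostring(root, encoding='unicode', pretty_print=True)
print(wrapped)
```

Elements without a bbox attribute, such as the empty text elements in the sample, are left where they are.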