How do I insert a parent node in XML with Python?


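I have an XML file like the one below, and I want to insert a "newline" tag every time the coordinates span a certain distance (this file is only an example; the files all differ):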

<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
            <textline bbox="179.739,592.028,261.007,604.510">
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text></text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text></text>
            </textline>
        </textbox>
    </page>
</pages>

However, my code doesn't work: the tree I print contains no newline tags. It should wrap the text tags up to the next one, like this:

<newline><text></text></newline><newline><text></text></newline>

The code is:

import xml.etree.ElementTree as ET
import lxml.etree as etree
tree = ET.parse("fe2.xml")
root = tree.getroot()
node = ET.Element('newline')


for child in root.iter():
    if child.tag == 'text':
        #print(child.tag, child.attrib.items())
        for name, value in child.attrib.items():
                if name == 'bbox':
                        value = tuple(value.split(","))
                        x1 = float(value[0])
                        x2 = float(value[2])
                        distance = x2 - x1
                        if distance > 10:
                                root.insert(3, node)
                                xml_str = ET.tostring(root, encoding='unicode')
                                print(xml_str)

How can I make this work?

python xml pdf tags elementtree
1 Answer

This task is easier with lxml than with ElementTree, so I used the following imports:

import lxml.etree as etree
from lxml.builder import E

The second import provides a factory for new elements.
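
As a small illustration of that factory, here is a minimal sketch of how `E` builds elements. The attribute value and the text "C" are borrowed from the sample file purely for demonstration.

```python
import lxml.etree as etree
from lxml.builder import E

# E.<tag>() builds an element with that tag name.
empty = E.newline()

# E(tag, *children, **attributes) is the call form; a string child becomes
# the element's text, and an element child is appended as a subelement.
wrapped = E.newline(E("text", "C", size="12.482"))

empty_xml = etree.tostring(empty, encoding="unicode")
wrapped_xml = etree.tostring(wrapped, encoding="unicode")
print(empty_xml)
print(wrapped_xml)
```

The first call gives an empty element, the second one a `newline` element that already contains a `text` child.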

So that pretty-printing works after the newline elements are inserted, I read the source file as follows:

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('input.xml', parser)
root = tree.getroot()
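
To see why `remove_blank_text=True` matters, here is a minimal sketch comparing the two parser settings on a tiny inline document (`SRC` and the element names `a`, `b`, `c` are made up for the demonstration). When the original indentation is kept as text, lxml's pretty printer generally leaves newly added elements unindented.

```python
import lxml.etree as etree

SRC = "<a>\n  <b/>\n</a>"

# Default parser: the indentation is kept as whitespace-only text nodes.
kept = etree.fromstring(SRC)
# remove_blank_text=True: those whitespace-only text nodes are discarded.
stripped = etree.fromstring(SRC, etree.XMLParser(remove_blank_text=True))

# Add a new element to both trees, as the main loop will do later.
kept.append(etree.Element("c"))
stripped.append(etree.Element("c"))

kept_pretty = etree.tostring(kept, encoding="unicode", pretty_print=True)
stripped_pretty = etree.tostring(stripped, encoding="unicode", pretty_print=True)
print(kept_pretty)
print(stripped_pretty)
```

Only the second tree is re-indented consistently, which is why the parser option is set before any elements are inserted.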

Then I ran the following main loop:

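for elem in root.iter('text'):
    bbox = elem.attrib.get('bbox')
    if bbox is not None:
        # bbox is "x1,y1,x2,y2"; the width is x2 - x1
        tbl = bbox.split(',')
        x1 = float(tbl[0])
        x2 = float(tbl[2])
        distance = x2 - x1
        # The question's code tests distance > 10; adjust the comparison as needed
        if distance < 10:
            # Insert an empty <newline/> right after the current element
            par = elem.getparent()
            par.insert(par.index(elem) + 1, E.newline())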

Note that the getparent method is available only in lxml.
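
If you have to stay with the standard library's xml.etree.ElementTree, a common workaround is to build a child-to-parent map first. The following is a minimal sketch under stated assumptions: `SAMPLE` is a made-up inline fragment rather than the asker's file, and the `distance < 10` test is simply copied from the loop above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<textline>
<text bbox="191.745,592.218,199.339,603.578">C</text>
<text></text>
<text bbox="10.0,0.0,30.0,5.0">A</text>
</textline>"""

root = ET.fromstring(SAMPLE)

# ElementTree elements do not know their parent, so build the mapping once.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Collect the matches first so the tree is not modified while iterating.
narrow = []
for elem in root.iter("text"):
    bbox = elem.get("bbox")
    if bbox is None:
        continue
    x1, _, x2, _ = (float(v) for v in bbox.split(","))
    if x2 - x1 < 10:
        narrow.append(elem)

# Insert an empty <newline> element right after each match.
for elem in narrow:
    parent = parent_of[elem]
    parent.insert(list(parent).index(elem) + 1, ET.Element("newline"))

tags = [child.tag for child in root]
print(tags)
```

Only the first `text` element is narrower than 10 units here, so a single `newline` element is inserted after it.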

Finally, you can pretty-print the resulting XML tree, for example:

print(etree.tostring(root, encoding='unicode', pretty_print=True))

I got the proper result, with the necessary newline elements inserted.
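
The loop above adds an empty newline element after each match. The question, however, asks for a newline element that wraps the text element as its parent. Below is a minimal sketch of that variant, assuming each matching text element is wrapped on its own (the question does not spell out the grouping rule). `SAMPLE` and the helper `bbox_width` are hypothetical names introduced for the illustration.

```python
import lxml.etree as etree

SAMPLE = """<textline>
<text bbox="191.745,592.218,199.339,603.578">C</text>
<text></text>
<text bbox="10.0,0.0,30.0,5.0">A</text>
</textline>"""

parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(SAMPLE, parser)

def bbox_width(elem):
    """Return x2 - x1 from an 'x1,y1,x2,y2' bbox attribute, or None if absent."""
    bbox = elem.get("bbox")
    if bbox is None:
        return None
    x1, _, x2, _ = (float(v) for v in bbox.split(","))
    return x2 - x1

# Collect the matches before editing, so the iteration is not disturbed.
matches = []
for elem in root.iter("text"):
    width = bbox_width(elem)
    if width is not None and width < 10:
        matches.append(elem)

for elem in matches:
    wrapper = etree.Element("newline")
    # Keep any trailing text outside the wrapper.
    wrapper.tail, elem.tail = elem.tail, None
    elem.getparent().replace(elem, wrapper)  # wrapper takes elem's position
    wrapper.append(elem)                     # elem becomes the wrapper's child

child_tags = [child.tag for child in root]
result = etree.tostring(root, encoding="unicode", pretty_print=True)
print(result)
```

Grouping several consecutive text elements under one wrapper would follow the same pattern: create the wrapper once, place it where the first element was, then append each element of the group to it.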
