How do I parse an XML-like text file in Python?

I have a text file in an XML-like language that looks like this:

<StoryText>
                <DefaultStyle/>
                <para ALIGN="3" LINESP="10"/>
                <tab FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
                <ITEXT FONT="Times New Roman Bold" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
                <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
                
</StoryText>

My goal is to parse this file in Python so that I can replace the content of the CH= attribute with other text.

Example:

> <ITEXT FONT="Times New Roman Bold" FONTSIZE="10" FEATURES="inherit"
> FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5"
> TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1"
> TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0"
> CH="**TEXT**"/>

becomes

> <ITEXT FONT="Times New Roman Bold" FONTSIZE="10" FEATURES="inherit"
> FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5"
> TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1"
> TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0"
> CH="**REPLACEMENT TEXT**"/>

I tried using the xml.etree.ElementTree library with its parse and getroot methods, as I usually do, but here I get this error message:

xml.etree.ElementTree.ParseError: no element found

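This message apparently appears because the file is not real XML, even though it looks similar.

Do you know how I can achieve this? Note: I am not allowed to reformat the input file by changing its structure, because it is a Scribus .sla file.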

python-3.x parsing xml-parsing scribus
1 Answer

Try:

import xml.etree.ElementTree as ET

xml_data = """\
<StoryText>
    <DefaultStyle/>
    <para ALIGN="3" LINESP="10"/>
    <tab FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0"/>
    <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
    <ITEXT FONT="Times New Roman Bold" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
    <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="TEXT"/>
</StoryText>"""


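# Parse the document from the string; root is the <StoryText> element.
root = ET.fromstring(xml_data)

# Visit every <ITEXT> element and swap the CH attribute where it matches.
for elem in root.iter("ITEXT"):
    if "TEXT" == elem.get("CH"):
        elem.attrib["CH"] = "REPLACEMENT TEXT"

# Serialize the modified tree back to text.
print(ET.tostring(root, encoding="utf-8").decode("utf-8"))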

Prints:

<StoryText>
    <DefaultStyle />
    <para ALIGN="3" LINESP="10" />
    <tab FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" />
    <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="REPLACEMENT TEXT" />
    <ITEXT FONT="Times New Roman Bold" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="REPLACEMENT TEXT" />
    <ITEXT FONT="Times New Roman Regular" FONTSIZE="10" FEATURES="inherit" FCOLOR="Black" FSHADE="100" SCOLOR="Black" SSHADE="100" TXTSHX="5" TXTSHY="-5" TXTOUT="1" TXTULP="-0.1" TXTULW="-0.1" TXTSTP="-0.1" TXTSTW="-0.1" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="REPLACEMENT TEXT" />
</StoryText>
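
Since the snippet in the question parses fine from a string, the error may come from how the file is read rather than from its content. As far as I know, "no element found" usually means the parser reached the end of its input without finding a root element, for example an empty file or a file object that has already been read. The following is a minimal sketch of the same replacement done against a file on disk. The function name replace_ch and the file names input.sla and output.sla are made up for illustration, and the sample content is a shortened version of the question's snippet, not a complete Scribus document.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def replace_ch(src, dst, old, new):
    """Replace CH="old" with CH="new" on every ITEXT element.

    Hypothetical helper: reads src, writes the result to dst,
    and returns the number of attributes that were changed.
    """
    tree = ET.parse(src)  # raises ParseError on empty or malformed input
    changed = 0
    for elem in tree.getroot().iter("ITEXT"):
        if elem.get("CH") == old:
            elem.set("CH", new)
            changed += 1
    # Write to a separate file so the original stays untouched.
    tree.write(dst, encoding="utf-8", xml_declaration=True)
    return changed


# Shortened sample data, assumed for the demo only.
sample = (
    "<StoryText>"
    '<para ALIGN="3" LINESP="10"/>'
    '<ITEXT FONT="Times New Roman Regular" CH="TEXT"/>'
    '<ITEXT FONT="Times New Roman Bold" CH="TEXT"/>'
    "</StoryText>"
)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.sla")
    dst = os.path.join(tmp, "output.sla")
    with open(src, "w", encoding="utf-8") as f:
        f.write(sample)

    changed = replace_ch(src, dst, "TEXT", "REPLACEMENT TEXT")

    with open(dst, encoding="utf-8") as f:
        result = f.read()

print(changed)
print(result)
```

Note that ElementTree re-serializes the whole tree, so details such as the space before "/>" or the XML declaration may differ from what Scribus wrote. Keep a backup of the original .sla file and check that Scribus still opens the result.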