How do I keep UTF-8 encoding when parsing XML, changing an attribute value, and writing to a file? (Python)


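I am writing a program in Python, and my goals are:

  • Read the input XML file one line at a time
  • For each line, look for the "CH" attribute
  • Change the attribute value: translate it from French to Portuguese
  • Write the changed line to the output XML file
  • Since I process text in various languages, I want to keep UTF-8 encoding so that foreign special characters are displayed in the output file

My code: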

import os
import xml.etree.ElementTree as ET
from googletrans import Translator

with open("input file.txt", "r", encoding='utf-8') as input_file:
    with open("output file.txt", "w", encoding='utf-8') as output_file:
        # Read input file
        for ligne in input_file:
            # line parse
            root = ET.fromstring(ligne)

            # Change CH attribute value, translate from French (fr) to Portuguese (pt)
            current_text = root.get("CH")
            translator = Translator()
            translated_text = translator.translate(dest="pt", src="fr", text=current_text)
            root.attrib["CH"] = translated_text.text

            # convert bytes to string
            decoded_string = ET.tostring(root).decode("utf-8")

            # write output file
            output_file.write(decoded_string)

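The problem is that the output file contains numeric character references instead of the literal characters. For example, with the following input file: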

<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="la victoire est à nous"/>
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="vive l'empereur"/>
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

I get this result:

<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />                                          
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />                 
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

instead of the expected result:


<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" /> 
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />                 
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

I checked by printing it that translated_text.text is correct ("A vitória é nossa"), but despite the UTF-8 encoding specification, the value of decoded_string is still wrong:

<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />

I don't understand why I get this result. Can you help me?

python utf-8 type-conversion googletrans
1 Answer

I ran into errors with googletrans, but the following demonstrates how to change and write the text correctly:

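import xml.etree.ElementTree as ET

# Parse the whole document at once instead of line by line
tree = ET.parse('input file.txt')
# Pair each ITEXT element with a hard-coded translation (stand-in for googletrans)
for itext, replacement in zip(tree.iterfind('*/ITEXT'), ['A vitória é nossa', 'Vida longa ao']):
    itext.attrib['CH'] = replacement
# Writing with encoding='UTF-8' keeps non-ASCII characters as literal characters
tree.write('output file.txt', xml_declaration=True, encoding='UTF-8')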

output file.txt

<?xml version='1.0' encoding='UTF-8'?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0" />
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" />
                <para ALIGN="1" LINESP="10" />
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />
                <trail ALIGN="1" LINESP="10" />
        </StoryText>
</SCRIBUSUTF8NEW>
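
As for why the question's code produced &#243; and &#233;: ET.tostring() defaults to encoding="us-ascii", so any character outside ASCII is emitted as a numeric character reference, and decoding those bytes as UTF-8 afterwards does not undo that. Passing encoding="unicode" should return a str with the characters left intact. A minimal sketch, assuming a single ITEXT line and a hard-coded replacement string in place of the googletrans result:

```python
import xml.etree.ElementTree as ET

# One line from the input file (shortened attribute list for illustration)
line = '<ITEXT FONT="Times New Roman Bold" CH="la victoire est à nous"/>'
elem = ET.fromstring(line)

# Hard-coded replacement standing in for the googletrans result (assumption)
elem.set("CH", "A vitória é nossa")

# Default: encoding="us-ascii", non-ASCII characters become &#NNN; references
ascii_bytes = ET.tostring(elem)

# encoding="unicode" returns a str and keeps the characters as they are
as_text = ET.tostring(elem, encoding="unicode")

print(ascii_bytes.decode("utf-8"))
print(as_text)
```

The str returned with encoding="unicode" can then be written to a file opened with encoding='utf-8', as in the question. Both forms are equivalent to an XML parser; the difference only matters for how the file reads as text.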