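I am writing a program in Python. As the comments in the code show, my goal is to translate the CH attribute values of an XML file from French to Portuguese.
My code: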
import os
import xml.etree.ElementTree as ET
from googletrans import Translator

with open("input file.txt", "r", encoding='utf-8') as input_file:
    with open("output file.txt", "w", encoding='utf-8') as output_file:
        # Read input file
        for ligne in input_file:
            # Parse the line
            root = ET.fromstring(ligne)
            # Change the CH attribute value: translate from French (fr) to Portuguese (pt)
            current_text = root.get("CH")
            translator = Translator()
            translated_text = translator.translate(dest="pt", src="fr", text=current_text)
            root.attrib["CH"] = translated_text.text
            # Convert bytes to string
            decoded_string = ET.tostring(root).decode("utf-8")
            # Write output file
            output_file.write(decoded_string)
The problem is that I get non-encoded characters in the output file. For example, with the following input file:
<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
<StoryText>
<DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="la victoire est à nous"/>
<para ALIGN="1" LINESP="10"/>
<ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="vive l'empereur"/>
<trail ALIGN="1" LINESP="10"/>
</StoryText>
</SCRIBUSUTF8NEW>
I get this result:
<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
<StoryText>
<DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />
<para ALIGN="1" LINESP="10"/>
<ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />
<trail ALIGN="1" LINESP="10"/>
</StoryText>
</SCRIBUSUTF8NEW>
instead of the expected result:
<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
<StoryText>
<DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" />
<para ALIGN="1" LINESP="10"/>
<ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />
<trail ALIGN="1" LINESP="10"/>
</StoryText>
</SCRIBUSUTF8NEW>
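I have checked by printing it that translated_text.text is well formed ("A vitória é nossa"), but despite the UTF-8 encoding specification, the value of decoded_string is wrong:
<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />
I don't understand why I get this result. Can you help me?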
I ran into errors with googletrans, but the following demonstrates how to change and write the text correctly:
import xml.etree.ElementTree as ET

tree = ET.parse('input file.txt')
# Hard-coded replacements stand in for the googletrans results
for itext, replacement in zip(tree.iterfind('*/ITEXT'), ['A vitória é nossa', 'Vida longa ao']):
    current_text = itext.get('CH')  # the French text that would be sent to the translator
    itext.attrib['CH'] = replacement
tree.write('output file.txt', xml_declaration=True, encoding='UTF-8')
output file.txt:
<?xml version='1.0' encoding='UTF-8'?>
<SCRIBUSUTF8NEW Version="1.5.5">
<StoryText>
<DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0" />
<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" />
<para ALIGN="1" LINESP="10" />
<ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />
<trail ALIGN="1" LINESP="10" />
</StoryText>
</SCRIBUSUTF8NEW>
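As for why the code in the question produces different output: ET.tostring() defaults to encoding="us-ascii", so any non-ASCII character is written as a numeric character reference (for example &#243; for ó), and the later .decode("utf-8") cannot undo that. The sketch below is only an illustration of that behavior on a single ITEXT line taken from the question; the variable names ascii_default and as_text are mine, and the translated string is hard-coded in place of a googletrans call.

```python
import xml.etree.ElementTree as ET

# One line from the question's input, parsed on its own
line = '<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="la victoire est à nous"/>'
root = ET.fromstring(line)

# Hard-coded stand-in for the translated text (no googletrans call here)
root.set("CH", "A vitória é nossa")

# Default encoding is "us-ascii": non-ASCII characters become numeric
# character references, and decoding the bytes afterwards keeps them as-is
ascii_default = ET.tostring(root).decode("utf-8")

# encoding="unicode" returns a str with the characters left intact
as_text = ET.tostring(root, encoding="unicode")

print(ascii_default)
print(as_text)
```

If you want to keep the line-by-line approach from the question, passing encoding="unicode" and writing the resulting string to a file opened with encoding='utf-8' should be enough. Both forms are equivalent to an XML parser; the difference only matters if the file needs to stay human-readable.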