Corrupted Hebrew: saved as ANSI - convert back to UTF-8

Question

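I suspect that some data was saved as ANSI (on a Windows machine). As a result, the original Hebrew characters were lost, and we now see something like this:

ùéôåãé äòéø

Is the information lost, or is it possible to map the characters back, given that we know the original text was Hebrew?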

Answer 1

The information is probably not lost, or at most only partially lost. If you want to use Python:

import codecs

BLOCKSIZE = 1048576  # or some other desired chunk size

# Read the file as windows-1255 (the Hebrew ANSI code page) and write it back out as UTF-8.
with codecs.open("input.txt", "r", "windows-1255") as sourceFile:
    with codecs.open("output.txt", "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break  # end of file reached
            targetFile.write(contents)

Stolen and adapted from "How to convert a file to utf-8 in Python?"
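To see what "at most partially lost" can mean in practice, here is a minimal sketch that decodes raw bytes as windows-1255 and counts the bytes that have no mapping in that code page. The function name `decode_hebrew_ansi` is hypothetical, and the sketch assumes the data really is windows-1255; it is an illustration, not a definitive recovery tool.

```python
from typing import Tuple


def decode_hebrew_ansi(raw: bytes) -> Tuple[str, int]:
    """Decode bytes as windows-1255 (assumed encoding).

    Returns the decoded text and the number of bytes that windows-1255
    leaves undefined; those are replaced with U+FFFD instead of raising.
    """
    text = raw.decode("windows-1255", errors="replace")
    # windows-1255 never maps a byte to U+FFFD, so every occurrence
    # marks a byte that could not be recovered.
    unrecoverable = text.count("\ufffd")
    return text, unrecoverable
```

A count of zero suggests that every byte mapped cleanly; a non-zero count points at bytes that are not valid windows-1255 and may need a different source encoding.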

You can also use an external tool such as iconv:

iconv -f windows-1255 -t utf-8 input.txt > output.txt

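iconv is available on most Linux distributions, Cygwin, and other platforms.

If the file was doubly mangled (Hebrew bytes misread as windows-1252 and then saved as UTF-8), you may need to do something like this: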

iconv -f utf-8 -t windows-1252 input.txt > tmp.txt
iconv -f windows-1255 -t utf-8 tmp.txt > output.txt

But the chance of that having happened is very slim.
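For readers who would rather stay in Python, the two iconv steps can be sketched as a single in-memory round trip. This assumes the mangled text came from windows-1255 bytes being misread as windows-1252; the function name `undo_double_mangling` is hypothetical.

```python
def undo_double_mangling(mangled: str) -> str:
    """Reverse a windows-1255 -> windows-1252 misread (assumed scenario)."""
    # Step 1 (like: iconv -f utf-8 -t windows-1252): recover the raw bytes.
    raw = mangled.encode("windows-1252")
    # Step 2 (like: iconv -f windows-1255 -t utf-8): reinterpret them as Hebrew.
    return raw.decode("windows-1255")
```

Applied to the sample from the question, `undo_double_mangling("ùéôåãé äòéø")` should yield Hebrew letters. Either step raises a `UnicodeError` if the text does not fit the assumed scenario.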

Answer 2

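I had a very similar problem, with text that looked corrupted in the same way. An online decoder told me that, for some reason, the text had been encoded as iso-8859-1 rather than iso-8859-8:

text.encode("iso-8859-1").decode("iso-8859-8")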
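Since the original text is known to be Hebrew, the result of the round trip can be sanity-checked before it is accepted. The sketch below is one possible heuristic, not part of either answer: the helper names `hebrew_ratio` and `repair_if_hebrew` and the 0.8 threshold are assumptions chosen for illustration.

```python
def hebrew_ratio(text: str) -> float:
    """Fraction of alphabetic characters inside the Hebrew block (U+0590-U+05FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hebrew = sum(1 for ch in letters if "\u0590" <= ch <= "\u05ff")
    return hebrew / len(letters)


def repair_if_hebrew(text: str, threshold: float = 0.8) -> str:
    """Apply the iso-8859-1 -> iso-8859-8 round trip only if the result looks Hebrew."""
    try:
        candidate = text.encode("iso-8859-1").decode("iso-8859-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the assumed corruption pattern; leave it alone.
        return text
    # Keep the repaired version only when most letters are Hebrew (assumed threshold).
    return candidate if hebrew_ratio(candidate) >= threshold else text
```

Plain ASCII text passes through unchanged, because the round trip leaves it without any Hebrew letters and the ratio stays below the threshold.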
