How to read a text file with unknown format and save it as utf-8?

Question

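I have a text file in an unknown encoding that contains some German characters (umlauts). I want to open this file with Python and read it as "utf-8". However, everything I try produces this error: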

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 1664: invalid continuation byte

What I have tried so far:

open(filepath, "rb").read().decode("utf-8")

I also tried:

open(filepath, "r", "utf-8")

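I know that I can, for example, open the file in a text editor such as Notepad, and when I click "Save As" I can choose the file's encoding. After saving it as utf-8, I can of course process it with Python by calling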

open(filepath)

But how can I achieve the same result using only Python (without the text editor step)? I assume I could somehow make the decoder work by suppressing errors, but I don't know how...

Answer 1

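0xE4 is the Windows-1252 encoding of ä (lowercase "a" with an umlaut), so it looks like your file is Windows-1252 encoded. The same byte also means ä in ISO-8859-1, so this is an educated guess rather than a certainty.

To read a Windows-1252 encoded file, pass the encoding name cp1252: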

open(filepath, "r", encoding="cp1252")
# or
open(filepath, "rb").read().decode("cp1252")
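To get the "Save As utf-8" effect the question asks for, the decoded text still has to be written back out as UTF-8. The following is a minimal sketch, assuming the source really is cp1252; `convert_to_utf8` and its parameter names are hypothetical and not part of any library.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode a text file from src_encoding to UTF-8 (hypothetical helper)."""
    # newline="" disables newline translation, so line endings are preserved.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    # Write the same text back out, this time encoded as UTF-8.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

A call such as `convert_to_utf8("input.txt", "output.txt")` (file names are placeholders) would then leave a UTF-8 copy that a plain `open(..., encoding="utf-8")` can read.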

Answer 2

UTF-8 uses anywhere from 1 to 4 bytes to encode a code point, depending on the code point's value. See Josh Lee's answer to "UnicodeDecodeError, invalid continuation byte":

>>> b'\xe9\x80\x80'.decode('utf-8')
'退'
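The variable-length rule, and the reason a lone 0xE4 fails, can be checked directly. This is a small illustrative sketch; the sample characters and the word "Bären" are chosen here for demonstration and do not come from the asker's file.

```python
# Characters from increasingly high code point ranges need more UTF-8 bytes.
samples = ["a", "ä", "退", "😀"]
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")

# In cp1252, ä is the single byte 0xE4. UTF-8 treats 0xE4 as the first byte
# of a 3-byte sequence, so the following plain ASCII byte is rejected.
raw = "Bären".encode("cp1252")
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason
    print(reason)
```

The reason printed at the end is the same "invalid continuation byte" message shown in the question.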

If you want to read a text file, you can do this:

open('sample-file.txt', mode='r', encoding='utf-8').read()

If you want to write anything to a text file, you can do this:

open('a-new-file.txt', mode='w', encoding='utf-8')

Example:

open('questions.txt', mode='w', encoding='utf-8').write('How to read a text file with unknown format and save it as utf-8?')

If needed, you can also read a file with any of these three encodings:

sample_text_default = open('questions.txt', encoding='utf-8').read()
print(sample_text_default)

sample_text_iso = open('sample-character-encoding.txt', encoding='iso-8859-1').read()
print(sample_text_iso)

# 'ascii' raises UnicodeDecodeError if the file contains any byte >= 0x80
sample_text_ascii = open('sample-character-encoding.txt', encoding='ascii').read()
print(sample_text_ascii)
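When the encoding really is unknown, one common approach is to try a short list of candidates in order and only then fall back to suppressing errors, as the question suggests. The sketch below is an assumption-laden illustration: `decode_with_fallback` is a hypothetical helper, and the candidate order (utf-8 first, then cp1252) is a guess that suits Western European text, not a general detection method.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Decode bytes with the first candidate encoding that succeeds.

    Returns (text, label). Hypothetical helper for illustration only.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Last resort: keep going and mark undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace"), "utf-8 (errors replaced)"
```

UTF-8 goes first because random non-UTF-8 bytes rarely form valid UTF-8, whereas cp1252 accepts almost any byte sequence and would mask a correct UTF-8 file. To apply it to a file, read the raw bytes with `open(filepath, "rb").read()` and pass them in.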