Python - decoding error ('ascii' codec can't decode byte 0x94 in position 19 ...)

Question

Hi :) I have a large bin file that has been gzipped (so it is a blabla.bin.gz).

I need to decompress it and write it to a txt file in ASCII format. Here is my code:

import gzip

with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:   

    file_content = f.read()
    file_content.decode("ascii")
    output = open("new_file.txt", "w", encoding="ascii")
    output.write(file_content)
    output.close()

But I get this error:

file_content.decode("ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 19: ordinal not in range(128)

I'm not a complete beginner, but format/encoding problems have always been my biggest weakness :(

Please, can you help me?

Thank you!!!

Answer 1

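First of all, there is no reason to decode anything only to write it straight back out as raw bytes. A simpler (and more robust) implementation could be: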

import gzip

with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:
    file_content = f.read()
    with open("new_file.txt", "wb") as output:  # just directly write raw bytes
        output.write(file_content)
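Since the question mentions that the file is large, reading it all with f.read() may use a lot of memory. Below is a minimal sketch of a streaming variant, assuming the goal is still just to write the decompressed raw bytes. The function name gunzip_to_file and the chunk_size default are hypothetical choices for illustration, not part of the original answer.

```python
import gzip
import shutil


def gunzip_to_file(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress src_path into dst_path chunk by chunk (hypothetical helper)."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj reads and writes in blocks of chunk_size bytes,
        # so the whole decompressed file never has to sit in memory at once
        shutil.copyfileobj(src, dst, chunk_size)
```

It could be called as gunzip_to_file("GoogleNews-vectors-negative300.bin.gz", "new_file.txt").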

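If you really want to decode but are unsure of the encoding, you can use Latin1. Every byte is valid in Latin1 and is converted to the Unicode character with the same value. So whatever the byte string bs is, bs.decode('Latin1').encode('Latin1') is simply a copy of bs.

Finally, if you really need to filter out all non-ASCII characters, you can use the errors parameter of decode: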
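The Latin1 round trip can be checked on a small sample. This is only an illustrative sketch: the variable names are made up, and the sample is simply every possible byte value.

```python
# every possible byte value, 0x00 through 0xFF
bs = bytes(range(256))

# decoding never fails: each byte maps to the code point with the same value
text = bs.decode("latin-1")
same_values = [ord(ch) for ch in text] == list(bs)

# encoding back yields an identical copy of the original bytes
round_trip = text.encode("latin-1")
print(same_values, round_trip == bs)
```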

file_content = file_content.decode("ascii", errors="ignore") # just remove any non ascii byte

Or:

import gzip

with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:
    file_content = f.read()

# non-ASCII bytes are replaced with the U+FFFD replacement character
file_content = file_content.decode("ascii", errors="replace")

# on writing, U+FFFD cannot be encoded as ASCII, so it becomes a question mark "?"
with open("new_file.txt", "w", encoding="ascii", errors="replace") as output:
    output.write(file_content)
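To see what the two error handlers do, here is a small sketch on a made-up byte string containing the 0x94 byte from the error message. The sample value and variable names are assumptions for illustration only.

```python
# hypothetical sample: ASCII text with one non-ASCII byte (0x94) in the middle
sample = b"abc\x94def"

# errors="ignore" silently drops the offending byte
ignored = sample.decode("ascii", errors="ignore")

# errors="replace" substitutes U+FFFD when decoding
replaced = sample.decode("ascii", errors="replace")

# encoding that text back to ASCII with errors="replace" turns U+FFFD into "?"
written = replaced.encode("ascii", errors="replace")

print(repr(ignored), repr(replaced), written)
```

Note that both handlers lose information, so the output file is no longer a faithful copy of the binary data.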
