UTF-8 text file cannot be imported with pandas and raises a UTF-8 encoding error

Question

I have a text file, exported from SQL as UTF-8, that contains about 5.5 million rows. I tried to read it with Pandas/Python, but I get:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 135596: invalid continuation byte

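How can I fix this? I loaded the file into Notepad++ and tried "Convert to UTF-8", but got the same result. I tried stepping through with a debugger, but pandas parses fairly large chunks, so I could not pinpoint which character it chokes on. I also tried reading the file in binary mode and inspecting position 135596, but saw nothing unusual there.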

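Any suggestions for how to identify the problem in our data? At this point I am considering a binary search (split the data in half, see which half raises the error, and keep splitting that way until I find it), but that is a lot of text.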
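A linear scan may be simpler than the binary search described above. The sketch below reads the file in binary mode, tries to decode each line as UTF-8 on its own, and reports the line number and absolute file offset of each failure. The function name `find_invalid_utf8_lines` and its `limit` parameter are hypothetical, introduced only for illustration. One possible reason that position 135596 looked normal is that the position in the error message may be relative to the buffer being decoded rather than to the start of the file; that is an assumption, not something the question confirms.

```python
from typing import Iterator, Tuple


def find_invalid_utf8_lines(path: str, limit: int = 10) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (line_number, file_offset_of_bad_byte, raw_line) for lines that are not valid UTF-8.

    Hypothetical helper: stops after `limit` bad lines so a badly
    encoded file does not flood the output.
    """
    offset = 0  # absolute byte offset of the start of the current line
    found = 0
    with open(path, "rb") as f:
        # Iterating a binary file splits on b"\n"; valid UTF-8 multi-byte
        # sequences never contain that byte, so the split is safe.
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the index of the offending byte within `raw`
                yield line_number, offset + exc.start, raw
                found += 1
                if found >= limit:
                    return
            offset += len(raw)
```

Calling `for n, pos, raw in find_invalid_utf8_lines('your_file.csv'): print(n, pos, raw)` prints each offending line as a bytes literal, so the bad byte shows up as an escape such as `\xed` and can be compared with what the SQL export was supposed to contain.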

Answer 1

You can use the chardet library to detect the file's encoding and then open the file with that encoding. Here is an example:

import chardet
import pandas as pd

path = 'your_file.csv'

try:
    # Detect the encoding from the raw bytes ('rb' = read binary).
    # Note: f.read() loads the whole file into memory.
    with open(path, 'rb') as f:
        result = chardet.detect(f.read())
    # Read the file with the detected encoding
    df = pd.read_csv(path, encoding=result['encoding'])
    # Additional code to process the DataFrame goes here
except Exception as exc:
    # Report the exception type, the exception instance, and the line that raised it
    print("Error: ", type(exc), ", exception instance: ", exc,
          ", line: ", exc.__traceback__.tb_lineno)
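chardet returns a guess, so it can help to check candidate encodings directly. In Latin-1 and Windows-1252 the byte 0xed from the error message is the character "í", so one plausible (unconfirmed) explanation is that the export is really in one of those encodings. The sketch below uses only the standard library; the function name `pick_encoding` and the candidate list are assumptions made for illustration.

```python
import codecs
from typing import Optional, Sequence


def pick_encoding(path: str,
                  candidates: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
                  chunk_size: int = 1 << 20) -> Optional[str]:
    """Return the first candidate encoding that decodes the whole file, or None.

    Hypothetical helper. The file is decoded in chunks with an incremental
    decoder, so multi-byte characters split across chunk boundaries are
    handled and the file never has to fit in memory at once.
    """
    for enc in candidates:
        decoder = codecs.getincrementaldecoder(enc)()
        try:
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    decoder.decode(chunk)
            # Flush: raises if the file ends in the middle of a character
            decoder.decode(b"", final=True)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the file; try the next
        return enc
    return None
```

The result can then be passed along as `pd.read_csv('your_file.csv', encoding=pick_encoding('your_file.csv'))`. Keep in mind that `latin-1` maps every byte to some character, so it never fails; a successful decode only shows that the bytes are readable, not that the characters are the intended ones, so it is worth spot-checking a few rows that contain accented text.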
