UnicodeDecodeError tells you the position of the character that caused the error. How can I display that character?

Question

When opening/reading a file with something like

with open(<csv_file>) as f:
    df = pandas.read_csv(f)

you may get an error such as

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 1678

I know I can use a VS Code extension to locate the character at position 1678 in csv_file. But is there a way to do this in Python? Naively, something like:

>>> getCharInPosition(1678)
"The character in that position is 'x'"

Or, even better, get the line number:

>>> getLineNumOfCharInPosition(1678)
"The line number for the character in that position is 25"

I'm looking for a way to make the standard UnicodeDecodeError message more useful than just telling me a character position.

Answer 1

UnicodeError carries quite a bit of information in its attributes.

By catching the exception, you can use them to find the offending bytes:

try:
    df = pandas.read_csv(f)
except UnicodeError as e:
    offending = e.object[e.start:e.end]
    print("This file isn't encoded with", e.encoding)
    print("Illegal bytes:", repr(offending))
    raise

To determine the line number, you could do something like this (inside the except clause):

    seen_text = e.object[:e.start]
    line_no = seen_text.count(b'\n') + 1

...but I'm not sure whether e.object is always a (byte) string (that could cause extra trouble with huge files), so I don't know whether this always works.
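One way to sidestep the doubt about e.object is to read the file in binary mode and decode the bytes yourself, so that e.start is an offset into the data you hold. The following is only a sketch under that assumption: locate_decode_error is a hypothetical helper name, and it needs the whole file in memory, so it may not suit very large files.

```python
def locate_decode_error(data: bytes, encoding: str = "utf-8"):
    """Return None if data decodes cleanly, else details of the first bad byte.

    Hypothetical helper for illustration; offsets are byte offsets into data.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as e:
        # Start of the physical line containing the bad byte (0 if no newline before it).
        line_start = data.rfind(b"\n", 0, e.start) + 1
        return {
            "offset": e.start,                         # byte offset of the first bad byte
            "bytes": data[e.start:e.end],              # the offending byte(s)
            "line": data.count(b"\n", 0, e.start) + 1, # 1-based physical line number
            "column": e.start - line_start,            # 0-based byte column in that line
            "reason": e.reason,
        }
    return None


# Small in-memory stand-in for a file's contents.
sample = b"a,b\n1,2\n3,\x80\n"
print(locate_decode_error(sample))
```

For a real file, the bytes would come from open(path, "rb").read() instead of the in-memory sample.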

Also, in a CSV file, if some cells contain newlines, the number of newline characters may be greater than the number of logical rows.
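If the logical record number matters more than the physical line, one possible approach is to parse everything before the bad byte with the standard csv module and count the records. This is a rough sketch, not a tested recipe: csv_record_number is a hypothetical name, it assumes the default csv dialect, and it assumes you already have the file's bytes and the byte offset of the error (for example from e.start).

```python
import csv
import io


def csv_record_number(data: bytes, offset: int, encoding: str = "utf-8") -> int:
    """Return the 1-based CSV record (header included) that contains byte `offset`.

    Hypothetical helper for illustration; assumes the default csv dialect.
    """
    # Everything before the bad byte; errors="replace" keeps this from raising.
    prefix = data[:offset].decode(encoding, errors="replace")
    # Append a placeholder for the bad byte so the record holding it is
    # always counted, even when the byte is the first one of a new record.
    reader = csv.reader(io.StringIO(prefix + "x", newline=""))
    return sum(1 for _ in reader)


# The quoted cell spans two physical lines, so the bad byte sits on
# physical line 4 but in logical record 3.
sample = b'id,note\n1,"two\nlines"\n2,\x80\n'
bad_offset = sample.index(b"\x80")
print(csv_record_number(sample, bad_offset))
```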
