polars cannot open a csv file (containing Chinese characters): "invalid utf8 data" error


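I am trying to use polars for this task. Although pandas reads the file successfully, polars fails at the very first step: it cannot open my Excel file (which contains Chinese characters). The error says "invalid utf8 data in csv".

I read up on this online and tried several approaches (encoding and decoding the content), but they all failed.

How can I open the file linked below with polars without ignoring errors? Ignoring errors causes further problems. Many thanks.

Here is the csv file: test.csv (118 KB)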

pandas succeeds

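import pandas as pd

# Raw string so the backslashes in the Windows path are not treated as escapes
testfile = r'D:\PythonStudyItem\pythonProject\WorkProject\Downloads\test.csv'
df = pd.read_excel(testfile)  # note: read_excel, not read_csv
print(df)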

Result:

D:\ProgramFiles\Python310\python.exe D:\PythonStudyItem\pythonProject\WorkProject\test.py 
            计划行号      物料代码  ... 上级供应商代码 上级供应商名称
0   JH2205000296  C1533504  ...                
1   JH2205000376  C1535878  ...     NaN     NaN
2   JH2205000377  C1625893  ...     NaN     NaN
3   JH2205000378  C1653781  ...     NaN     NaN
4   JH2205000379  C1535880  ...     NaN     NaN
..           ...       ...  ...     ...     ...
94  JH2205033960  C1571447  ...     NaN     NaN
95  JH2205033961  C1571441  ...     NaN     NaN
96  JH2205033962  C1566737  ...     NaN     NaN
97  JH2205034005  C2278945  ...     NaN     NaN
98  JH2205034006  C1571445  ...     NaN     NaN

[99 rows x 56 columns]

Process finished with exit code 0

polars.read_csv() and polars.read_excel() fail

import polars as pl

# Raw string so the backslashes in the Windows path are not treated as escapes
testfile = r'D:\PythonStudyItem\pythonProject\WorkProject\Downloads\test.csv'
df = pl.read_csv(testfile)
print(df)

Result:

D:\ProgramFiles\Python310\python.exe D:\PythonStudyItem\pythonProject\WorkProject\test.py 
Traceback (most recent call last):
  File "D:\PythonStudyItem\pythonProject\WorkProject\test.py", line 8, in <module>
    df = pl.read_csv(testfile)
  File "D:\ProgramFiles\Python310\lib\site-packages\polars\utils.py", line 431, in wrapper
    return fn(*args, **kwargs)
  File "D:\ProgramFiles\Python310\lib\site-packages\polars\io.py", line 379, in read_csv
    df = DataFrame._read_csv(
  File "D:\ProgramFiles\Python310\lib\site-packages\polars\internals\dataframe\frame.py", line 768, in _read_csv
    self._df = PyDataFrame.read_csv(
exceptions.ComputeError: invalid utf8 data in csv

Process finished with exit code 1
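
The `invalid utf8 data in csv` message says only that some byte sequence in the input is not valid UTF-8; it does not say where. As a small diagnostic, and not something polars itself provides, a stdlib-only sketch like the following reports the offset of the first offending byte. The helper name `first_invalid_utf8_offset` is hypothetical.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset where strict UTF-8 decoding first fails, or None."""
    try:
        data.decode("utf-8")  # strict by default: raises on the first bad byte
    except UnicodeDecodeError as exc:
        return exc.start  # index of the first byte that could not be decoded
    return None
```

Called on the raw bytes of a file (for example `open(testfile, 'rb').read()`), an offset very close to 0 would suggest the whole file is not UTF-8 text, rather than one stray character somewhere in the middle.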

polars.read_csv() still fails after decoding and re-encoding

import polars as pl
import pandas as pd
import chardet
import codecs
import xlwt

testfile = r'D:\PythonStudyItem\pythonProject\WorkProject\Downloads\test.csv'
content = codecs.open(testfile, 'rb').read()  # raw bytes of the file
source_encoding = chardet.detect(content)['encoding']  # chardet's best guess, or None
print(source_encoding)

with open(testfile, 'r', encoding='gb18030') as fh:
    df = pl.read_csv(fh.read().encode('utf-8'))  # decode as gb18030, re-encode as UTF-8
    print(df)

Result:

D:\ProgramFiles\Python310\python.exe D:\PythonStudyItem\pythonProject\WorkProject\test.py 
None
Traceback (most recent call last):
  File "D:\PythonStudyItem\pythonProject\WorkProject\test.py", line 13, in <module>
    df = pl.read_csv(fh.read().encode('utf-8'))
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xb1 in position 5: illegal multibyte sequence

Process finished with exit code 1

I tried encoding='gb18030', 'big5', 'latin1', and others, but it still fails.

If I open the file with errors='ignore', the result is garbled text, which causes more problems.

with open(testfile, 'r', encoding='gb18030', errors='ignore') as fh:
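
For comparison, a strict transcode fails loudly on the first undecodable byte instead of silently dropping it, which is the safety that errors='ignore' gives up. The sketch below assumes the input really is text in a known encoding; `transcode_to_utf8` is a hypothetical helper name, and the sample row is only illustrative, built from the column names in the pandas output above.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode `raw` strictly with `source_encoding` and re-encode it as UTF-8.

    errors="strict" raises UnicodeDecodeError on the first bad byte,
    whereas errors="ignore" would drop it without any warning.
    """
    text = raw.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


# Illustrative sample only: a GB18030-encoded header and one data row.
sample = "计划行号,物料代码\nJH2205000296,C1533504\n".encode("gb18030")
utf8_bytes = transcode_to_utf8(sample, "gb18030")
print(utf8_bytes.decode("utf-8"))
```

If the decode succeeds, the resulting UTF-8 bytes can be handed to `pl.read_csv`, as in the attempt above. If it raises for every plausible encoding, the file is probably not text in any of those encodings.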

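Other things I tried:

1. Opened the csv file in Notepad++: the text is garbled and I could not find the right encoding. Failed.

2. Converted the file with codecs and chardet. Failed.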

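import chardet

def convert(filename, out_enc='utf-8-sig'):
    # Read the raw bytes and let chardet guess the source encoding
    with open(filename, 'rb') as fh:
        content = fh.read()
    source_encoding = chardet.detect(content)['encoding']
    print(source_encoding)
    # chardet returns None when it cannot identify an encoding; the file is left untouched then
    if source_encoding is not None and source_encoding != out_enc:
        content = content.decode(source_encoding).encode(out_enc)
        with open(filename, 'wb') as fh:
            fh.write(content)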

Result:

source_encoding is None.

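3. Opened the file with pandas, saved it to csv, then opened that with polars. Failed.

The file opens fine in Microsoft Office, WPS, and pandas.read_excel, so it seems to me that polars I/O is not very friendly when handling mixed-character data.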
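
One possibility consistent with the results above, though not confirmed here, is that test.csv is not a text file at all: pandas.read_excel opens it, chardet reports None, and the failing byte 0xb1 at position 5 matches the sixth byte of the OLE2 compound-file signature used by legacy .xls workbooks (D0 CF 11 E0 A1 B1 1A E1). A stdlib-only sketch for checking a file's leading bytes follows; the function names `sniff_container` and `sniff_file` are hypothetical.

```python
# Well-known file signatures ("magic bytes").
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls/.doc container
ZIP_MAGIC = b"PK\x03\x04"                       # .xlsx files are ZIP archives


def sniff_container(head: bytes) -> str:
    """Classify leading bytes as 'xls', 'xlsx', or 'text-or-unknown'."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    return "text-or-unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of `path` and classify them."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))
```

Running `print(sniff_file(testfile))` would show which case applies. If it reports 'xls' or 'xlsx', no choice of text encoding can make a CSV reader parse the file, and an Excel reader would be the tool to try instead.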

Again, I would like to open the file with polars without ignoring errors, because ignoring them causes further problems. Thanks for your help.

