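I am trying to use polars for this task, but the first step already fails: polars cannot open my Excel file (which contains Chinese characters), although pandas opens it successfully. The error says "invalid utf8 data in csv".
I looked online and tried several approaches (encoding and decoding), but they all failed.
How can I open the file linked below with polars without ignoring errors? Ignoring errors causes further problems. Thanks a lot.
Here is the csv file. (118kb) test.csv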
Success with pandas
import pandas as pd
testfile = r'D:\PythonStudyItem\pythonProject\WorkProject\Downloads\test.csv'
df = pd.read_excel(testfile)
print(df)
Result:
D:\ProgramFiles\Python310\python.exe D:\PythonStudyItem\pythonProject\WorkProject\test.py
计划行号 物料代码 ... 上级供应商代码 上级供应商名称
0 JH2205000296 C1533504 ...
1 JH2205000376 C1535878 ... NaN NaN
2 JH2205000377 C1625893 ... NaN NaN
3 JH2205000378 C1653781 ... NaN NaN
4 JH2205000379 C1535880 ... NaN NaN
.. ... ... ... ... ...
94 JH2205033960 C1571447 ... NaN NaN
95 JH2205033961 C1571441 ... NaN NaN
96 JH2205033962 C1566737 ... NaN NaN
97 JH2205034005 C2278945 ... NaN NaN
98 JH2205034006 C1571445 ... NaN NaN
[99 rows x 56 columns]
Process finished with exit code 0
Failure with polars.read_csv() and polars.read_excel()
import polars as pl
testfile = r'D:\PythonStudyItem\pythonProject\WorkProject\Downloads\test.csv'
df = pl.read_csv(testfile)
print(df)
Result:
D:\ProgramFiles\Python310\python.exe D:\PythonStudyItem\pythonProject\WorkProject\test.py
Traceback (most recent call last):
File "D:\PythonStudyItem\pythonProject\WorkProject\test.py", line 8, in <module>
df = pl.read_csv(testfile)
File "D:\ProgramFiles\Python310\lib\site-packages\polars\utils.py", line 431, in wrapper
return fn(*args, **kwargs)
File "D:\ProgramFiles\Python310\lib\site-packages\polars\io.py", line 379, in read_csv
df = DataFrame._read_csv(
File "D:\ProgramFiles\Python310\lib\site-packages\polars\internals\dataframe\frame.py", line 768, in _read_csv
self._df = PyDataFrame.read_csv(
exceptions.ComputeError: invalid utf8 data in csv
Process finished with exit code 1
polars.read_csv() still fails after decoding and re-encoding
import polars as pl
import pandas as pd
import chardet
import codecs
import xlwt
testfile = r'D:\PythonStudyItem\pythonProject\WorkProject\Downloads\test.csv'
content = codecs.open(testfile, 'rb').read()
source_encoding = chardet.detect(content)['encoding']
print(source_encoding)
with open(testfile, 'r', encoding='gb18030') as fh:
    df = pl.read_csv(fh.read().encode('utf-8'))
print(df)
Result:
D:\ProgramFiles\Python310\python.exe D:\PythonStudyItem\pythonProject\WorkProject\test.py
None
Traceback (most recent call last):
File "D:\PythonStudyItem\pythonProject\WorkProject\test.py", line 13, in <module>
df = pl.read_csv(fh.read().encode('utf-8'))
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xb1 in position 5: illegal multibyte sequence
Process finished with exit code 1
I tried encoding='gb18030', 'big5', 'latin1' and others, but they all failed.
If I open the file with errors='ignore', the text comes out garbled, which causes more problems later.
with open(testfile, 'r', encoding='gb18030', errors='ignore') as fh:
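To make the encoding trials above repeatable, here is a minimal sketch that tries each candidate codec with strict error handling and reports which ones decode the raw bytes. The function name `strict_decodings` and the sample bytes are hypothetical, introduced only for illustration. Note that 'latin-1' maps every byte to a character, so it decodes anything and its success says nothing about the real encoding.

```python
from typing import Iterable, List


def strict_decodings(data: bytes, candidates: Iterable[str]) -> List[str]:
    """Return the candidate codecs that decode `data` with errors='strict'."""
    ok = []
    for name in candidates:
        try:
            data.decode(name, errors="strict")
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this codec.
            # LookupError: this Python build does not know the codec name.
            continue
        ok.append(name)
    return ok


if __name__ == "__main__":
    # Stand-in for real text data (an assumption, not the poster's file).
    sample = "计划行号,物料代码\n".encode("gb18030")
    print(strict_decodings(sample, ["utf-8", "gb18030", "latin-1"]))
```

If only 'latin-1' survives for the real file, that would suggest the content is not text in any of the tried encodings, and a different explanation is worth checking.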
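Other things I tried:
1. Opened the csv file with Notepad++: the text is garbled and I could not find the right encoding. Failed.
2. Converted the file with codecs and chardet. Failed.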
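import codecs
import chardet
def convert(filename, out_enc='utf-8-sig'):
    # Read the raw bytes and let chardet guess the source encoding.
    content = codecs.open(filename, 'rb').read()
    source_encoding = chardet.detect(content)['encoding']
    print(source_encoding)
    # Re-encode in place only when an encoding was detected and it differs.
    if source_encoding is not None:
        if source_encoding != out_enc:
            content = content.decode(source_encoding).encode(out_enc)
            codecs.open(filename, 'wb').write(content)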
Result:
source_encoding is None.
3. Opened the file with pandas -> saved it to csv -> opened that with polars. Failed.
The file opens fine in Microsoft Office, WPS, and pandas.read_excel, so I suspect polars I/O is not very friendly with mixed-character data.
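One thing that may be worth checking, offered as a guess from the symptoms rather than a confirmed diagnosis: pandas.read_excel succeeds, chardet detects no encoding, and the gb18030 decode error is at byte 0xb1 in position 5. The legacy Excel (OLE2) container signature is D0 CF 11 E0 A1 B1 1A E1, whose byte at index 5 is 0xB1, so the file might be a binary workbook that merely carries a .csv extension. The sketch below inspects the leading bytes; the names `sniff_container` and `sniff_file` are hypothetical helpers, not part of polars or pandas.

```python
# Well-known container signatures (leading "magic" bytes).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls/.doc container
ZIP_MAGIC = b"PK\x03\x04"                       # .xlsx files are ZIP archives


def sniff_container(head: bytes) -> str:
    """Classify a file by its first bytes: 'ole2', 'zip', or 'unknown'."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    # Anything else, including plain-text CSV, is reported as unknown.
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of `path` and classify them."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))


if __name__ == "__main__":
    # The path here is a placeholder; substitute the real file location.
    print(sniff_file("test.csv"))
```

If this reports "ole2" or "zip", the file is a workbook rather than CSV text, and no text encoding will make pl.read_csv accept it. One possible route in that case is to keep reading it with pd.read_excel and convert the result with pl.from_pandas(df).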