Reading a CSV with a custom separator in Python Dask

Question · 2 votes · 2 answers

I am trying to create a DataFrame by reading a csv file whose fields are separated by '#####' (five hashes).

The code is:

import dask.dataframe as dd
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python')  # raw string, else '\t' in the path is read as a tab
res = df.compute()

The error is:

dask.async.ValueError:
Dask dataframe inspected the first 1,000 rows of your csv file to guess the
data types of your columns.  These first 1,000 rows led us to an incorrect
guess.

For example a column may have had integers in the first 1000
rows followed by a float or missing value in the 1,001-st row.

You will need to specify some dtype information explicitly using the
``dtype=`` keyword argument for the right column names and dtypes.

    df = dd.read_csv(..., dtype={'my-column': float})

Pandas has given us the following error when trying to parse the file:

  "The 'dtype' option is not supported with the 'python' engine"

Traceback
 ---------
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/dataframe/io.py", line 69, in _read_csv
raise ValueError(msg)

So how do I get around this?

If I follow the error message, I have to supply a dtype for every column, which is impractical when I have 100+ columns.

If I read the file without specifying the separator, everything works fine, but then '#####' appears throughout the data. Is there a way to strip it out after computing to a pandas DataFrame?

Any help is appreciated.

python csv separator dask
2 Answers

4 votes

Read the entire file with dtype='object', which means all columns will be interpreted as type object. This should read in correctly and get rid of the ##### in each row. From there you can turn it into a pandas DataFrame with the compute() method. Once the data is in a pandas DataFrame, you can use the pandas infer_objects method to update the types without much effort.

import dask.dataframe as dd
df = dd.read_csv(r'D:\temp.csv', sep='#####', dtype='object').compute()  # raw string path
res = df.infer_objects()
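As a minimal sketch of the dtype='object' approach (using plain pandas rather than dask, and a recent pandas which, unlike the version in the question's traceback, supports dtype with the python engine), the '#####' separator parses cleanly:

```python
import io
import pandas as pd

# Toy data using '#####' (five hashes) as the field separator.
raw = "a#####b\n1#####x\n2#####y\n"

# A multi-character separator forces the python engine; reading everything
# as object sidesteps per-column dtype guessing entirely.
df = pd.read_csv(io.StringIO(raw), sep='#####', engine='python', dtype='object')

print(df.shape)          # (2, 2)
print(df['a'].tolist())  # ['1', '2']
```

Note that infer_objects performs only a soft conversion: columns that still hold strings (as here, since the values came from a file) stay object, so converting numeric columns may additionally need something like pd.to_numeric on each column.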

0 votes

If you want to keep the entire file as a dask dataframe, I had success with a dataset containing a large number of columns simply by increasing the number of bytes sampled in read_csv.

For example:

import dask.dataframe as dd
df = dd.read_csv(r'D:\temp.csv', sep='#####', sample=1000000)  # increase sample to 1e6 bytes
df.head()

This can solve some type-inference problems, although unlike Benjamin Cohen's answer, you need to find the right value to choose for sample.
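The failure mode that a larger sample avoids can be sketched with plain pandas, using a hypothetical column x that looks integral until a float appears, mirroring the example in the error message:

```python
import io
import pandas as pd

# 1,000 integer rows followed by one float row, as in the error message.
rows = ["x"] + ["1"] * 1000 + ["2.5"]

# Inferring from only the head -- roughly what a too-small `sample` does:
head = pd.read_csv(io.StringIO("\n".join(rows[:1001])))
# Inferring from the whole file -- what a large enough `sample` sees:
full = pd.read_csv(io.StringIO("\n".join(rows)))

print(head['x'].dtype)  # int64  -- wrong guess from the first rows
print(full['x'].dtype)  # float64 -- correct once the float row is seen
```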
