I am trying to speed up read_csv() and subsequent DataFrame operations using pandas 2. Today I tried dask: its read_csv() is indeed very fast, but the DataFrame operations are slow. Why is that, and how can I speed up DataFrame operations when using dask?
Thanks.
Here is a speed comparison between pandas 2 and dask.
import timeit
import pandas as pd

timer_start = timeit.default_timer()
df_pyarrow = pd.read_csv('input\\'+filename, parse_dates=True, sep='\t', engine='pyarrow')
timer_end = timeit.default_timer()
timer_seconds = timer_end - timer_start
timer_minutes = timer_seconds / 60
print(f'Time taken to finish reading the file: {timer_seconds:.1f} seconds')
print(f'Time taken to finish reading the file: {timer_minutes:.2f} minutes')
Time taken to finish reading the file: 172.2 seconds
Time taken to finish reading the file: 2.87 minutes
import dask.dataframe as dd

timer_start = timeit.default_timer()
ddf = dd.read_csv('input\\'+filename, parse_dates=True, sep='\t', sample=1000000)
# ddf = dd.read_csv('input\\'+filename, parse_dates=True, sep='\t')
timer_end = timeit.default_timer()
timer_seconds = timer_end - timer_start
timer_minutes = timer_seconds / 60
print(f'Time taken to finish reading the file: {timer_seconds:.1f} seconds')
print(f'Time taken to finish reading the file: {timer_minutes:.2f} minutes')
Time taken to finish reading the file: 4.1 seconds
Time taken to finish reading the file: 0.07 minutes
Now, after getting the DataFrame, I just added a new column. With pandas 2 this takes almost 0 seconds, but with dask it takes much longer. Here is the comparison:
timer_start = timeit.default_timer()
df_pyarrow['new_col'] = 0
timer_end = timeit.default_timer()
timer_seconds = timer_end - timer_start
timer_minutes = timer_seconds / 60
print(f'Time taken to add the new column: {timer_seconds:.1f} seconds')
print(f'Time taken to add the new column: {timer_minutes:.2f} minutes')
Time taken to add the new column: 0.0 seconds
Time taken to add the new column: 0.00 minutes
With the dask DataFrame, adding the new column takes 6 seconds, which is even slower than its read_csv(). Why is that, and how can I speed up DataFrame operations when using dask? Thanks.
timer_start = timeit.default_timer()
ddf['new_col'] = 0
timer_end = timeit.default_timer()
timer_seconds = timer_end - timer_start
timer_minutes = timer_seconds / 60
print(f'Time taken to add the new column: {timer_seconds:.1f} seconds')
print(f'Time taken to add the new column: {timer_minutes:.2f} minutes')
Time taken to add the new column: 6.7 seconds
Time taken to add the new column: 0.11 minutes
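As a side note, the timing boilerplate repeated in each snippet above can be wrapped in a small reusable helper. Here is one possible sketch (the `timer` name and context-manager design are my own, not part of pandas or dask):

```python
import timeit
from contextlib import contextmanager

@contextmanager
def timer(label='operation'):
    # Context manager that prints the elapsed wall-clock time for a block.
    start = timeit.default_timer()
    yield
    elapsed = timeit.default_timer() - start
    print(f'Time taken for {label}: {elapsed:.1f} seconds ({elapsed / 60:.2f} minutes)')

# Usage: any block of work can now be timed with one line.
with timer('adding a column'):
    data = {'x': list(range(1000))}
    data['new_col'] = [0] * 1000
```

This keeps the measurement logic in one place, so all comparisons use the same clock and the same output format.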
I think you have missed an important detail about how dask works.
This does not read any CSV:
ddf = dd.read_csv('input\\'+filename, parse_dates=True, sep='\t', sample=1000000)
What it does is build a task graph.
You add items to this task graph by performing further operations, e.g. .groupby(), .join(), and so on.
In general, none of the tasks you add are executed until you call
df_in_memory = ddf.compute()
However, some operations implicitly trigger computation, and it appears that
ddf['new_col'] = 0
is one of them.
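The lazy pattern described above can be illustrated with a tiny plain-Python sketch. This is a simplified analogy of my own, not dask's actual implementation: operations only record tasks, and no work happens until compute() is called.

```python
# Minimal analogy of dask-style lazy evaluation (not dask's real API).
class LazyFrame:
    def __init__(self):
        self.tasks = []  # the "task graph": a list of recorded steps

    def add_column(self, name, value):
        # Record the work instead of performing it, then allow chaining.
        self.tasks.append(('add_column', name, value))
        return self

    def compute(self):
        # Only now is the recorded work actually executed.
        result = {}
        for op, name, value in self.tasks:
            if op == 'add_column':
                result[name] = value
        return result

lf = LazyFrame().add_column('new_col', 0)
print(len(lf.tasks))   # 1 -- the task is recorded, not executed
print(lf.compute())    # {'new_col': 0} -- the work happens here
```

Because recording a task is nearly free, timing only the "read" step of a lazy framework measures graph construction, not the actual I/O.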
For an apples-to-apples comparison, time all of the operations together. Something like this for pandas:
timer_start = timeit.default_timer()
df_pyarrow = (
    pd.read_csv('input\\'+filename, parse_dates=True, sep='\t', engine='pyarrow')
    .assign(newcol=0)
)
timer_end = timeit.default_timer()
timer_seconds = timer_end - timer_start
timer_minutes = timer_seconds / 60
print(f'Time taken to read the file and add the column: {timer_seconds:.1f} seconds')
print(f'Time taken to read the file and add the column: {timer_minutes:.2f} minutes')
versus this for dask:
import dask.dataframe as dd

timer_start = timeit.default_timer()
ddf = (
    dd.read_csv('input\\'+filename, parse_dates=True, sep='\t', sample=1000000)
    .assign(newcol=0)
).compute()
timer_end = timeit.default_timer()
timer_seconds = timer_end - timer_start
timer_minutes = timer_seconds / 60
print(f'Time taken to read the file and add the column: {timer_seconds:.1f} seconds')
print(f'Time taken to read the file and add the column: {timer_minutes:.2f} minutes')
Note that dask's task-graph approach does carry some overhead. If your production data fits in memory, I would not expect a large performance gain from dask.