dask read_csv is fast, but dataframe operations are slow


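I am trying to speed up read_csv(), and then dataframe operations, using pandas 2. I tried dask today, and read_csv() is indeed very fast, but dataframe operations are slow. Why is that? How can I speed up dataframe operations when using dask?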

Thanks

Here is a speed comparison between pandas 2 and dask:

  1. read_csv() with pandas 2: 172 seconds
timer_start=timeit.default_timer()
df_pyarrow=pd.read_csv('input\\'+filename,parse_dates=True,sep='\t',engine='pyarrow')  
timer_end=timeit.default_timer()

timer_minutes=(timer_end-timer_start)/60
timer_seconds=(timer_end-timer_start)

print(f'Time took to Finish reading file is {timer_seconds:.1f} seconds')
print(f'Time took to Finish reading file is {timer_minutes:.2f} minutes')

Time took to Finish reading file is 172.2 seconds
Time took to Finish reading file is 2.87 minutes
  2. read_csv() with dask: only 4 seconds
import dask.dataframe as dd
timer_start=timeit.default_timer()
ddf=dd.read_csv('input\\'+filename,parse_dates=True,sep='\t',sample=1000000)  
# ddf=dd.read_csv('input\\'+filename,parse_dates=True,sep='\t')  
timer_end=timeit.default_timer()

timer_minutes=(timer_end-timer_start)/60
timer_seconds=(timer_end-timer_start)

print(f'Time took to Finish reading file is {timer_seconds:.1f} seconds')
print(f'Time took to Finish reading file is {timer_minutes:.2f} minutes')
Time took to Finish reading file is 4.1 seconds
Time took to Finish reading file is 0.07 minutes

Now, with the dataframe loaded, I simply add a new column. With pandas 2 this takes almost 0 seconds, but with dask it takes much longer. Here is the comparison:

timer_start=timeit.default_timer()
df_pyarrow['new_col']=0
timer_end=timeit.default_timer()
timer_minutes=(timer_end-timer_start)/60
timer_seconds=(timer_end-timer_start)
print(f'Time took to Finish reading file is {timer_seconds:.1f} seconds')
print(f'Time took to Finish reading file is {timer_minutes:.2f} minutes')

Time took to Finish reading file is 0.0 seconds
Time took to Finish reading file is 0.00 minutes

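With the dask dataframe, adding the new column takes more than 6 seconds, which is even slower than read_csv(). Why is that? How can I speed up dataframe operations when using dask? Thanks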

timer_start=timeit.default_timer()
ddf['new_col']=0
timer_end=timeit.default_timer()
timer_minutes=(timer_end-timer_start)/60
timer_seconds=(timer_end-timer_start)
print(f'Time took to Finish reading file is {timer_seconds:.1f} seconds')
print(f'Time took to Finish reading file is {timer_minutes:.2f} minutes')

Time took to Finish reading file is 6.7 seconds
Time took to Finish reading file is 0.11 minutes
dask dask-dataframe
1 Answer

I think you are missing an important detail about how dask works.

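This does not load the CSV (it only reads a small sample from the start of the file to infer column names and dtypes):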

ddf = dd.read_csv('input\\'+filename,parse_dates=True,sep='\t',sample=1000000)

What it does is build a task graph.

You can add items to this task graph by chaining further operations, for example:

.groupby
.join

In general, the tasks you add are not executed until you call

df_in_memory = ddf.compute()

However, some operations trigger a computation implicitly. It appears that

ddf['new_col'] = 0

...is one of them.

To make a like-for-like comparison, time all of the operations together, like this:

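import timeit
import pandas as pd

# pandas: read the file and add the column inside one timed region
timer_start=timeit.default_timer()
df_pyarrow = ( 
    pd.read_csv('input\\'+filename,parse_dates=True,sep='\t',engine='pyarrow')
        .assign(newcol=0)
)  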
timer_end=timeit.default_timer()
timer_minutes=(timer_end-timer_start)/60
timer_seconds=(timer_end-timer_start)

print(f'Time took to Finish reading file is {timer_seconds:.1f} seconds')
print(f'Time took to Finish reading file is {timer_minutes:.2f} minutes')

import dask.dataframe as dd

timer_start=timeit.default_timer()
# compute() executes the graph; the result is an in-memory pandas DataFrame
ddf = (
    dd.read_csv('input\\'+filename,parse_dates=True,sep='\t',sample=1000000)  
      .assign(newcol=0)
).compute()
timer_end=timeit.default_timer()

timer_minutes=(timer_end-timer_start)/60
timer_seconds=(timer_end-timer_start)

print(f'Time took to Finish reading file is {timer_seconds:.1f} seconds')
print(f'Time took to Finish reading file is {timer_minutes:.2f} minutes')
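
The timing boilerplate repeated above could be wrapped in a small helper so that each pipeline is timed the same way. This is only a sketch using the standard library; the name `timed` and the `results` dictionary are my own, not part of pandas or dask.

```python
import timeit
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = timeit.default_timer()
    try:
        yield
    finally:
        seconds = timeit.default_timer() - start
        if results is not None:
            results[label] = seconds
        print(f"{label}: {seconds:.1f} seconds ({seconds / 60:.2f} minutes)")


# Stand-in workload; replace the body with the full pandas or dask pipeline.
timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```

For the comparison here, put the whole read + assign (+ compute for dask) chain inside one `with timed(...)` block per library.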

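Note that dask's task-graph-based approach does carry some overhead. If your production data fits in memory, I would not expect much of a performance gain.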
