Dask: is there an equivalent of numpy.select?

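I am using Dask to load an 11-million-row CSV into a dataframe and perform calculations on it. I have reached the point where I need conditional logic: if this, then that, else something else.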

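For example, if I were using pandas I could do the following, which uses numpy's select with a list of conditions and a list of results. The statement takes about 35 seconds to run, which is not bad, but not great: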

df["AndHeathSolRadFact"] = np.select(
    [
        df['Month'].between(8, 12),
        # The comparison needs its own parentheses: & binds tighter than >
        df['Month'].between(1, 2) & (df['CloudCover'] > 30),
    ],              # list of CONDITIONS
    [1, 1],         # list of RESULTS (must match the conditions in length)
    default=0)      # DEFAULT if no condition matches
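
A note on the second condition: in Python, & binds more tightly than >, so a comparison combined with & needs its own parentheses. The sketch below uses made-up values (not the real dataset) to show how the two spellings are grouped, and then the parenthesised form on a small pandas frame:

```python
import pandas as pd

# Plain-Python illustration of the grouping (values are invented).
in_range, cloud = True, 40
unparenthesised = in_range & cloud > 30    # parsed as (in_range & cloud) > 30
parenthesised = in_range & (cloud > 30)    # the intended meaning

# The same condition on a tiny, hypothetical dataframe.
df = pd.DataFrame({"Month": [1, 2, 9], "CloudCover": [40.0, 10.0, 50.0]})
cond = df["Month"].between(1, 2) & (df["CloudCover"] > 30)
print(unparenthesised, parenthesised, cond.tolist())
```

With these values the unparenthesised form evaluates (True & 40) first, which is 0, and only then compares it with 30.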

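What I would like to do is perform this natively on a dask dataframe, without first converting my dask dataframe to a pandas dataframe and back again. That would let me:
- use multithreading
- use dataframes larger than available memory
- potentially get results faster.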

Sample CSV

Location,Date,Temperature,RH,WindDir,WindSpeed,DroughtFactor,Curing,CloudCover
1075,2019-20-09 04:00,6.8,99.3,143.9,5.6,10.0,93.0,1.0
1075,2019-20-09 05:00,6.4,100.0,93.6,7.2,10.0,93.0,1.0
1075,2019-20-09 06:00,6.7,99.3,130.3,6.9,10.0,93.0,1.0
1075,2019-20-09 07:00,8.6,95.4,68.5,6.3,10.0,93.0,1.0
1075,2019-20-09 08:00,12.2,76.0,86.4,6.1,10.0,93.0,1.0

Complete code for a minimal working example

import dask.dataframe as dd  # Dask dataframes implement the pandas API
import dask.multiprocessing
import dask.threaded
import pandas as pd
import numpy as np
from timeit import default_timer as timer

start = timer()
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')

# NOTE: the original snippet used `df` without defining it. The line below is an
# assumption: it materialises the dask dataframe as pandas, which is where the
# np.select step shown earlier would run.
df = ddf.compute()

# Convert back to a Dask dataframe because we want that juicy parallelism
ddf2 = dd.from_pandas(df, npartitions=4)
del df

print(ddf2.head())
# print(ddf.tail())
end = timer()
print(end - start)

# Clean up remaining dataframes
del ddf2
python pandas numpy dask
2 Answers
0 votes
It sounds like you are looking for dd.Series.where.

0 votes
So, this is the best-performing answer I could come up with:

# Create a helper column that stores the value we want to set the column to later.
ddf['Helper'] = 1

# Create the column where we will be setting values, and give it a default value.
ddf['AndHeathSolRadFact'] = 0

# Break the logic out into separate where clauses. Rather than looping, we select
# the rows where the conditions are met and then set the value we want. We are
# required to use the helper column value because we cannot set values directly,
# but we can match from another column.

# First, a very simple clause. If Temperature is greater than or equal to 8, make
# AndHeathSolRadFact equal to the value in Helper.
# Note that at the end, after the comma, we preserve the existing cell value if
# the condition is not met.
ddf['AndHeathSolRadFact'] = (ddf.Helper).where(ddf.Temperature >= 8, ddf.AndHeathSolRadFact)

# A more complex example.
# This is the same as the above, but demonstrates how to use a compound select
# statement where we evaluate multiple conditions and then set the value.
ddf['AndHeathSolRadFact'] = (ddf.Helper).where(((ddf.Temperature == 6.8) & (ddf.RH == 99.3)), ddf.AndHeathSolRadFact)

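I am new to this, but I believe this approach counts as vectorised. It makes full use of arrays and evaluates very quickly. Adding the new column, filling it with 0, evaluating both select statements and replacing the values in the target rows added only 0.2 s to the processing time on an 11-million-row dataset with npartitions = 4. Previously, a similar approach in pandas took about 45 seconds.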

The only thing left to do is remove the helper column once finished. At the moment I am not sure how to do that.
