Dask groupby gives the wrong result for each column


Here I am simulating, with dummy data, what I actually want to do. The steps I need to perform:

  1. Apply some transformation to each column separately.
  2. Run a groupby to aggregate, for each column, some metric against a target column.

The code for my simulation:

import dask.dataframe as dd
from dask.distributed import Client, as_completed, LocalCluster

cluster = LocalCluster(processes=False)

client = Client(cluster, asynchronous=True)

csv_loc = '/Users/apple/Downloads/iris.data'
df = dd.read_csv(csv_loc)  # Of course, you would need AWS creds here; omitting them. Assume you can read from S3 or otherwise.
client.persist(df)
cols = ['sepal_length', 'sepal_width' ,'petal_length' ,'petal_width', 'species']

# This is needed because I am doing some custom operation on actual data
for c in cols:
    if c != 'species':
        df[c] = df[c].map(lambda x: x*10)
client.persist(df) # Is this the trouble?

def agg_bivars(col_name):
    agg_df = df.groupby('species')[col_name].sum().compute()
    return {col_name : agg_df}

agg_futures = client.map(agg_bivars, ['sepal_length', 'sepal_width' ,'petal_length' ,'petal_width'])

for batch in as_completed(agg_futures, with_results=True).batches():
    for future, result in batch:
        print('result: {}'.format(result))


client.restart()
client.close()
cluster.close()

You can download the data from this link. It is a very standard dataset, popular online.

The result I get: the same groupby result for every column.

Expected result: a different groupby result for each column.

Result:

result: {'sepal_width': species
Iris-setosa        2503.0
Iris-versicolor    2968.0
Iris-virginica     3294.0
Name: sepal_length, dtype: float64}
result: {'sepal_length': species
Iris-setosa        2503.0
Iris-versicolor    2968.0
Iris-virginica     3294.0
Name: sepal_length, dtype: float64}
result: {'petal_width': species
Iris-setosa        2503.0
Iris-versicolor    2968.0
Iris-virginica     3294.0
Name: sepal_length, dtype: float64}
result: {'petal_length': species
Iris-setosa        2503.0
Iris-versicolor    2968.0
Iris-virginica     3294.0
Name: sepal_length, dtype: float64}

Process finished with exit code 0

If I just do a groupby on df, it works fine. The problem here, however, is that I have applied some transformations to the whole df before doing the groupby on each column separately. Note that I call client.persist(df) twice. I do it the second time because, whatever new transformations I have made, I want them persisted so that I can query quickly.
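A side note on those persist calls: client.persist returns a new, futures-backed collection rather than mutating its argument, so as written df still points at the original lazy graph. A minimal sketch of the reassignment pattern (same names as above):

df = client.persist(df)  # rebind df so later operations run against the persisted copy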

python pandas dask dask-distributed
2 Answers
0 votes

The problem is the compute() call inside agg_bivars.

Try the following code:

def agg_bivars(col_name):
    agg_df = df.groupby('species')[col_name].sum()  #.compute()
    return {col_name : agg_df}

agg_futures = client.map(agg_bivars, ['sepal_length', 'sepal_width' ,'petal_length' ,'petal_width'])

for batch in as_completed(futures=agg_futures, with_results=True).batches():    
    for future, result in batch:        
        print(f'result: {list(result.values())[0].compute()}')
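The key change is that agg_bivars now returns a lazy aggregation, and compute() is deferred to the driver instead of being called inside a task submitted via client.map. Calling compute() from within a worker task is not supported without extra machinery such as worker_client, and that nested computation is likely why every column came back with the sepal_length result.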

Result:

result: species
setosa        2503.0
versicolor    2968.0
virginica     3294.0
Name: sepal_length, dtype: float64
result: species
setosa        1709.0
versicolor    1385.0
virginica     1487.0
Name: sepal_width, dtype: float64
result: species
setosa         732.0
versicolor    2130.0
virginica     2776.0
Name: petal_length, dtype: float64
result: species
setosa         122.0
versicolor     663.0
virginica     1013.0
Name: petal_width, dtype: float64

0 votes

It seems to me that you are overcomplicating things.
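For what it is worth, a minimal sketch of the simpler route, with no client.map and no futures (column names assumed to match the question; iris.data has no header row, hence the names= argument):

import dask.dataframe as dd

csv_loc = '/Users/apple/Downloads/iris.data'
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
df = dd.read_csv(csv_loc, names=cols)

# Apply the per-column transformation up front
for c in cols:
    if c != 'species':
        df[c] = df[c] * 10

# One lazy groupby covering all four columns, materialized in a single compute
numeric_cols = [c for c in cols if c != 'species']
print(df.groupby('species')[numeric_cols].sum().compute())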
