Pandas groupby mean: into a dataframe?


Say my data looks like this:

date,name,id,dept,sale1,sale2,sale3,total_sale
1/1/17,John,50,Sales,50.0,60.0,70.0,180.0
1/1/17,Mike,21,Engg,43.0,55.0,2.0,100.0
1/1/17,Jane,99,Tech,90.0,80.0,70.0,240.0
1/2/17,John,50,Sales,60.0,70.0,80.0,210.0
1/2/17,Mike,21,Engg,53.0,65.0,12.0,130.0
1/2/17,Jane,99,Tech,100.0,90.0,80.0,270.0
1/3/17,John,50,Sales,40.0,50.0,60.0,150.0
1/3/17,Mike,21,Engg,53.0,55.0,12.0,120.0
1/3/17,Jane,99,Tech,80.0,70.0,60.0,210.0

I want a new column, average, holding the mean of total_sale for each (name, id, dept) tuple.

I tried

df.groupby(['name', 'id', 'dept'])['total_sale'].mean()

and this does return a series of means:

name  id  dept 
Jane  99  Tech     240.000000
John  50  Sales    180.000000
Mike  21  Engg     116.666667
Name: total_sale, dtype: float64

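But how would I reference the data? The series is one-dimensional, with shape (3,). Ideally I would like it put back into a dataframe with proper columns, so that I can reference the values by name/id/dept.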

python python-3.x pandas dataframe weighted-average
4 answers
22
votes

If you call .reset_index() on the series you already have, it gives you the dataframe you want (each level of the index is converted into a column):

df.groupby(['name', 'id', 'dept'])['total_sale'].mean().reset_index()
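
As a minimal sketch of what this produces, the snippet below rebuilds a trimmed-down copy of the question's data (only the columns the grouping needs) and then looks up one group's mean with ordinary column filtering. The variable names `means` and `jane` are illustrative and not part of the original answer.

```python
import pandas as pd

# Trimmed-down version of the question's data: grouping keys plus total_sale.
df = pd.DataFrame({
    'name': ['John', 'Mike', 'Jane'] * 3,
    'id': [50, 21, 99] * 3,
    'dept': ['Sales', 'Engg', 'Tech'] * 3,
    'total_sale': [180.0, 100.0, 240.0, 210.0, 130.0, 270.0, 150.0, 120.0, 210.0],
})

# One row per (name, id, dept) group; the index levels become plain columns.
means = df.groupby(['name', 'id', 'dept'])['total_sale'].mean().reset_index()
print(means.columns.tolist())  # ['name', 'id', 'dept', 'total_sale']

# Reference a group's mean with a boolean mask on the key columns.
mask = (means['name'] == 'Jane') & (means['dept'] == 'Tech')
jane = means.loc[mask, 'total_sale'].iloc[0]
print(jane)  # 240.0
```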

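EDIT: in response to the OP's comment, adding this column back to the original dataframe is a little trickier. The grouped result has one row per group, which is fewer rows than the original dataframe, so you can't simply assign it as a new column. However, if you give both the same index, pandas aligns on that index and fills in the values correctly for you. Try this: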

import pandas as pd

cols = ['date','name','id','dept','sale1','sale2','sale3','total_sale']
data = [
['1/1/17', 'John', 50, 'Sales', 50.0, 60.0, 70.0, 180.0],
['1/1/17', 'Mike', 21, 'Engg', 43.0, 55.0, 2.0, 100.0],
['1/1/17', 'Jane', 99, 'Tech', 90.0, 80.0, 70.0, 240.0],
['1/2/17', 'John', 50, 'Sales', 60.0, 70.0, 80.0, 210.0],
['1/2/17', 'Mike', 21, 'Engg', 53.0, 65.0, 12.0, 130.0],
['1/2/17', 'Jane', 99, 'Tech', 100.0, 90.0, 80.0, 270.0],
['1/3/17', 'John', 50, 'Sales', 40.0, 50.0, 60.0, 150.0],
['1/3/17', 'Mike', 21, 'Engg', 53.0, 55.0, 12.0, 120.0],
['1/3/17', 'Jane', 99, 'Tech', 80.0, 70.0, 60.0, 210.0]
]
df = pd.DataFrame(data, columns=cols)

mean_col = df.groupby(['name', 'id', 'dept'])['total_sale'].mean() # don't reset the index!
df = df.set_index(['name', 'id', 'dept']) # make the same index here
df['mean_col'] = mean_col
df = df.reset_index() # to take the hierarchical index off again
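
A shorter route to the same column, offered here as an alternative to the index-alignment approach above rather than as part of the original answer, is `groupby(...).transform('mean')`, which returns a result aligned row for row with the original dataframe. The sketch below uses a trimmed-down copy of the data and reuses the `mean_col` column name.

```python
import pandas as pd

# Trimmed-down version of the question's data: grouping keys plus total_sale.
df = pd.DataFrame({
    'name': ['John', 'Mike', 'Jane'] * 3,
    'id': [50, 21, 99] * 3,
    'dept': ['Sales', 'Engg', 'Tech'] * 3,
    'total_sale': [180.0, 100.0, 240.0, 210.0, 130.0, 270.0, 150.0, 120.0, 210.0],
})

# transform('mean') broadcasts each group's mean back to every row of that group,
# so the result has the same length and index as df and can be assigned directly.
df['mean_col'] = df.groupby(['name', 'id', 'dept'])['total_sale'].transform('mean')
print(df)
```

This avoids setting and resetting the index, at the cost of hiding the alignment step that the original answer makes explicit.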

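4
votes

You are very close. You just need a second pair of brackets around 'total_sale', giving [['total_sale']], to tell pandas to return a dataframe rather than a series: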

df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()

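If you want all the columns, with the group keys as regular columns rather than an index:

# numeric_only=True skips the non-numeric date column, which recent pandas versions refuse to average
df.groupby(['name', 'id', 'dept'], as_index=False).mean(numeric_only=True)[['name', 'id', 'dept', 'total_sale']]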
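
To make the difference between these selections concrete, here is a small sketch on a trimmed-down copy of the data. The names `keys`, `as_series`, `as_frame` and `flat` are illustrative only.

```python
import pandas as pd

# Trimmed-down version of the question's data: grouping keys plus total_sale.
df = pd.DataFrame({
    'name': ['John', 'Mike', 'Jane'] * 3,
    'id': [50, 21, 99] * 3,
    'dept': ['Sales', 'Engg', 'Tech'] * 3,
    'total_sale': [180.0, 100.0, 240.0, 210.0, 130.0, 270.0, 150.0, 120.0, 210.0],
})
keys = ['name', 'id', 'dept']

# Single brackets: a Series indexed by (name, id, dept).
as_series = df.groupby(keys)['total_sale'].mean()
# Double brackets: a one-column DataFrame with the same MultiIndex.
as_frame = df.groupby(keys)[['total_sale']].mean()
# as_index=False: the keys stay as ordinary columns.
flat = df.groupby(keys, as_index=False)['total_sale'].mean()

print(type(as_series).__name__, type(as_frame).__name__)  # Series DataFrame
print(flat.columns.tolist())  # ['name', 'id', 'dept', 'total_sale']
```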

3
votes

Add to_frame:

df.groupby(['name', 'id', 'dept'])['total_sale'].mean().to_frame()
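
Note that `to_frame()` keeps name, id and dept as index levels rather than columns. A minimal sketch of referencing values in that form, on a trimmed-down copy of the data (the name `mean_frame` is illustrative):

```python
import pandas as pd

# Trimmed-down version of the question's data: grouping keys plus total_sale.
df = pd.DataFrame({
    'name': ['John', 'Mike', 'Jane'] * 3,
    'id': [50, 21, 99] * 3,
    'dept': ['Sales', 'Engg', 'Tech'] * 3,
    'total_sale': [180.0, 100.0, 240.0, 210.0, 130.0, 270.0, 150.0, 120.0, 210.0],
})

mean_frame = df.groupby(['name', 'id', 'dept'])['total_sale'].mean().to_frame()

# The group keys are still index levels, so look a group up with a full tuple.
print(mean_frame.loc[('John', 50, 'Sales'), 'total_sale'])  # 180.0

# Or select every row matching one level's value with xs.
print(mean_frame.xs('Tech', level='dept'))
```

If plain columns are preferred, calling `.reset_index()` on `mean_frame` moves the three levels back into columns.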

1
vote

The answer is two lines of code:

The first line creates a dataframe with a hierarchical (multi-level) index.

df_mean = df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()

The second line converts it into a dataframe with four columns ('name', 'id', 'dept', 'total_sale'):

df_mean = df_mean.reset_index()