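I want to group by a condition and then feed the result back into the original DataFrame. Here the feature 'COL_COND' can be 1 or 0, and the feature to aggregate is 'AMOUNT'.
Below, this is done by running two groupby operations, storing the results in pandas Series, and then merging them back into the original DataFrame.
Can this be done without the merge?
If memory usage is a concern, which approach makes the most sense?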
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6],
                   'COL_COND': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
                   'AMOUNT': [5, 80, 100, 50, 100, 100, 20, 1, 51, 11, 12]})

# Per-ID sums of AMOUNT, computed separately for COL_COND == 1 and COL_COND == 0
series1 = df[df.COL_COND == 1].groupby('ID')['AMOUNT'].sum().rename('sum_amount_1')
series0 = df[df.COL_COND == 0].groupby('ID')['AMOUNT'].sum().rename('sum_amount_0')

# Merge both per-ID sums back onto the original rows
df = (df.merge(series1.to_frame().reset_index(), on='ID', how='left')
        .merge(series0.to_frame().reset_index(), on='ID', how='left'))
print(df)
    ID  COL_COND  AMOUNT  sum_amount_1  sum_amount_0
0    1         1       5     5.0000000            80
1    1         0      80     5.0000000            80
2    2         1     100   100.0000000            50
3    2         0      50   100.0000000            50
4    3         1     100   120.0000000           100
5    3         0     100   120.0000000           100
6    3         1      20   120.0000000           100
7    4         0       1           NaN             1
8    5         1      51    51.0000000            11
9    5         0      11    51.0000000            11
10   6         0      12           NaN            12
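Use DataFrame.assign to create helper columns, so that a single groupby with GroupBy.transform is enough. If you need missing values for categories that do not exist within a group, use a lambda function with sum and min_count=1: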
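# Helper columns keep AMOUNT only where COL_COND matches and are NaN elsewhere;
# min_count=1 makes a group with no matching rows sum to NaN instead of 0.
out = df.join(df.assign(sum_amount_1=df['AMOUNT'].where(df['COL_COND'].eq(1)),
                        sum_amount_0=df['AMOUNT'].where(df['COL_COND'].eq(0)))
                .groupby('ID')[['sum_amount_1', 'sum_amount_0']]
                .transform(lambda x: x.sum(min_count=1)))
print(out)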
    ID  COL_COND  AMOUNT  sum_amount_1  sum_amount_0
0    1         1       5           5.0          80.0
1    1         0      80           5.0          80.0
2    2         1     100         100.0          50.0
3    2         0      50         100.0          50.0
4    3         1     100         120.0         100.0
5    3         0     100         120.0         100.0
6    3         1      20         120.0         100.0
7    4         0       1           NaN           1.0
8    5         1      51          51.0          11.0
9    5         0      11          51.0          11.0
10   6         0      12           NaN          12.0
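
To see why min_count=1 matters here, the following is a minimal sketch on a trimmed-down frame containing only IDs 3 and 4 from the example. The names masked, grouped, default_sum and strict_sum are illustrative and do not appear in the code above. It compares the default sum with sum(min_count=1) on a helper column built the same way with where:

```python
import pandas as pd

# Trimmed-down version of the example data: ID 4 has no row with COL_COND == 1.
df = pd.DataFrame({'ID': [3, 3, 3, 4],
                   'COL_COND': [1, 0, 1, 0],
                   'AMOUNT': [100, 100, 20, 1]})

# Keep AMOUNT where COL_COND == 1 and put NaN everywhere else.
masked = df['AMOUNT'].where(df['COL_COND'].eq(1))

grouped = masked.groupby(df['ID'])
default_sum = grouped.sum()            # a group with only NaN sums to 0.0
strict_sum = grouped.sum(min_count=1)  # a group with only NaN stays NaN

print(default_sum)
print(strict_sum)
```

With the default min_count=0, ID 4 would be reported as 0.0, which cannot be told apart from a real total of zero. With min_count=1 it stays NaN, which matches the result of the left merge in the question.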