import pandas as pd

df = pd.DataFrame(
    {
        "date": [pd.Timestamp("2022-01-01"), pd.Timestamp("2022-01-01"), pd.Timestamp("2022-01-01"),
                 pd.Timestamp("2022-01-03"), pd.Timestamp("2022-01-03"), pd.Timestamp("2022-01-03"),
                 pd.Timestamp("2022-01-05")],
        "numbers": [1, 2, 4, 4, 11, 7, 5],
        "grouper": [1, 0, 1, 0, 1, 0, 0],
    }
)
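Given the df above, how can I compute, for each row, a rolling mean of the numbers values from the days before that row's date? For example, a rolling mean over the previous 3 days, grouped by ["grouper", "date"].
I know I can do something like the following, but it is not close to a solution. I was hoping to build the solution on top of it: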
df["av"] = df.shift(1).rolling(window=3).mean()
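But this does not adjust dynamically to the dates, so it ends up including the current day.
My expected output for the new av column, using a 3-day window grouped by the two columns of the sample df, is: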
        date  numbers  grouper   av
0 2022-01-01        1        1  NaN
1 2022-01-01        2        0  NaN
2 2022-01-01        4        1  NaN
3 2022-01-03        4        0  2.0
4 2022-01-03       11        1  2.5
5 2022-01-03        7        0  2.0
6 2022-01-05        5        0  5.5
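The expected values above can be pinned down with a brute-force check. This is only a reference sketch, not an efficient solution: the helper name `prior_mean` and its `days` parameter are made up for illustration. For each row it averages the numbers of the same grouper whose date falls in the 3 days strictly before that row's date.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01"] * 3 + ["2022-01-03"] * 3 + ["2022-01-05"]),
    "numbers": [1, 2, 4, 4, 11, 7, 5],
    "grouper": [1, 0, 1, 0, 1, 0, 0],
})

def prior_mean(frame, date, grouper, days=3):
    """Hypothetical helper: mean of `numbers` for `grouper` in [date - days, date)."""
    start = date - pd.Timedelta(days=days)
    mask = (
        (frame["grouper"] == grouper)
        & (frame["date"] >= start)
        & (frame["date"] < date)  # strictly before: the current day is excluded
    )
    # The mean of an empty selection is NaN, which matches the first three rows.
    return frame.loc[mask, "numbers"].mean()

expected = [prior_mean(df, d, g) for d, g in zip(df["date"], df["grouper"])]
print(expected)
```

It is quadratic in the number of rows, so it is mainly useful for validating a faster approach on small samples.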
By definition, the mean you need is sum / count.
df1 = (df.groupby(['date', 'grouper'])['numbers']
         .agg(['sum', 'size'])
         .unstack()
         .asfreq('D', fill_value=0)
         .rolling(window=3, min_periods=1)
         .sum()
         .shift()
         .stack()
      )
df = df.join(df1['sum'].div(df1['size']).rename('aw'), on=['date', 'grouper'])
print(df)
        date  numbers  grouper   aw
0 2022-01-01        1        1  NaN
1 2022-01-01        2        0  NaN
2 2022-01-01        4        1  NaN
3 2022-01-03        4        0  2.0
4 2022-01-03       11        1  2.5
5 2022-01-03        7        0  2.0
6 2022-01-05        5        0  5.5
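A small sketch of why the sums and the counts are rolled separately instead of averaging per-day means: when days hold different numbers of rows, the mean of daily means is not the mean of the pooled values. The figures are the grouper 0 rows for 2022-01-01 and 2022-01-03 from the sample; the variable names are only illustrative.

```python
import pandas as pd

# Per-day sum and row count for grouper 0: one value (2) on the first day,
# two values (4 and 7) on the second.
daily = pd.DataFrame(
    {"sum": [2, 11], "size": [1, 2]},
    index=pd.to_datetime(["2022-01-01", "2022-01-03"]),
)

# Pooled mean over both days: total sum divided by total count.
pooled_mean = daily["sum"].sum() / daily["size"].sum()

# Mean of the two daily means weights each day equally, which is a different quantity.
mean_of_means = (daily["sum"] / daily["size"]).mean()

print(pooled_mean, mean_of_means)
```

The pooled value is 13 / 3, roughly 4.33, while the mean of means is 3.75.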
Explanation:
First aggregate with sum, and count the rows with GroupBy.size:
print(df.groupby(['date', 'grouper'])['numbers']
        .agg(['sum', 'size'])
     )
                    sum  size
date       grouper
2022-01-01 0          2     1
           1          5     2
2022-01-03 0         11     2
           1         11     1
2022-01-05 0          5     1
Reshape with DataFrame.unstack so that each grouper becomes a column:
print(df.groupby(['date', 'grouper'])['numbers']
        .agg(['sum', 'size'])
        .unstack()
     )
             sum        size
grouper        0     1     0    1
date
2022-01-01   2.0   5.0   1.0  2.0
2022-01-03  11.0  11.0   2.0  1.0
2022-01-05   5.0   NaN   1.0  NaN
Add the missing consecutive dates with DataFrame.asfreq:
print(df.groupby(['date', 'grouper'])['numbers']
        .agg(['sum', 'size'])
        .unstack()
        .asfreq('D', fill_value=0)
     )
             sum        size
grouper        0     1     0    1
date
2022-01-01   2.0   5.0   1.0  2.0
2022-01-02   0.0   0.0   0.0  0.0
2022-01-03  11.0  11.0   2.0  1.0
2022-01-04   0.0   0.0   0.0  0.0
2022-01-05   5.0   NaN   1.0  NaN
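A minimal sketch of what the asfreq step buys, using the grouper 1 sum column from the output above. A window given as an integer counts rows, not days, so without the inserted dates three rows would span five calendar days. Note also that fill_value only fills the rows asfreq adds; the NaN that was already present on 2022-01-05 is left alone. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sum column for grouper 1 before asfreq: only dates that occur in the data.
s = pd.Series(
    [5.0, 11.0, np.nan],
    index=pd.to_datetime(["2022-01-01", "2022-01-03", "2022-01-05"]),
)

# Three ROWS here cover 2022-01-01 through 2022-01-05, i.e. five days.
no_gaps_filled = s.rolling(window=3, min_periods=1).sum()

# After asfreq every calendar day has a row, so three rows equal three days.
daily = s.asfreq("D", fill_value=0)
filled = daily.rolling(window=3, min_periods=1).sum()

print(no_gaps_filled.loc["2022-01-05"])  # still includes the value from 2022-01-01
print(filled.loc["2022-01-05"])          # only 2022-01-03 to 2022-01-05
print(daily)
```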
Then use rolling with sum (this handles both the sums and the counts):
print(df.groupby(['date', 'grouper'])['numbers']
        .agg(['sum', 'size'])
        .unstack()
        .asfreq('D', fill_value=0)
        .rolling(window=3, min_periods=1)
        .sum()
     )
             sum        size
grouper        0     1     0    1
date
2022-01-01   2.0   5.0   1.0  2.0
2022-01-02   2.0   5.0   1.0  2.0
2022-01-03  13.0  16.0   3.0  3.0
2022-01-04  11.0  11.0   2.0  1.0
2022-01-05  16.0  11.0   3.0  1.0
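The role of min_periods=1 in the rolling step can be seen on the grouper 1 sum column in isolation. With an integer window, min_periods defaults to the window size, so any window with fewer than three non-NaN observations yields NaN. A short sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

# Grouper 1 sum column after asfreq: the last day has no data for this group.
s = pd.Series(
    [5.0, 0.0, 11.0, 0.0, np.nan],
    index=pd.date_range("2022-01-01", periods=5, freq="D"),
)

# Default min_periods (= window size): incomplete windows become NaN.
strict = s.rolling(window=3).sum()

# min_periods=1: a window needs only one non-NaN value; NaN entries are skipped.
lenient = s.rolling(window=3, min_periods=1).sum()

print(pd.DataFrame({"strict": strict, "lenient": lenient}))
```

The lenient column reproduces the grouper 1 sums shown above, including the value for 2022-01-05, where the NaN is skipped rather than propagated.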
Then shift by one row with DataFrame.shift, so that each date only sees the window that ended the day before:
print(df.groupby(['date', 'grouper'])['numbers']
        .agg(['sum', 'size'])
        .unstack()
        .asfreq('D', fill_value=0)
        .rolling(window=3, min_periods=1)
        .sum()
        .shift()
     )
             sum        size
grouper        0     1     0    1
date
2022-01-01   NaN   NaN   NaN  NaN
2022-01-02   2.0   5.0   1.0  2.0
2022-01-03   2.0   5.0   1.0  2.0
2022-01-04  13.0  16.0   3.0  3.0
2022-01-05  11.0  11.0   2.0  1.0
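As a possible alternative to rolling followed by shift (this is not part of the answer above, only a sketch), a time-based window with closed='left' excludes the current day directly. It is shown here on the grouper 0 sum column and relies on the index having exactly one row per day, as it does after asfreq; it is not meant for an index with duplicate dates.

```python
import pandas as pd

# Grouper 0 sum column after asfreq: one row per calendar day.
s = pd.Series(
    [2.0, 0.0, 11.0, 0.0, 5.0],
    index=pd.date_range("2022-01-01", periods=5, freq="D"),
)

# The approach from the answer: 3-row window, then move the result down one row.
shifted = s.rolling(window=3, min_periods=1).sum().shift()

# Sketch of an alternative: a 3-day window [t - 3 days, t) that leaves out day t.
left_closed = s.rolling("3D", closed="left").sum()

print(pd.DataFrame({"shifted": shifted, "left_closed": left_closed}))
```

On this regular daily index both columns agree, with NaN on the first day because nothing precedes it.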
Reshape back with DataFrame.stack:
print(df.groupby(['date', 'grouper'])['numbers']
        .agg(['sum', 'size'])
        .unstack()
        .asfreq('D', fill_value=0)
        .rolling(window=3, min_periods=1)
        .sum()
        .shift()
        .stack()
     )
                     sum  size
date       grouper
2022-01-02 0         2.0   1.0
           1         5.0   2.0
2022-01-03 0         2.0   1.0
           1         5.0   2.0
2022-01-04 0        13.0   3.0
           1        16.0   3.0
2022-01-05 0        11.0   2.0
           1        11.0   1.0
For the mean, divide the sum column by the size column:
print(df1['sum'].div(df1['size']).rename('aw'))
date        grouper
2022-01-02  0           2.000000
            1           2.500000
2022-01-03  0           2.000000
            1           2.500000
2022-01-04  0           4.333333
            1           5.333333
2022-01-05  0           5.500000
            1          11.000000
Name: aw, dtype: float64
And finally join the result back to the original DataFrame:
df = df.join(df1['sum'].div(df1['size']).rename('aw'), on=['date', 'grouper'])
print(df)
        date  numbers  grouper   aw
0 2022-01-01        1        1  NaN
1 2022-01-01        2        0  NaN
2 2022-01-01        4        1  NaN
3 2022-01-03        4        0  2.0
4 2022-01-03       11        1  2.5
5 2022-01-03        7        0  2.0
6 2022-01-05        5        0  5.5
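The final join matches two ordinary columns of the left frame against the two levels of the MultiIndex on the right. A toy sketch of that mechanism, with made-up names (`k1`, `k2`, `x`, `y`) that are not part of the original data:

```python
import pandas as pd

left = pd.DataFrame({"k1": ["a", "a", "b"], "k2": [1, 2, 1], "x": [10, 20, 30]})

# A named Series indexed by (k1, k2); join() requires the Series to have a name.
right = pd.Series(
    [0.5, 1.5],
    index=pd.MultiIndex.from_tuples([("a", 1), ("b", 1)], names=["k1", "k2"]),
    name="y",
)

# Left join: every row of `left` is kept; keys missing on the right give NaN.
out = left.join(right, on=["k1", "k2"])
print(out)
```

This is why the rows for 2022-01-01 end up as NaN in aw: after the shift there is no usable entry for that date to match against.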
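def function1(ss: pd.Series):
    # Drop the entries at the window's latest index value (the current day),
    # then average what is left.
    return ss.loc[ss.index != ss.index.max()].mean()

# A 4-day window covers the current day plus the 3 days before it.
df.assign(av=df.groupby('grouper').apply(lambda dd: dd
          .rolling('4D', on='date').numbers.apply(function1))
          .droplevel(0))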
        date  numbers  grouper   av
0 2022-01-01        1        1  NaN
1 2022-01-01        2        0  NaN
2 2022-01-01        4        1  NaN
3 2022-01-03        4        0  2.0
4 2022-01-03       11        1  2.5
5 2022-01-03        7        0  2.0
6 2022-01-05        5        0  5.5