Pandas: rolling mean over dates after groupby

import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-01-01"] * 3 + ["2022-01-03"] * 3 + ["2022-01-05"]),
        "numbers": [1, 2, 4, 4, 11, 7, 5],
        "grouper": [1, 0, 1, 0, 1, 0, 0],
    }
)

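Given the df above, how can I get, for each row, the rolling mean of the numbers values that come before that row's date? For example, a rolling mean over the past 3 days, grouped by ["grouper", "date"].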

I know I can do something like the following, but it is not close to a solution. I would like to build on it:

df["av"] = df.shift(1).rolling(window=3).mean()

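But shift(1) only moves the values by one row rather than by date, so the window can still include rows from the current day.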
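To make the difference concrete, here is a minimal sketch contrasting an integer window, which counts rows, with an offset window such as "3D", which counts calendar time on a DatetimeIndex. The series `s` is made-up illustration data, not the df from the question.

```python
import pandas as pd

# Hypothetical series with uneven gaps between dates (illustration only).
s = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(["2022-01-01", "2022-01-03", "2022-01-10"]),
)

# Integer window: always the last 3 rows, however far apart their dates are.
by_rows = s.rolling(window=3, min_periods=1).mean()

# Offset window: only rows whose timestamp falls in (t - 3 days, t].
by_time = s.rolling("3D").mean()

print(by_rows)
print(by_time)
```

On 2022-01-10 the row-based mean still averages all three values, while the time-based mean only sees 4.0. Both windows include the current row by default, which is why an extra step is needed to exclude today.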

My expected output for the new av column, using a 3-day window and grouping by the two columns of the sample df, is:

        date  numbers  grouper   av
0 2022-01-01        1        1  NaN
1 2022-01-01        2        0  NaN
2 2022-01-01        4        1  NaN
3 2022-01-03        4        0  2.0
4 2022-01-03       11        1  2.5
5 2022-01-03        7        0  2.0
6 2022-01-05        5        0  5.5
python pandas group-by
2 answers

Answer 1 (1 vote)

By definition, the mean you need is sum / count:

df1 = (df.groupby(['date','grouper'])['numbers']
         .agg(['sum','size'])
         .unstack()
         .asfreq('d', fill_value=0)
         .rolling(window=3, min_periods=1)
         .sum()
         .shift()
         .stack()
         )

df = df.join(df1['sum'].div(df1['size']).rename('aw'), on=['date','grouper'])
print (df)
        date  numbers  grouper   aw
0 2022-01-01        1        1  NaN
1 2022-01-01        2        0  NaN
2 2022-01-01        4        1  NaN
3 2022-01-03        4        0  2.0
4 2022-01-03       11        1  2.5
5 2022-01-03        7        0  2.0
6 2022-01-05        5        0  5.5

Explanation

First, aggregate with sum and count the rows per (date, grouper) pair with GroupBy.size:

print (df.groupby(['date','grouper'])['numbers']
             .agg(['sum','size'])
             )
                    sum  size
date       grouper           
2022-01-01 0          2     1
           1          5     2
2022-01-03 0         11     2
           1         11     1
2022-01-05 0          5     1
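Keeping both the sum and the size matters because averaging per-day means would weight every day equally, whatever its row count. The sketch below uses a small made-up frame (`toy` is hypothetical, not part of the answer) to show that the two quantities differ.

```python
import pandas as pd

# Hypothetical data: two rows on the first day, one row on the second.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02"]),
        "numbers": [1, 4, 10],
    }
)

daily = toy.groupby("date")["numbers"].agg(["sum", "size", "mean"])

# Mean over all rows: total sum divided by total row count.
row_mean = daily["sum"].sum() / daily["size"].sum()

# Mean of the per-day means: each day counts once, regardless of its row count.
mean_of_means = daily["mean"].mean()

print(row_mean, mean_of_means)
```

Here the row-level mean is 15 / 3, while the mean of daily means is (2.5 + 10) / 2, so rolling over sums and sizes separately and dividing at the end is what matches the per-row average asked for in the question.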

Then reshape with DataFrame.unstack so that the dates form a DatetimeIndex:

print (df.groupby(['date','grouper'])['numbers']
             .agg(['sum','size'])
             .unstack()
             )
             sum       size     
grouper        0     1    0    1
date                            
2022-01-01   2.0   5.0  1.0  2.0
2022-01-03  11.0  11.0  2.0  1.0
2022-01-05   5.0   NaN  1.0  NaN

Add the missing consecutive dates with DataFrame.asfreq:

print (df.groupby(['date','grouper'])['numbers']
             .agg(['sum','size'])
             .unstack()
             .asfreq('d', fill_value=0)
             )
             sum       size     
grouper        0     1    0    1
date                            
2022-01-01   2.0   5.0  1.0  2.0
2022-01-02   0.0   0.0  0.0  0.0
2022-01-03  11.0  11.0  2.0  1.0
2022-01-04   0.0   0.0  0.0  0.0
2022-01-05   5.0   NaN  1.0  NaN
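Note that the NaN on 2022-01-05 survives this step. As far as I know, fill_value in asfreq applies only to rows that are newly inserted, not to NaN values that were already present. A minimal sketch on a made-up series (`s` is hypothetical) illustrates this, and also shows that a rolling sum with min_periods=1 skips the leftover NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical series: one real value, one pre-existing NaN, and a missing day between.
s = pd.Series(
    [2.0, np.nan],
    index=pd.to_datetime(["2022-01-01", "2022-01-03"]),
)

# 2022-01-02 is inserted and filled with 0; the NaN on 2022-01-03 is kept.
filled = s.asfreq("D", fill_value=0)

# With min_periods=1 the rolling sum ignores the NaN as long as one value is present.
rolled = filled.rolling(window=3, min_periods=1).sum()

print(filled)
print(rolled)
```

This is why the pre-existing NaN does no harm in the next step of the answer.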

Then apply rolling with sum, which handles both the sums and the counts:

print (df.groupby(['date','grouper'])['numbers']
             .agg(['sum','size'])
             .unstack()
             .asfreq('d', fill_value=0)
             .rolling(window=3, min_periods=1)
             .sum()
             )
             sum       size     
grouper        0     1    0    1
date                            
2022-01-01   2.0   5.0  1.0  2.0
2022-01-02   2.0   5.0  1.0  2.0
2022-01-03  13.0  16.0  3.0  3.0
2022-01-04  11.0  11.0  2.0  1.0
2022-01-05  16.0  11.0  3.0  1.0

Use DataFrame.shift to move the rolled values down by one day, so that each date only sees the 3 days before it:

print (df.groupby(['date','grouper'])['numbers']
             .agg(['sum','size'])
             .unstack()
             .asfreq('d', fill_value=0)
             .rolling(window=3, min_periods=1)
             .sum()
             .shift()
             )
             sum       size     
grouper        0     1    0    1
date                            
2022-01-01   NaN   NaN  NaN  NaN
2022-01-02   2.0   5.0  1.0  2.0
2022-01-03   2.0   5.0  1.0  2.0
2022-01-04  13.0  16.0  3.0  3.0
2022-01-05  11.0  11.0  2.0  1.0

Reshape back with DataFrame.stack:

print (df.groupby(['date','grouper'])['numbers']
             .agg(['sum','size'])
             .unstack()
             .asfreq('d', fill_value=0)
             .rolling(window=3, min_periods=1)
             .sum()
             .shift()
             .stack()
             )
                     sum  size
date       grouper            
2022-01-02 0         2.0   1.0
           1         5.0   2.0
2022-01-03 0         2.0   1.0
           1         5.0   2.0
2022-01-04 0        13.0   3.0
           1        16.0   3.0
2022-01-05 0        11.0   2.0
           1        11.0   1.0

Divide the columns to get the mean:

print (df1['sum'].div(df1['size']).rename('aw'))
date        grouper
2022-01-02  0           2.000000
            1           2.500000
2022-01-03  0           2.000000
            1           2.500000
2022-01-04  0           4.333333
            1           5.333333
2022-01-05  0           5.500000
            1          11.000000
Name: aw, dtype: float64

And join the result back to the original DataFrame:

df = df.join(df1['sum'].div(df1['size']).rename('aw'), on=['date','grouper'])
print (df)
        date  numbers  grouper   aw
0 2022-01-01        1        1  NaN
1 2022-01-01        2        0  NaN
2 2022-01-01        4        1  NaN
3 2022-01-03        4        0  2.0
4 2022-01-03       11        1  2.5
5 2022-01-03        7        0  2.0
6 2022-01-05        5        0  5.5
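The same sum / count idea can also be sketched with a time-based window instead of asfreq plus shift. This is an alternative under stated assumptions, not the method above: it assumes that aggregating to one row per (grouper, date) first makes the dates unique within each group, so that closed="left" cleanly excludes the current date. The names `daily`, `rolled`, and `out` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-01-01"] * 3 + ["2022-01-03"] * 3 + ["2022-01-05"]),
        "numbers": [1, 2, 4, 4, 11, 7, 5],
        "grouper": [1, 0, 1, 0, 1, 0, 0],
    }
)

# One row per (grouper, date), so dates are unique within each group.
daily = (
    df.groupby(["grouper", "date"])["numbers"]
    .agg(["sum", "size"])
    .reset_index()
    .sort_values("date")
    .set_index("date")
)

# closed="left" makes the window [t - 3 days, t), which excludes date t itself.
rolled = daily.groupby("grouper")[["sum", "size"]].rolling("3D", closed="left").sum()

# Rolling sum divided by rolling row count gives the per-row mean.
av = (rolled["sum"] / rolled["size"]).rename("av")
out = df.join(av, on=["grouper", "date"])
print(out)
```

On the sample data this should reproduce the av column from the question. The trade-off is that it avoids building a dense daily index, which may help when the date range is long and sparse.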

Answer 2 (0 votes)
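def function1(ss:pd.Series):
    # Drop the rows for the latest date in the window, then average the rest.
    # This relies on the window being indexed by date.
    return ss.loc[ss.index!=ss.index.max()].mean()

# '4d' covers the current date plus the 3 days before it; function1 removes the current date.
df.assign(av=df.groupby('grouper').apply(lambda dd:dd
                                         .rolling('4d',on='date').numbers.apply(function1))
          .droplevel(0))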


        date  numbers  grouper   av
0 2022-01-01        1        1  NaN
1 2022-01-01        2        0  NaN
2 2022-01-01        4        1  NaN
3 2022-01-03        4        0  2.0
4 2022-01-03       11        1  2.5
5 2022-01-03        7        0  2.0
6 2022-01-05        5        0  5.5