`pandas` rolling sum with a maximum number of valid observations in the window

Problem description

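I'm looking for help speeding up a rolling calculation in pandas that computes a rolling mean using at most a predefined number of the most recent valid observations in each window. Here is the code that generates a sample frame, followed by the frame itself: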

import pandas as pd
import numpy as np

tmp = pd.DataFrame(
    [
        [11.1]*3 + [12.1]*3 + [13.1]*3  + [14.1]*3 + [15.1]*3 + [16.1]*3 + [17.1]*3 + [18.1]*3,
        ['A', 'B', 'C']*8,
        [np.nan]*6 + [1, 1, 1] + [2, 2, 2] + [3, 3, 3] + [np.nan]*9
    ],
    index=['Date', 'Name', 'Val']
)
tmp = tmp.T.pivot(index='Date', columns='Name', values='Val')

Name    A    B    C
Date               
11.1  NaN  NaN  NaN
12.1  NaN  NaN  NaN
13.1    1    1    1
14.1    2    2    2
15.1    3    3    3
16.1  NaN  NaN  NaN
17.1  NaN  NaN  NaN
18.1  NaN  NaN  NaN

I would like to get this result (3-row window, mean of at most the 2 most recent non-NaN values in it):

Name    A    B    C
Date               
11.1  NaN  NaN  NaN
12.1  NaN  NaN  NaN
13.1  1.0  1.0  1.0
14.1  1.5  1.5  1.5
15.1  2.5  2.5  2.5
16.1  2.5  2.5  2.5
17.1  3.0  3.0  3.0
18.1  NaN  NaN  NaN

Attempted solution

I tried the following code. It works, but its performance is very poor on the datasets I actually work with.

tmp.rolling(window=3, min_periods=1).apply(lambda x: x[~np.isnan(x)][-2:].mean(), raw=True)

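Applied to a 3k x 50k frame, the calculation above takes about 20 minutes... Is there perhaps a more elegant and faster way to get the same result? Maybe by combining the results of several rolling calculations, or something with groupby?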

python pandas dataframe pandas-rolling
1 Answer
import pandas as pd
import numpy as np
 
tmp = pd.DataFrame(
    [
        [11.1]*3 + [12.1]*3 + [13.1]*3  + [14.1]*3 + [15.1]*3 + [16.1]*3 + [17.1]*3 + [18.1]*3,
        ['A', 'B', 'C']*8,
        [np.nan]*6 + [1, 1, 1] + [2, 2, 2] + [3, 3, 3] + [np.nan]*9
    ],
    index=['Date', 'Name', 'Val']
)
tmp = tmp.T.pivot(index='Date', columns='Name', values='Val')
 
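window_size = 3
max_recent_observations = 2

# Sum of the non-NaN values in each window
rolling_sum = tmp.rolling(window=window_size, min_periods=1).sum()

# Number of non-NaN values in each window
non_nan_count = tmp.rolling(window=window_size, min_periods=1).count()

# Subtracting and re-adding tmp cancels out where tmp is valid and
# propagates NaN where tmp is NaN, so rows whose own value is missing become NaN
rolling_avg = rolling_sum.sub(tmp).add(tmp).div(non_nan_count)
# Keep only windows holding at most max_recent_observations valid values;
# all other windows become NaN
rolling_avg = rolling_avg.where(non_nan_count <= max_recent_observations)

print(rolling_avg)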

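I adjusted a few things, but they should be self-explanatory. I think this will work the way you need; please let me know whether this is what you were looking for.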
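
For reference, here is a minimal vectorized sketch of the calculation described in the question (3-row window, mean of at most the 2 most recent non-NaN values), written without `rolling.apply`. The function name `rolling_mean_last_valid` and its parameters `window` and `max_obs` are made up for illustration; they are not pandas API. The idea, assuming the frame can be cast to float: `ffill` gives the most recent valid value as of each row, shifting that by one row and keeping it only at valid rows gives the second most recent, and so on. A rolling count of valid values then decides how many of these layers fall inside the window. Unlike masking by the valid count, this keeps rows whose own value is NaN and windows holding more than `max_obs` valid values.

```python
import numpy as np
import pandas as pd


def rolling_mean_last_valid(df, window, max_obs):
    """Mean of at most `max_obs` most recent non-NaN values per rolling window.

    Hypothetical helper for illustration; assumes `df` is castable to float.
    """
    df = df.astype(float)
    valid = df.notna()
    # Number of non-NaN values in each window (0 for all-NaN windows)
    n_valid = valid.astype(float).rolling(window, min_periods=1).sum().round()

    total = pd.DataFrame(0.0, index=df.index, columns=df.columns)
    layer = df  # at valid rows: the i-th most recent valid value as of that row
    for i in range(1, max_obs + 1):
        # i-th most recent valid value as of every row
        recent = layer.ffill()
        # Count it only if the window holds at least i valid values
        total = total + recent.where(n_valid >= i, 0.0)
        # One row earlier, seen from a valid row, this is the (i+1)-th most recent
        layer = recent.shift(1).where(valid)

    used = n_valid.clip(upper=max_obs)
    return (total / used).where(used > 0)


# Sample data equivalent to the frame in the question
dates = [11.1, 12.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1]
col = [np.nan, np.nan, 1, 2, 3, np.nan, np.nan, np.nan]
sample = pd.DataFrame({name: col for name in "ABC"},
                      index=pd.Index(dates, name="Date"))

result = rolling_mean_last_valid(sample, window=3, max_obs=2)
print(result)
```

On the sample frame this should reproduce the desired table from the question. It makes `max_obs` passes of whole-frame operations instead of calling a Python lambda per window, so it can be expected to scale better to a 3k x 50k frame, although it has not been benchmarked at that size here.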
