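I'm looking for help speeding up a rolling calculation in pandas that computes a rolling mean over a predefined maximum number of the most recent observations. Here is the code that generates a sample frame, followed by the frame itself: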
import pandas as pd
import numpy as np
tmp = pd.DataFrame(
[
[11.1]*3 + [12.1]*3 + [13.1]*3 + [14.1]*3 + [15.1]*3 + [16.1]*3 + [17.1]*3 + [18.1]*3,
['A', 'B', 'C']*8,
[np.nan]*6 + [1, 1, 1] + [2, 2, 2] + [3, 3, 3] + [np.nan]*9
],
index=['Date', 'Name', 'Val']
)
tmp = tmp.T.pivot(index='Date', columns='Name', values='Val')
Name    A    B    C
Date
11.1  NaN  NaN  NaN
12.1  NaN  NaN  NaN
13.1    1    1    1
14.1    2    2    2
15.1    3    3    3
16.1  NaN  NaN  NaN
17.1  NaN  NaN  NaN
18.1  NaN  NaN  NaN
This is the result I want:
Name    A    B    C
Date
11.1  NaN  NaN  NaN
12.1  NaN  NaN  NaN
13.1  1.0  1.0  1.0
14.1  1.5  1.5  1.5
15.1  2.5  2.5  2.5
16.1  2.5  2.5  2.5
17.1  3.0  3.0  3.0
18.1  NaN  NaN  NaN
I tried the following code. It works, but it performs very poorly on the datasets I'm stuck with in practice.
tmp.rolling(window=3, min_periods=1).apply(lambda x: x[~np.isnan(x)][-2:].mean(), raw=True)
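Applied to a 3k x 50k frame, the calculation above takes about 20 minutes... Is there perhaps a more elegant and faster way to get the same result? Maybe by combining the results of several rolling calculations, or something with groupby?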
import pandas as pd
import numpy as np
tmp = pd.DataFrame(
[
[11.1]*3 + [12.1]*3 + [13.1]*3 + [14.1]*3 + [15.1]*3 + [16.1]*3 + [17.1]*3 + [18.1]*3,
['A', 'B', 'C']*8,
[np.nan]*6 + [1, 1, 1] + [2, 2, 2] + [3, 3, 3] + [np.nan]*9
],
index=['Date', 'Name', 'Val']
)
tmp = tmp.T.pivot(index='Date', columns='Name', values='Val')
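# pivot leaves the values as object dtype; the arithmetic below needs floats
tmp = tmp.astype(float)
window_size = 3
max_recent_observations = 2
# Walk back through the window one lag at a time (lag 0 is the current row),
# taking a value only while fewer than max_recent_observations have been taken.
total = pd.DataFrame(0.0, index=tmp.index, columns=tmp.columns)
count = pd.DataFrame(0, index=tmp.index, columns=tmp.columns)
for lag in range(window_size):
    shifted = tmp.shift(lag)
    take = shifted.notna() & (count < max_recent_observations)
    total = total + shifted.where(take, 0.0)
    count = count + take.astype(int)
# Windows with no observation divide by NaN and therefore stay NaN.
rolling_avg = total / count.where(count > 0)
print(rolling_avg)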
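I adjusted a few things, but they should be self-explanatory. I think this will work the way you need; please let me know if this is what you were looking for.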
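For reference, here is a minimal sketch of one vectorized way to express the "mean of the last N valid values inside the window" rule directly on the NumPy array. It assumes the window is small compared with the number of rows. The function name `rolling_mean_last_n` and its parameters `window` and `max_obs` are made up for this illustration and are not part of pandas. The sample frame is built directly as floats rather than through `pivot`, which is also my own shortcut.

```python
import numpy as np
import pandas as pd


def rolling_mean_last_n(df, window, max_obs):
    """Mean of the most recent `max_obs` non-NaN values within the last `window` rows.

    Hypothetical helper for illustration; not a pandas API.
    """
    vals = df.to_numpy(dtype=float)
    n_rows = vals.shape[0]
    total = np.zeros_like(vals)
    count = np.zeros(vals.shape, dtype=np.int64)
    # lag 0 is the current row, lag 1 the row before it, and so on.
    for lag in range(min(window, n_rows)):
        shifted = np.full_like(vals, np.nan)
        shifted[lag:] = vals[:n_rows - lag]
        # Take a value only if it exists and the quota is not yet filled.
        take = ~np.isnan(shifted) & (count < max_obs)
        total[take] += shifted[take]
        count[take] += 1
    # Windows without any observation keep their NaN.
    result = np.full_like(vals, np.nan)
    np.divide(total, count, out=result, where=count > 0)
    return pd.DataFrame(result, index=df.index, columns=df.columns)


# Same values as the sample frame in the question.
col = [np.nan, np.nan, 1, 2, 3, np.nan, np.nan, np.nan]
dates = [11.1, 12.1, 13.1, 14.1, 15.1, 16.1, 17.1, 18.1]
tmp = pd.DataFrame({name: col for name in "ABC"}, index=pd.Index(dates, name="Date"))

result = rolling_mean_last_n(tmp, window=3, max_obs=2)
print(result)
```

The loop runs `window` times regardless of the frame size, so the work is a handful of whole-array operations instead of one Python call per window. I have not timed it on a 3k x 50k frame. The temporary arrays are each as large as the input, so if memory gets tight it may be worth processing the columns in chunks.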