I have a simple dataset that I want to apply a rolling operation to. The dataset looks like this:
    user_id                   create_at        v1
0  15991247  2022-05-31 21:00:21.150822  0.028059
1  24062521  2022-05-31 21:00:33.620000  1.781399
2  12610025  2022-05-31 21:01:30.349000  0.952400
3  24062521  2022-05-31 21:02:38.836000  1.571899
4  24062521  2022-05-31 21:02:44.156000  0.952600
When I run the rolling operation,
d1 = d1.set_index('create_at')
d1.groupby('user_id')['v1'].rolling('3D').sum().reset_index(drop = True)
I get the correct answer:
0 -0.083971
1 -0.139826
2 1.941623
3 1.169590
4 0.313641
...
But when I try to attach the result back to the dataset, everything becomes NaN, i.e.
d1['roll_3D'] = d1.groupby('user_id')['v1'].rolling('3D').sum().reset_index(drop = True)
                             user_id        v1  roll_3D
create_at
2022-05-31 21:00:21.150822  15991247  0.028059      NaN
2022-05-31 21:00:33.620000  24062521  1.781399      NaN
2022-05-31 21:01:30.349000  12610025  0.952400      NaN
2022-05-31 21:02:38.836000  24062521  1.571899      NaN
2022-05-31 21:02:44.156000  24062521  0.952600      NaN
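The problem is that you set create_at as the DataFrame's index. The grouped rolling result is indexed by (user_id, create_at), and calling reset_index(drop=True) on it replaces that with a fresh default integer index. Assigning a Series to a DataFrame column aligns on index labels, and none of the integers 0, 1, 2, ... match the datetime labels in the DataFrame's index, so every row is filled with NaN. To fix this, try using transform instead, which returns a result carrying the same index as the original data: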
d1['roll_3D'] = d1.groupby('user_id')['v1'].transform(lambda x: x.rolling('3D').sum())
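
To make the index mismatch concrete, here is a minimal sketch that rebuilds the five rows shown in the question and compares the two assignments. The column name roll_broken is hypothetical and exists only for illustration, and the sketch assumes create_at is parsed as datetimes and is sorted within each user, which a time-based rolling window requires.

```python
import pandas as pd

# The five sample rows from the question, with create_at as the index.
d1 = pd.DataFrame({
    "user_id": [15991247, 24062521, 12610025, 24062521, 24062521],
    "create_at": pd.to_datetime([
        "2022-05-31 21:00:21.150822",
        "2022-05-31 21:00:33.620000",
        "2022-05-31 21:01:30.349000",
        "2022-05-31 21:02:38.836000",
        "2022-05-31 21:02:44.156000",
    ]),
    "v1": [0.028059, 1.781399, 0.952400, 1.571899, 0.952600],
}).set_index("create_at")

# Original attempt: the result gets an integer index (0..4), while d1 is
# indexed by timestamps, so label alignment finds no matches.
broken = d1.groupby("user_id")["v1"].rolling("3D").sum().reset_index(drop=True)
d1["roll_broken"] = broken  # hypothetical column name, for illustration only

# Suggested fix: transform keeps the original index, so alignment succeeds.
d1["roll_3D"] = d1.groupby("user_id")["v1"].transform(
    lambda x: x.rolling("3D").sum()
)

print(d1[["user_id", "v1", "roll_broken", "roll_3D"]])
```

With these rows, roll_broken should come out entirely NaN while roll_3D holds a running per-user sum. Note also that the grouped rolling result is ordered by user_id rather than by the original row order, so even a positional assignment of the reset result would not reliably line up with the right rows.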