Consider this DataFrame:
import pandas as pd
import numpy as np
values = [0, 22, 30, 0, 20, 22, 11, 0, 13]
index = pd.date_range(start = '2023-10-1', periods = len(values))
df = pd.DataFrame({'values':values }, index = index)
df
            values
2023-10-01       0
2023-10-02      22
2023-10-03      30
2023-10-04       0
2023-10-05      20
2023-10-06      22
2023-10-07      11
2023-10-08       0
2023-10-09      13
Goal: create a new column that counts how many days have passed since the last 0 in the values column.
I can do this with a for loop:
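zero_indices = df[df['values'] == 0].index
df['days'] = np.nan
# Label slices include both endpoints, so each pass also writes to the next zero;
# the following pass resets that row to 0. df.loc avoids chained assignment.
for i in range(len(zero_indices) - 1):
    start, stop = zero_indices[i], zero_indices[i + 1]
    df.loc[start:stop, 'days'] = np.arange(len(df.loc[start:stop]))
# Rows from the last zero to the end of the frame
df.loc[zero_indices[-1]:, 'days'] = np.arange(len(df.loc[zero_indices[-1]:]))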
            values  days
2023-10-01       0  0.00
2023-10-02      22  1.00
2023-10-03      30  2.00
2023-10-04       0  0.00
2023-10-05      20  1.00
2023-10-06      22  2.00
2023-10-07      11  3.00
2023-10-08       0  0.00
2023-10-09      13  1.00
Question: how can I do this in a vectorized (faster) way?
There are many ways to do this; one solution is to use groupby and cumcount:
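# Use df['values'] here: df.values is the DataFrame's underlying NumPy array, not the column
df['temp'] = (df['values'] == 0).cumsum()  # group id that increases by 1 at every 0
df.groupby('temp').cumcount()  # row position within each group, i.e. rows since the last 0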
Output:
2023-10-01 0
2023-10-02 1
2023-10-03 2
2023-10-04 0
2023-10-05 1
2023-10-06 2
2023-10-07 3
2023-10-08 0
2023-10-09 1
Freq: D, dtype: int64
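Note that cumcount counts rows, which matches the number of days only when the index has exactly one row per day, as it does here. If the index could have gaps, one possible variant is to subtract the date of the most recent zero from each row's date. This is a minimal sketch under that assumption, not part of the original answer; the names is_zero and last_zero are illustrative only.

```python
import pandas as pd

values = [0, 22, 30, 0, 20, 22, 11, 0, 13]
index = pd.date_range(start="2023-10-01", periods=len(values))
df = pd.DataFrame({"values": values}, index=index)

# Boolean mask marking the rows where the column equals 0
is_zero = df["values"].eq(0)

# Date of the most recent zero at or before each row
# (rows before the first zero, if any, would hold NaT)
dates = df.index.to_series()
last_zero = dates.where(is_zero).ffill()

# Calendar days elapsed since that zero
df["days"] = (dates - last_zero).dt.days
print(df)
```

On this sample data the result matches the cumcount approach, and no helper column such as temp is left behind in the frame.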