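There are plenty of answers based on a numeric index, but I am looking for a solution that works with a DatetimeIndex, and I am really stuck here. The closest answer I found for a numeric index is this one, but it does not work for my example.
I want to get the group start and end as DateTime for groups of n consecutive equal values in a DataFrame column.
Sample data: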
import pandas as pd
index = pd.date_range(
    start=pd.Timestamp("2023-03-20 12:00:00+0000", tz="UTC"),
    end=pd.Timestamp("2023-03-20 15:00:00+0000", tz="UTC"),
    freq="15Min",
)
data = {
    "values_including_constant_groups": [
        2.0,
        1.0,
        1.0,
        3.0,
        3.0,
        3.0,
        4.0,
        4.0,
        4.0,
        2.0,
        3.0,
        3.0,
        1.0,
    ],
}
df = pd.DataFrame(
    index=index,
    data=data,
)
print(df)
Output:
values_including_constant_groups
2023-03-20 12:00:00+00:00 2.0
2023-03-20 12:15:00+00:00 1.0
2023-03-20 12:30:00+00:00 1.0
2023-03-20 12:45:00+00:00 3.0
2023-03-20 13:00:00+00:00 3.0
2023-03-20 13:15:00+00:00 3.0
2023-03-20 13:30:00+00:00 4.0
2023-03-20 13:45:00+00:00 4.0
2023-03-20 14:00:00+00:00 4.0
2023-03-20 14:15:00+00:00 2.0
2023-03-20 14:30:00+00:00 3.0
2023-03-20 14:45:00+00:00 3.0
2023-03-20 15:00:00+00:00 1.0
Desired output (I am flexible here, but this was my first idea):
values_including_constant_groups group_start group_end
2023-03-20 12:00:00+00:00 2.0 NaN NaN
2023-03-20 12:15:00+00:00 1.0 True False
2023-03-20 12:30:00+00:00 1.0 False True
2023-03-20 12:45:00+00:00 3.0 True False
2023-03-20 13:00:00+00:00 3.0 False False
2023-03-20 13:15:00+00:00 3.0 False True
2023-03-20 13:30:00+00:00 4.0 True False
2023-03-20 13:45:00+00:00 4.0 False False
2023-03-20 14:00:00+00:00 4.0 False True
2023-03-20 14:15:00+00:00 2.0 NaN NaN
2023-03-20 14:30:00+00:00 3.0 True False
2023-03-20 14:45:00+00:00 3.0 False True
2023-03-20 15:00:00+00:00 1.0 NaN NaN
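So only groups with n>=2 should be considered here, and "single" values should be excluded. Repeated groups (the same value forming another group later, like 3.0 here) should also be included.
Any hints are welcome!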
c = 'values_including_constant_groups'
# Compare current with previous and previous with current row
# to flag the rows corresponding to group start and group end
s, e = df[c] != df[c].shift(), df[c] != df[c].shift(-1)
# Mask the flags on rows where both group_start and group_end
# are True, i.e. where the group has length n == 1
df['group_start'], df['group_end'] = s.mask(s & e), e.mask(s & e)
Result:
values_including_constant_groups group_start group_end
2023-03-20 12:00:00+00:00 2.0 NaN NaN
2023-03-20 12:15:00+00:00 1.0 True False
2023-03-20 12:30:00+00:00 1.0 False True
2023-03-20 12:45:00+00:00 3.0 True False
2023-03-20 13:00:00+00:00 3.0 False False
2023-03-20 13:15:00+00:00 3.0 False True
2023-03-20 13:30:00+00:00 4.0 True False
2023-03-20 13:45:00+00:00 4.0 False False
2023-03-20 14:00:00+00:00 4.0 False True
2023-03-20 14:15:00+00:00 2.0 NaN NaN
2023-03-20 14:30:00+00:00 3.0 True False
2023-03-20 14:45:00+00:00 3.0 False True
2023-03-20 15:00:00+00:00 1.0 NaN NaN
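Since the question says the output format is flexible, the start and end timestamps of each group could also be collected into one row per group instead of boolean flags. The following is only a sketch under that assumption, not part of the answer above: it uses a different technique (a cumulative sum over the change points to label each run), and the names `run_id`, `timestamps`, `runs`, `value` and `n` are hypothetical, introduced just for illustration.

```python
import pandas as pd

# Same sample data as in the question
index = pd.date_range(
    start="2023-03-20 12:00:00",
    end="2023-03-20 15:00:00",
    freq="15min",
    tz="UTC",
)
values = [2.0, 1.0, 1.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 2.0, 3.0, 3.0, 1.0]
df = pd.DataFrame({"values_including_constant_groups": values}, index=index)

c = "values_including_constant_groups"

# A row starts a new run whenever it differs from the previous row;
# the cumulative sum of those change points gives every run its own id
run_id = (df[c] != df[c].shift()).cumsum()

# One row per run: its value, first and last timestamp, and length
timestamps = df.index.to_series()
runs = pd.DataFrame({
    "value": df[c].groupby(run_id).first(),
    "group_start": timestamps.groupby(run_id).min(),
    "group_end": timestamps.groupby(run_id).max(),
    "n": df[c].groupby(run_id).size(),
})

# Keep only runs of at least two consecutive equal values (hypothetical threshold)
runs = runs[runs["n"] >= 2].reset_index(drop=True)
print(runs)
```

Because the run id increases at every change, the two separate groups of 3.0 get different ids and both appear in `runs`. Raising the threshold in the last filter would restrict the table to longer groups.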