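I have a large amount of millisecond-granularity tick data loaded into a pandas DataFrame. To make processing easier, I added columns that assign each timestamp to its year, month, week, day, hour, and minute.
I want the mean, percentiles, and count over a rolling 4-week window. I can't simply roll over timestamps/rows, because each week contains a different number of timestamps (ranging from 50k to 70k, depending on that week's activity). That is why I want to group the timestamps into their respective weeks and compute the mean, percentiles, and count over all timestamps within the rolling weeks.
If I first compute the mean, percentiles, and count per week, build a new df from them, and then roll over that, I'm afraid information will be lost, particularly for the percentiles.
I have tried the following two snippets (one adding to the existing df, one creating a new Series):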
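The concern about losing information can be checked on a small example. The sketch below uses made-up toy data (two weeks with different tick counts) and suggests that the pooled mean and count can be rebuilt exactly from weekly sums and counts, while the pooled median cannot be rebuilt from weekly medians.

```python
import pandas as pd

# Hypothetical toy data: two "weeks" with different numbers of ticks.
df = pd.DataFrame({
    "week": [1, 1, 1, 1, 2, 2],
    "price": [1.0, 2.0, 3.0, 4.0, 100.0, 200.0],
})

weekly = df.groupby("week")["price"].agg(["sum", "count", "median"])

# Mean and count can be rebuilt exactly from weekly sums and counts.
pooled_count = weekly["count"].sum()
pooled_mean = weekly["sum"].sum() / pooled_count

# The pooled median needs the raw ticks; averaging weekly medians differs.
true_median = df["price"].median()             # 3.5
median_of_medians = weekly["median"].mean()    # 76.25
```

So pre-aggregating per week is safe for the mean and count (as long as sums and counts are kept rather than weekly means), but percentiles have to be computed on the pooled raw ticks.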
df.groupby("week")["price"].transform(lambda x: x.rolling(4, 1).mean())
_ma = df.groupby("week")["price"].rolling(4, 1).mean()
Neither seems to work: both return the mean of the last 4 timestamps/rows, not of all timestamps grouped over the past 4 weeks.
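A small reproduction may help show why this happens. In the sketch below (toy data, not the real ticks), the rolling window is applied separately inside each week group, so it counts rows within a week and never reaches across weeks.

```python
import pandas as pd

# Hypothetical toy data: two weeks with three ticks each.
df = pd.DataFrame({
    "week": [1, 1, 1, 2, 2, 2],
    "price": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

# rolling(4, 1) means window=4 rows, min_periods=1, applied per week group.
per_group = df.groupby("week")["price"].transform(
    lambda x: x.rolling(4, 1).mean()
)
# per_group is [10.0, 15.0, 20.0, 100.0, 150.0, 200.0]:
# the first row of week 2 is 100.0, so nothing from week 1 is included.
```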
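For now I have found a workaround that uses a loop to build a list of dictionaries, which I then map onto the df.
Although the approach below works, I'm hoping for a more elegant solution.
Solution: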
weeks = df["week"].unique()
window = 4
results = []
# Stop at len(weeks) - window so that weeks[i + window] always exists.
for i in range(len(weeks) - window):
    weeks_group = weeks[i: i + window]
    # The statistics of these 4 weeks are keyed by the week that follows them.
    key = weeks[i + window]
    weeks_rolling = df[df["week"].isin(weeks_group)]
    average_price = weeks_rolling["price"].mean()
    quantile = weeks_rolling["price"].quantile(0.5)
    count = weeks_rolling["price"].count()
    results.append({key: [average_price, quantile, count]})
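The step that maps the loop output back onto the df is not shown in the post. One way it might look is sketched below; the sample `results` and `df` values and the column names `mean_4w`, `median_4w`, and `count_4w` are made up for illustration.

```python
import pandas as pd

# Hypothetical loop output: one single-key dict per target week.
results = [{5: [10.0, 9.5, 200]}, {6: [11.0, 10.5, 220]}]
df = pd.DataFrame({"week": [4, 5, 5, 6], "price": [9.0, 10.0, 11.0, 12.0]})

# Flatten the list of dicts into one dict, then build a per-week table.
flat = {week: stats for item in results for week, stats in item.items()}
stats = pd.DataFrame.from_dict(
    flat, orient="index", columns=["mean_4w", "median_4w", "count_4w"]
)

# Broadcast each per-week statistic onto every tick row of that week.
# Weeks without a full preceding window (week 4 here) get NaN.
for col in stats.columns:
    df[col] = df["week"].map(stats[col])
```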
import pandas as pd
# Assuming df is your DataFrame and already includes the 'week' and 'price' columns
# Define a custom function to apply on each group
def rolling_stats(group, window=4):
    # Rolling mean, median, and count of 'price'.
    # Note: the window counts rows inside the group, not weeks.
    rolling_groups = group['price'].rolling(window=window, min_periods=1)
    mean = rolling_groups.mean()
    median = rolling_groups.median()
    count = rolling_groups.count()
    # You can adjust the format of return to match your needs
    return pd.DataFrame({
        'mean_price': mean,
        'median_price': median,
        'count_price': count
    })
# Create a sorted DataFrame by week to ensure correct rolling calculation
df_sorted = df.sort_values('week')
# Group by 'week', and apply the rolling statistics
# Using 'group_keys=False' to keep the original DataFrame index
results = df_sorted.groupby('week', group_keys=False).apply(rolling_stats)
# If you need to merge these results back into the original DataFrame:
df_merged = df.merge(results, left_index=True, right_index=True, how='left')
# Display the merged DataFrame
print(df_merged.head())
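If the goal is to pool every tick from 4 consecutive weeks, an alternative sketch is to sort once and slice the price array by week boundaries. This is a different technique from the groupby/apply code above, not a drop-in fix for it. It assumes that `week` is an increasing identifier that does not repeat across years (otherwise a combined year-week key would be needed) and that `price` has no missing values; the function name `rolling_week_stats` and the toy data are made up.

```python
import numpy as np
import pandas as pd

def rolling_week_stats(df, window=4, q=0.5):
    """Pool all ticks in each run of `window` consecutive weeks."""
    data = df.sort_values("week")
    weeks = data["week"].to_numpy()
    prices = data["price"].to_numpy()
    uniq = np.unique(weeks)
    # Row offsets where each week starts, plus the end of the data.
    starts = np.searchsorted(weeks, uniq, side="left")
    bounds = np.append(starts, len(weeks))
    rows = []
    for i in range(window - 1, len(uniq)):
        # All ticks from week uniq[i - window + 1] through week uniq[i].
        chunk = prices[bounds[i - window + 1]: bounds[i + 1]]
        rows.append({
            "week": uniq[i],
            "mean": chunk.mean(),
            "quantile": np.quantile(chunk, q),
            "count": chunk.size,
        })
    return pd.DataFrame(rows).set_index("week")

# Hypothetical toy data: week k contains k ticks, each priced at k.
ticks = pd.DataFrame({
    "week": np.repeat([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),
    "price": np.repeat([1.0, 2.0, 3.0, 4.0, 5.0], [1, 2, 3, 4, 5]),
})
stats = rolling_week_stats(ticks, window=4)
```

Each row of `stats` is labeled by the last week inside its window. To follow the loop's convention of keying a window by the week that comes after it, calling `stats.shift(1)` should move every row one week later.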