Rolling mean, count, or quantile per group

Question    Votes: 0    Answers: 1

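I have a large amount of millisecond-granularity tick data that I have loaded into a pandas DataFrame. To make processing easier, I added columns that assign each timestamp to its year, month, week, day, hour, and minute.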

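I want the mean, percentiles, and count over a rolling 4-week window. I can't simply roll over timestamps/rows, because each week has a different number of timestamps (ranging from 50k to 70k, depending on that week's activity). That is why I want to group the timestamps into their respective weeks and compute the mean, percentiles, and count across all timestamps in the rolling weeks.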

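If I first compute the mean, percentiles, and count per week, build a new df from them, and then roll over that, I am worried that information will be lost, especially for the percentiles.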
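The concern can be checked on a small example. The sketch below uses hypothetical toy data (the values and the 4-week span are assumptions, not the real tick data). It suggests that the pooled count and mean can be rebuilt exactly from weekly sums and counts, while a median of weekly medians generally differs from the median of the pooled ticks.

```python
import pandas as pd

# Hypothetical toy data: 4 weeks with different numbers of ticks per week
df = pd.DataFrame({
    "week": [1, 1, 1, 2, 3, 3, 4, 4, 4, 4],
    "price": [1.0, 2.0, 3.0, 10.0, 4.0, 6.0, 5.0, 5.0, 7.0, 100.0],
})
window = 4

# Per-week aggregates, one row per week
weekly = df.groupby("week")["price"].agg(["sum", "count", "median"])

# Count and mean of the pooled window follow exactly from weekly sums and counts
roll_count = weekly["count"].rolling(window).sum()
roll_mean = weekly["sum"].rolling(window).sum() / roll_count

# Statistics computed directly on all ticks of the 4 weeks
pooled_mean = df["price"].mean()
pooled_median = df["price"].median()

# Rolling over weekly medians is not the same as the pooled median
median_of_medians = weekly["median"].rolling(window).median().iloc[-1]

print(roll_mean.iloc[-1], pooled_mean)
print(median_of_medians, pooled_median)
```

So only the percentiles need access to the raw ticks of the whole window; the mean and count do not.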

I have tried the following two snippets (one adds to the existing df, the other creates a new Series):

df.groupby("week")["price"].transform(lambda x: x.rolling(4, 1).mean())
_ma = df.groupby("week")["price"].rolling(4, 1).mean()

Neither seems to work: both return the mean of the last 4 timestamps/rows, not of all timestamps grouped within the past 4 weeks.
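This behaviour is expected: an integer window such as `rolling(4, 1)` counts rows, and grouping by week first only restarts that row window inside each week. A minimal illustration on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: five ticks in week 1, two ticks in week 2
df = pd.DataFrame({
    "week": [1, 1, 1, 1, 1, 2, 2],
    "price": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0],
})

# The window spans the last 4 rows of the current week, never earlier weeks
row_rolling = df.groupby("week")["price"].transform(
    lambda x: x.rolling(4, 1).mean()
)
print(row_rolling.tolist())
```

The fifth value averages rows 2 to 5 of week 1, and the first value of week 2 sees only itself, so no value ever mixes ticks from different weeks.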

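For now I have found a workaround that uses a loop to build a list of dictionaries, which I then map onto the df.

The approach below works, but I am hoping for a more elegant solution.

Workaround: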

weeks = df["week"].unique()
window = 4

results = []

# Stop one window short of the end so that weeks[i + window] always exists
for i in range(len(weeks) - window):
    # Pool every tick from `window` consecutive weeks
    weeks_group = weeks[i: i + window]
    # Key the result by the week that follows the window
    key = weeks[i + window]
    weeks_rolling = df[df["week"].isin(weeks_group)]
    average_price = weeks_rolling["price"].mean()
    quantile = weeks_rolling["price"].quantile(0.5)
    count = weeks_rolling["price"].count()
    results.append({key: [average_price, quantile, count]})
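One possible way to tidy up this loop is sketched below; it is an illustration, not a tested drop-in replacement. It sorts the ticks once and slices a NumPy array per window instead of filtering the whole frame with `isin` each time. The function name `pooled_rolling_stats` is hypothetical. It assumes the `week` labels are unique and sortable (for example a combined year and week number), that `price` has no NaN values, and that "consecutive" means consecutive labels present in the data. Unlike the loop above, each result is keyed by the last week inside the window rather than the week after it.

```python
import numpy as np
import pandas as pd

def pooled_rolling_stats(df, window=4, q=0.5):
    """Summarize all ticks pooled from each run of `window` consecutive weeks."""
    ordered = df.sort_values("week", kind="stable")
    weeks = ordered["week"].to_numpy()
    prices = ordered["price"].to_numpy()
    labels = np.unique(weeks)  # sorted unique week labels
    # Row offset where each week starts, plus one past the end of the last week
    starts = np.searchsorted(weeks, labels, side="left")
    bounds = np.append(starts, len(weeks))
    rows = []
    for i in range(window - 1, len(labels)):
        # All ticks from weeks labels[i - window + 1] .. labels[i]
        chunk = prices[bounds[i - window + 1]: bounds[i + 1]]
        rows.append({
            "week": labels[i],
            "mean": chunk.mean(),
            "quantile": np.quantile(chunk, q),
            "count": int(chunk.size),
        })
    columns = ["week", "mean", "quantile", "count"]
    return pd.DataFrame(rows, columns=columns).set_index("week")

# Hypothetical toy data: five weeks with uneven tick counts
toy = pd.DataFrame({
    "week": [1, 1, 2, 3, 3, 4, 5],
    "price": [1.0, 3.0, 2.0, 4.0, 6.0, 10.0, 20.0],
})
stats = pooled_rolling_stats(toy, window=4)
print(stats)
```

The result has one row per window, so it can be joined back onto the tick rows with `df.join(stats, on="week")` if per-tick columns are needed.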
python pandas group-by time-series timestamp
1 Answer
0 votes
import pandas as pd

# Assuming df is your DataFrame and already includes the 'week' and 'price' columns

# Define a custom function to apply on each group
def rolling_stats(group, window=4):
    # Roll over the last `window` rows of this group's 'price' column
    # (a row-count window within one week, not a window of weeks)
    rolling_prices = group['price'].rolling(window=window, min_periods=1)
    mean = rolling_prices.mean()
    median = rolling_prices.median()
    count = rolling_prices.count()

    # You can adjust the format of return to match your needs
    return pd.DataFrame({
        'mean_price': mean,
        'median_price': median,
        'count_price': count
    })

# Sort by week with a stable sort so rows keep their original order within each week
df_sorted = df.sort_values('week', kind='stable')

# Group by 'week', and apply the rolling statistics
# Using 'group_keys=False' to keep the original DataFrame index
results = df_sorted.groupby('week', group_keys=False).apply(rolling_stats)

# If you need to merge these results back into the original DataFrame:
df_merged = df.merge(results, left_index=True, right_index=True, how='left')

# Display the merged DataFrame
print(df_merged.head())
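If the DataFrame still carries the original timestamps, a time-offset window is another option worth trying, since pandas accepts an offset such as `"28D"` in `rolling` when the index is a sorted `DatetimeIndex`. The sketch below runs on hypothetical synthetic ticks (one every 6 hours for 8 weeks). Note that it measures a trailing 28 days from each tick rather than four calendar weeks, so it approximates the requested grouping rather than reproducing it.

```python
import numpy as np
import pandas as pd

# Hypothetical tick data: one tick every 6 hours for 8 weeks, starting on a Monday
idx = pd.date_range("2024-01-01", periods=8 * 7 * 4, freq=pd.Timedelta(hours=6))
ticks = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="price")

# Each window pools every tick in the 28 days up to and including the current tick
roll = ticks.rolling("28D")
stats = pd.DataFrame({
    "mean": roll.mean(),
    "median": roll.quantile(0.5),
    "count": roll.count(),
})

# One snapshot per calendar week: the values at the last tick of each week
weekly = stats.resample("W").last()
print(weekly)
```

Because the window is defined by time, the varying number of ticks per week no longer matters; the `count` column reports how many ticks each window actually contained.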