Vectorizing or using multiprocessing on a large DataFrame


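I have an odd DataFrame that holds theoretical deposit amounts for each day up to day 30, so it contains 30 triplets of columns (a date, a week-of-year label, and a deposit amount). I want to gather everything into single columns that hold all dates and deposits, regardless of whether a deposit happened on that player's day 1, day 2, or day X. I tried the code below. It produces the correct output, but it takes more than 120 minutes to run on the large data set: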

import pandas as pd

# Sample DataFrame with Day 1 and Day 2 data
data = {
    'Day_date_1': ['2024-01-01', '2024-01-02'],
    'Day_date_1_week_year': ['2024-W01', '2024-W01'],
    'Day1_deposit': [100, 200],
    'Day_date_2': ['2024-01-02', '2024-01-03'],
    'Day_date_2_week_year': ['2024-W01', '2024-W01'],
    'Day2_deposit': [150, 250],
    'PLAYER_ID': [1, 2],
}

# Build the DataFrame that the loop below reads from
df = pd.DataFrame(data)

date_columns = [f'Day_date_{i}' for i in range(1, 3)]  # Includes Day 1 and Day 2
deposit_columns = [f'Day{i}_deposit' for i in range(1, 3)]  # Includes Day 1 and Day 2
deposit_week_columns = [f'Day_date_{i}_week_year' for i in range(1, 3)]  # Includes Day 1 and Day 2

# Empty DataFrame to store results
result_df = pd.DataFrame()

# Processing the DataFrame
for date_col, week_col, value_col in zip(date_columns, deposit_week_columns, deposit_columns):
    temp_df = df[[date_col, week_col, value_col, "PLAYER_ID"]]
    temp_df.columns = ["deposit_date", 'deposit_week', "deposit_amount", "PLAYER_ID"]
    # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
    result_df = pd.concat([result_df, temp_df], ignore_index=True)

result_df['deposit_date'] = pd.to_datetime(result_df['deposit_date'])
result_deposit = result_df.groupby(['PLAYER_ID', 'deposit_date', 'deposit_week'])['deposit_amount'].mean().reset_index().rename(columns={'deposit_amount': 'mean_deposit_amount'})

# Output the result
print(result_deposit)

Is there a way to vectorize the loop, or to speed up the processing with multiprocessing?

The output I want looks like this:

PLAYER_ID   deposit_date    deposit_week    mean_deposit_amount
1              2024-01-01     2024-W01            100.0
1              2024-01-02     2024-W01            150.0
2              2024-01-02     2024-W01            200.0
2              2024-01-03     2024-W01            250.0

python pandas multiprocessing vectorization
1 Answer

I would use pandas.wide_to_long, after renaming the columns so that the day number sits at the end of each name:

out = (
 pd.wide_to_long(df.set_axis(df.columns.str.replace(r'_?(\d+)_(week_year|deposit)',
                                                    r'_\2_\1', regex=True),
                             axis=1),
                 stubnames=['Day_date', 'Day_date_week_year', 'Day_deposit'],
                 i='PLAYER_ID', j='j',
                 sep='_')
   .reset_index('PLAYER_ID')
   .rename(columns={'Day_date': 'deposit_date',
                    'Day_date_week_year': 'deposit_week',
                    'Day_deposit': 'mean_deposit_amount',
                   })
   .sort_values(by=['PLAYER_ID', 'deposit_date'])
   .reset_index(drop=True)
)

Output:

   PLAYER_ID deposit_date deposit_week  mean_deposit_amount
0          1   2024-01-01     2024-W01                  100
1          1   2024-01-02     2024-W01                  150
2          2   2024-01-02     2024-W01                  200
3          2   2024-01-03     2024-W01                  250
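
Note that the output above matches the desired one on the sample only because no player has the same date in more than one day slot. The code in the question also converts the dates with pd.to_datetime and averages the deposits per player, date, and week. Below is a minimal sketch of adding those two steps on top of wide_to_long. It assumes PLAYER_ID is unique per row (wide_to_long needs the id column to identify each row), and the names renamed, long_df, and day are illustrative choices that do not come from the original answer:

```python
import pandas as pd

# Sample data copied from the question (2 of the 30 day triplets)
df = pd.DataFrame({
    'Day_date_1': ['2024-01-01', '2024-01-02'],
    'Day_date_1_week_year': ['2024-W01', '2024-W01'],
    'Day1_deposit': [100, 200],
    'Day_date_2': ['2024-01-02', '2024-01-03'],
    'Day_date_2_week_year': ['2024-W01', '2024-W01'],
    'Day2_deposit': [150, 250],
    'PLAYER_ID': [1, 2],
})

# Move the day number to the end of each column name, e.g.
# 'Day_date_1_week_year' -> 'Day_date_week_year_1' and
# 'Day1_deposit' -> 'Day_deposit_1', so every column reads <stub>_<day>.
renamed = df.set_axis(
    df.columns.str.replace(r'_?(\d+)_(week_year|deposit)', r'_\2_\1', regex=True),
    axis=1,
)

# Reshape all day triplets at once instead of looping over them
long_df = (
    pd.wide_to_long(
        renamed,
        stubnames=['Day_date', 'Day_date_week_year', 'Day_deposit'],
        i='PLAYER_ID', j='day', sep='_',
    )
    .reset_index()
    .rename(columns={'Day_date': 'deposit_date',
                     'Day_date_week_year': 'deposit_week',
                     'Day_deposit': 'deposit_amount'})
)
long_df['deposit_date'] = pd.to_datetime(long_df['deposit_date'])

# Same aggregation as in the question: mean deposit per player, date, and week
result_deposit = (
    long_df.groupby(['PLAYER_ID', 'deposit_date', 'deposit_week'],
                    as_index=False)['deposit_amount']
    .mean()
    .rename(columns={'deposit_amount': 'mean_deposit_amount'})
)
print(result_deposit)
```

Whether this is fast enough on the full 30-day data set would need to be measured. The sketch only shows that the reshape and the aggregation can both be expressed without a Python-level loop.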