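I have an odd DataFrame that holds theoretical deposit amounts up to day 30, so it contains 30 triplets of columns (date, week, deposit). I want to gather all of this data into a single set of columns containing every date and deposit, regardless of whether it occurred on that player's day 1, day 2, or day X. I tried the following code, which gives the correct output but takes more than 120 minutes to run on a large dataset: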
import pandas as pd
# Sample DataFrame with Day 1 and Day 2 data
data = {
    'Day_date_1': ['2024-01-01', '2024-01-02'],
    'Day_date_1_week_year': ['2024-W01', '2024-W01'],
    'Day1_deposit': [100, 200],
    'Day_date_2': ['2024-01-02', '2024-01-03'],
    'Day_date_2_week_year': ['2024-W01', '2024-W01'],
    'Day2_deposit': [150, 250],
    'PLAYER_ID': [1, 2],
}
df = pd.DataFrame(data)
date_columns = [f'Day_date_{i}' for i in range(1, 3)] # Includes Day 1 and Day 2
deposit_columns = [f'Day{i}_deposit' for i in range(1, 3)] # Includes Day 1 and Day 2
deposit_week_columns = [f'Day_date_{i}_week_year' for i in range(1, 3)] # Includes Day 1 and Day 2
# Empty DataFrame to store results
result_df = pd.DataFrame()
# Processing the DataFrame
for date_col, week_col, value_col in zip(date_columns, deposit_week_columns, deposit_columns):
    temp_df = df[[date_col, week_col, value_col, "PLAYER_ID"]]
    temp_df.columns = ["deposit_date", 'deposit_week', "deposit_amount", "PLAYER_ID"]
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
    result_df = pd.concat([result_df, temp_df], ignore_index=True)
result_df['deposit_date'] = pd.to_datetime(result_df['deposit_date'])
result_deposit = result_df.groupby(['PLAYER_ID', 'deposit_date', 'deposit_week'])['deposit_amount'].mean().reset_index().rename(columns={'deposit_amount': 'mean_deposit_amount'})
# Output the result
print(result_deposit)
Is there a way to vectorize the loop, or to speed up the processing with multiprocessing?
My desired output looks like this:
PLAYER_ID deposit_date deposit_week mean_deposit_amount
1 2024-01-01 2024-W01 100.0
1 2024-01-02 2024-W01 150.0
2 2024-01-02 2024-W01 200.0
2 2024-01-03 2024-W01 250.0
One option is pandas.wide_to_long, after renaming the columns so that the day number becomes the trailing suffix:
out = (
    pd.wide_to_long(df.set_axis(df.columns.str.replace(r'_?(\d+)_(week_year|deposit)',
                                                       r'_\2_\1', regex=True),
                                axis=1),
                    stubnames=['Day_date', 'Day_date_week_year', 'Day_deposit'],
                    i='PLAYER_ID', j='j',
                    sep='_')
    .reset_index('PLAYER_ID')
    .rename(columns={'Day_date': 'deposit_date',
                     'Day_date_week_year': 'deposit_week',
                     'Day_deposit': 'mean_deposit_amount',
                     })
    .sort_values(by=['PLAYER_ID', 'deposit_date'])
    .reset_index(drop=True)
)
Output:
PLAYER_ID deposit_date deposit_week mean_deposit_amount
0 1 2024-01-01 2024-W01 100
1 1 2024-01-02 2024-W01 150
2 2 2024-01-02 2024-W01 200
3 2 2024-01-03 2024-W01 250
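Note that the wide_to_long version only reshapes: it does not convert the dates or average repeated (player, date, week) combinations the way the code in the question does. If that averaging matters, a minimal sketch that keeps the question's logic but concatenates only once might look like the following. The function name `stack_deposits` and the `n_days` parameter are hypothetical, introduced here for illustration; the column names are taken from the sample data. Growing a DataFrame inside a loop copies the accumulated rows on every iteration, so collecting the pieces in a list and calling pd.concat a single time is usually much cheaper.

```python
import pandas as pd


def stack_deposits(df: pd.DataFrame, n_days: int) -> pd.DataFrame:
    """Stack the per-day (date, week, deposit) triplets, then average per player/date/week."""
    new_names = ["deposit_date", "deposit_week", "deposit_amount"]
    pieces = []
    for i in range(1, n_days + 1):
        # Column naming follows the sample data in the question
        old_names = [f"Day_date_{i}", f"Day_date_{i}_week_year", f"Day{i}_deposit"]
        piece = df[["PLAYER_ID", *old_names]].rename(columns=dict(zip(old_names, new_names)))
        pieces.append(piece)

    # One concat at the end instead of growing a DataFrame inside the loop
    long_df = pd.concat(pieces, ignore_index=True)
    long_df["deposit_date"] = pd.to_datetime(long_df["deposit_date"])

    return (
        long_df.groupby(["PLAYER_ID", "deposit_date", "deposit_week"], as_index=False)["deposit_amount"]
        .mean()
        .rename(columns={"deposit_amount": "mean_deposit_amount"})
    )


# Sample data from the question (two days instead of thirty)
df = pd.DataFrame({
    "Day_date_1": ["2024-01-01", "2024-01-02"],
    "Day_date_1_week_year": ["2024-W01", "2024-W01"],
    "Day1_deposit": [100, 200],
    "Day_date_2": ["2024-01-02", "2024-01-03"],
    "Day_date_2_week_year": ["2024-W01", "2024-W01"],
    "Day2_deposit": [150, 250],
    "PLAYER_ID": [1, 2],
})

result = stack_deposits(df, n_days=2)
print(result)
```

On the full dataset this would be called with `n_days=30`. The loop still runs once per day, but each iteration only selects and renames columns, so the per-row work stays inside pandas.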