I have a dataframe like the one shown below. It has about a million rows with unique person_id values.
+-----------+------------+----------+
| person_id | date | activity |
+-----------+------------+----------+
| A | 31/03/2022 | Sell |
| A | 02/03/2023 | Buy |
| A | 29/08/2023 | Buy |
| A | 13/05/2023 | Buy |
| A | 28/02/2023 | Sell |
| A | 02/01/2024 | Sell |
+-----------+------------+----------+
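I want to count activities per person_id, based on the time elapsed between each activity's date and "today". The new columns should cover look-back windows of 12, 9, 6 and 3 months, separately for Buy and Sell.
For example, if someone asks how many "Buy" activities A had in the last 3 months (counting back from "today"), we should be able to answer zero. If asked the same about "Sell", the answer is one.
Here the window length is measured from the current date, i.e. the system date on the day the script runs.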
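To make the window definition concrete, here is a minimal sketch of the 3-month case. It assumes one possible reading of the requirement, namely an exact date cutoff of "today minus N months"; the helper name `window_start` and the pinned run date of 4 January 2024 are illustrative assumptions, not part of the original data pipeline.

```python
import pandas as pd

def window_start(months, today=None):
    """Return the earliest date inside a look-back window of `months` months."""
    # `today` defaults to the system date; pass a fixed value for reproducible runs
    today = pd.Timestamp.today().normalize() if today is None else pd.Timestamp(today)
    return today - pd.DateOffset(months=months)

# Sample rows for person A, with an assumed run date of 4 January 2024
today = pd.Timestamp("2024-01-04")
sell_dates = pd.to_datetime(["31/03/2022", "28/02/2023", "02/01/2024"], format="%d/%m/%Y")
buy_dates = pd.to_datetime(["02/03/2023", "29/08/2023", "13/05/2023"], format="%d/%m/%Y")

cutoff = window_start(3, today)                  # 2023-10-04
buys_3m = int((buy_dates >= cutoff).sum())       # no Buy on or after the cutoff
sells_3m = int((sell_dates >= cutoff).sum())     # only the Sell on 02/01/2024
print(buys_3m, sells_3m)
```

With that assumed run date, the sketch gives zero buys and one sell for the 3-month window, which agrees with the example above.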
The output should look like this.
+-----------+------------+----------+---------+--------+--------+--------+----------+---------+---------+---------+
| person_id | date | activity | buy_12m | buy_9m | buy_6m | buy_3m | sell_12m | sell_9m | sell_6m | sell_3m |
+-----------+------------+----------+---------+--------+--------+--------+----------+---------+---------+---------+
| A | 31/03/2022 | Sell | 3 | 2 | 1 | 0 | 2 | 1 | 1 | 1 |
| A | 02/03/2023 | Buy | 3 | 2 | 1 | 0 | 2 | 1 | 1 | 1 |
| A | 29/08/2023 | Buy | 3 | 2 | 1 | 0 | 2 | 1 | 1 | 1 |
| A | 13/05/2023 | Buy | 3 | 2 | 1 | 0 | 2 | 1 | 1 | 1 |
| A | 28/02/2023 | Sell | 3 | 2 | 1 | 0 | 2 | 1 | 1 | 1 |
| A | 02/01/2024 | Sell | 3 | 2 | 1 | 0 | 2 | 1 | 1 | 1 |
+-----------+------------+----------+---------+--------+--------+--------+----------+---------+---------+---------+
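The output has the same number of rows as the input; I am effectively "exploding" the group_by result back onto every row. I am fine with that duplication because the result feeds another system that needs the same row count.
I have this working in my own way, but I relied on relativedelta plus a lot of joins and group_by calls, so the code ended up long and slow to run.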
Try this:
import pandas as pd
import numpy as np
import itertools
# +-----------+------------+----------+
# | person_id | date | activity |
# +-----------+------------+----------+
# | A | 31/03/2022 | Sell |
# | A | 02/03/2023 | Buy |
# | A | 29/08/2023 | Buy |
# | A | 13/05/2023 | Buy |
# | A | 28/02/2023 | Sell |
# | A | 02/01/2024 | Sell |
# +-----------+------------+----------+
# Dates are written in ISO format; the values match the table above
data = {
    'person_id': ['A'] * 6,
    'date': ['2022-03-31', '2023-03-02', '2023-08-29', '2023-05-13', '2023-02-28', '2024-01-02'],
    'activity': ['Sell'] + ['Buy'] * 3 + ['Sell'] * 2
}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
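# Pinned so the example reproduces the expected output; use pd.Timestamp.today().normalize() for the system date
df['today'] = pd.to_datetime('2024-01-04')
# Whole calendar months between each activity and today (the day of the month is ignored)
df['nr_of_months_passed'] = (df['today'].dt.to_period('M') - df['date'].dt.to_period('M')).apply(lambda x: x.n)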
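def calc_cols(group):
    group = group.copy()  # avoid mutating the frame that groupby passes in
    for action, month in itertools.product(['Buy', 'Sell'], [3, 6, 9, 12]):
        in_window = (group['nr_of_months_passed'] <= month) & (group['activity'] == action)
        # .sum() is a scalar, so it is repeated on every row of the group
        group[f'{action.lower()}_{month}m'] = in_window.sum()
    return group

# group_keys=False keeps the original row index instead of adding person_id as an index level
result = df.groupby('person_id', group_keys=False).apply(calc_cols)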
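Since the question mentions around a million rows, calling a Python function once per person through `apply` may become the slow part. Below is a hedged sketch of a vectorized variant that builds a 0/1 indicator per window and sums it per person with `groupby(...).transform("sum")`. The function name `add_window_counts` and its parameters are assumptions made for illustration, and it uses the same calendar-month difference as the code above rather than exact date cutoffs.

```python
import itertools
import pandas as pd

def add_window_counts(df, today, months=(3, 6, 9, 12), actions=("Buy", "Sell")):
    """Append <action>_<n>m count columns, repeated on every row of each person."""
    out = df.copy()
    # Whole calendar months between each activity and `today`
    months_passed = (today.year - out["date"].dt.year) * 12 + (today.month - out["date"].dt.month)
    for action, n in itertools.product(actions, months):
        # 1 where the row is the given action inside the n-month window, else 0
        hit = ((out["activity"] == action) & (months_passed <= n)).astype("int64")
        # transform("sum") returns one value per input row, so the row count is unchanged
        out[f"{action.lower()}_{n}m"] = hit.groupby(out["person_id"]).transform("sum")
    return out

df = pd.DataFrame({
    "person_id": ["A"] * 6,
    "date": pd.to_datetime(
        ["31/03/2022", "02/03/2023", "29/08/2023", "13/05/2023", "28/02/2023", "02/01/2024"],
        format="%d/%m/%Y",
    ),
    "activity": ["Sell", "Buy", "Buy", "Buy", "Sell", "Sell"],
})

# Run date pinned to 4 January 2024 so the sample matches the expected output
result = add_window_counts(df, today=pd.Timestamp("2024-01-04"))
print(result)
```

On the sample data this should give the same counts as the expected output table, with the row count unchanged. For a real run, `today` could be replaced by `pd.Timestamp.today().normalize()`.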