Get activity counts based on a relative delta from today

Problem description

I have a dataframe like the one below, covering about a million unique person_id values.

+-----------+------------+----------+
| person_id |    date    | activity |
+-----------+------------+----------+
| A         | 31/03/2022 | Sell     |
| A         | 02/03/2023 | Buy      |
| A         | 29/08/2023 | Buy      |
| A         | 13/05/2023 | Buy      |
| A         | 28/02/2023 | Sell     |
| A         | 02/01/2024 | Sell     |
+-----------+------------+----------+

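I want to compute activity counts based on how long before "today" each activity happened, grouped by person_id. The new columns should cover the last 12, 9, 6 and 3 months, separately for Buy and Sell.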

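For example, if someone asks how many "Buy" activities A had in the past 3 months (counting back from "today"), we should be able to answer zero. If asked the same about "Sell", the answer should be one.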

Here, each window is measured back from the current date, which is the system date on the day the script runs.
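
To make the window definition concrete, here is a minimal sketch of how the cutoff dates could be derived. It assumes `today` is pinned to 2024-01-04 so the numbers are reproducible (a real run would use `pd.Timestamp.today().normalize()`); the names `cutoffs` and `count_since` are illustrative only and not part of the original code.

```python
import pandas as pd

# Assumption: pinned date so the example is reproducible
today = pd.Timestamp("2024-01-04")

# Start of each look-back window, e.g. 3 months back from today
cutoffs = {m: today - pd.DateOffset(months=m) for m in (12, 9, 6, 3)}

# Person A's rows from the sample table (dates are day-first)
dates = pd.Series(pd.to_datetime(
    ["31/03/2022", "02/03/2023", "29/08/2023", "13/05/2023", "28/02/2023", "02/01/2024"],
    format="%d/%m/%Y",
))
activity = pd.Series(["Sell", "Buy", "Buy", "Buy", "Sell", "Sell"])

def count_since(action, months):
    # Count rows of the given activity dated on or after the window start
    in_window = dates >= cutoffs[months]
    return int((in_window & (activity == action)).sum())

print(count_since("Buy", 3), count_since("Sell", 3))
```

With these assumptions the 3-month window starts on 2023-10-04, which is why no "Buy" and exactly one "Sell" falls inside it.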

The output should look like this.

+-----------+------------+----------+---------+--------+--------+--------+----------+---------+---------+---------+
| person_id |    date    | activity | buy_12m | buy_9m | buy_6m | buy_3m | sell_12m | sell_9m | sell_6m | sell_3m |
+-----------+------------+----------+---------+--------+--------+--------+----------+---------+---------+---------+
| A         | 31/03/2022 | Sell     |       3 |      2 |      1 |      0 |        2 |       1 |       1 |       1 |
| A         | 02/03/2023 | Buy      |       3 |      2 |      1 |      0 |        2 |       1 |       1 |       1 |
| A         | 29/08/2023 | Buy      |       3 |      2 |      1 |      0 |        2 |       1 |       1 |       1 |
| A         | 13/05/2023 | Buy      |       3 |      2 |      1 |      0 |        2 |       1 |       1 |       1 |
| A         | 28/02/2023 | Sell     |       3 |      2 |      1 |      0 |        2 |       1 |       1 |       1 |
| A         | 02/01/2024 | Sell     |       3 |      2 |      1 |      0 |        2 |       1 |       1 |       1 |
+-----------+------------+----------+---------+--------+--------+--------+----------+---------+---------+---------+

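The output has the same number of rows as the input; I am effectively "exploding" the group_by result back onto every row. I am fine with this duplication because the data feeds another system that expects the same number of rows.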
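
As a small illustration of that "explode back onto every row" behaviour, pandas' `groupby(...).transform` returns one value per input row. The sketch below uses a hypothetical boolean column `is_recent_buy` that is not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": ["A", "A", "A", "B"],
    "is_recent_buy": [True, False, True, True],  # hypothetical flag column
})

# transform returns a result aligned with the input rows,
# so each person's total is repeated on all of that person's rows
df["buy_count"] = df.groupby("person_id")["is_recent_buy"].transform("sum")
print(df)
```

Because the result stays aligned with the original index, no join back onto the input frame is needed.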

I have managed to do this my own way, but it relies on relativedelta plus a lot of joins and group_by calls, so the code ended up long and slow to run.

python python-3.x pandas dataframe group-by
1 Answer

Try this:

import itertools

import pandas as pd

# Sample data mirroring the table in the question:
# +-----------+------------+----------+
# | person_id |    date    | activity |
# +-----------+------------+----------+
# | A         | 31/03/2022 | Sell     |
# | A         | 02/03/2023 | Buy      |
# | A         | 29/08/2023 | Buy      |
# | A         | 13/05/2023 | Buy      |
# | A         | 28/02/2023 | Sell     |
# | A         | 02/01/2024 | Sell     |
# +-----------+------------+----------+

data = {
    'person_id': ['A'] * 6,
    'date': ['2022-03-31', '2023-03-02', '2023-08-29', '2023-05-13', '2023-02-28', '2024-01-02'],
    'activity': ['Sell'] + ['Buy'] * 3 + ['Sell'] * 2
}

df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

# Pinned here so the example is reproducible;
# use pd.Timestamp.today().normalize() to get the system date instead.
today = pd.Timestamp('2024-01-04')

# Whole calendar months between each activity and today (the day of month is ignored)
df['nr_of_months_passed'] = (today.year - df['date'].dt.year) * 12 + (today.month - df['date'].dt.month)

def calc_cols(group):
    # Assigning a scalar count broadcasts it to every row of this person's group
    for action, month in itertools.product(['Buy', 'Sell'], [12, 9, 6, 3]):
        group[f'{action.lower()}_{month}m'] = ((group['nr_of_months_passed'] <= month) & (group['activity'] == action)).sum()
    return group

# group_keys=False keeps the original row index.
# Note: pandas 2.2+ emits a DeprecationWarning here because apply operates on the grouping column.
result = df.groupby('person_id', group_keys=False).apply(calc_cols)
print(result)

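This returns the original rows with the eight count columns added; for the sample data the counts match the expected output in the question.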
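
Since the question mentions about a million rows, a `groupby(...).transform` variant may be worth trying instead of `apply`. The following is only a sketch under stated assumptions: the helper name `add_window_counts` is hypothetical, `today` is pinned for reproducibility, and the windows use exact `pd.DateOffset` cutoffs, which is a different rule from the calendar-month arithmetic in the answer above and can give different counts near window boundaries.

```python
import pandas as pd

def add_window_counts(df, today, windows=(12, 9, 6, 3), actions=("Buy", "Sell")):
    """Return a copy of df with <action>_<n>m counts repeated on every row per person_id."""
    out = df.copy()
    for action in actions:
        is_action = out["activity"].eq(action)
        for months in windows:
            # Exact cutoff: the same day of month, `months` months before today
            cutoff = today - pd.DateOffset(months=months)
            flag = is_action & out["date"].ge(cutoff)
            # transform keeps one value per input row, so no join is needed
            out[f"{action.lower()}_{months}m"] = (
                flag.groupby(out["person_id"]).transform("sum")
            )
    return out

df = pd.DataFrame({
    "person_id": ["A"] * 6,
    "date": pd.to_datetime(
        ["2022-03-31", "2023-03-02", "2023-08-29", "2023-05-13", "2023-02-28", "2024-01-02"]
    ),
    "activity": ["Sell", "Buy", "Buy", "Buy", "Sell", "Sell"],
})

# Assumption: pinned date; a real run would pass pd.Timestamp.today().normalize()
result = add_window_counts(df, today=pd.Timestamp("2024-01-04"))
print(result)
```

This makes eight grouped passes over the data (one per column) and avoids calling a Python function per group, which is usually the slow part of `apply` when there are many distinct person_id values.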
