How to select user_id rows in pandas

Question description (votes: 0, answers: 2)

How can I compute, for each user, the first and last visit dates that precede each order?

USER ID TYPE    DATE
1   Visited September 14, 2020
1   Visited October 4, 2020
1   Visited October 24, 2020
1   Ordered November 1, 2020
2   Visited September 14, 2020
2   Visited October 1, 2020
3   Visited September 1, 2020
3   Visited October 4, 2020
3   Visited October 4, 2020
3   Visited October 19, 2020
3   Ordered January 1, 2021
3   Visited February 11, 2021
3   Visited February 24, 2021
3   Visited March 1, 2021
3   Ordered April 21, 2021

Expected output:

USER ID Ordered MIN DATE    MAX DATE
1   1   September 14, 2020  October 24, 2020
2   0   September 14, 2020  NaT
3   1   September 1, 2020   October 19, 2020
3   2   February 11, 2021   March 1, 2021
python pandas filter feature-extraction group
2 Answers
0 votes

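To compute the first and last visit dates before each order, sort the data by user ID and date, then loop over the rows. For each user, track the first and last visit dates and update them as you go. When you reach an order for that user, output the first and last visit dates recorded before it, then reset both so the next order only covers later visits.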

import pandas as pd
data = pd.DataFrame({
    # Same 15 rows as in the question, including the March 1, 2021 visit.
    'USER ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    'TYPE': ['Visited', 'Visited', 'Visited', 'Ordered', 'Visited', 'Visited',
             'Visited', 'Visited', 'Visited', 'Visited', 'Ordered', 'Visited', 'Visited',
             'Visited', 'Ordered'],
    'DATE': ['September 14, 2020', 'October 4, 2020', 'October 24, 2020', 'November 1, 2020',
             'September 14, 2020', 'October 1, 2020', 'September 1, 2020', 'October 4, 2020',
             'October 4, 2020', 'October 19, 2020', 'January 1, 2021', 'February 11, 2021',
             'February 24, 2021', 'March 1, 2021', 'April 21, 2021']})

data['DATE'] = pd.to_datetime(data['DATE'])
grouped_data = data.groupby('USER ID')

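for name, group in grouped_data:
    first_visited_date = None
    last_visited_date = None

    for index, row in group.iterrows():
        if row['TYPE'] == 'Visited':
            # Remember the first visit of the current window; keep updating the last one.
            if first_visited_date is None:
                first_visited_date = row['DATE']
            last_visited_date = row['DATE']
        elif row['TYPE'] == 'Ordered':
            print(f"User {name}: First visited date = {first_visited_date}, Last visited date = {last_visited_date}")
            # Start a new window so the next order only reports visits made after this one.
            first_visited_date = None
            last_visited_date = None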

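The sample data is already sorted by user and date, so no sort is needed here. For unsorted data, call data.sort_values(['USER ID', 'DATE']) before grouping.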
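The same idea can be sketched without an explicit loop, returning a DataFrame instead of printing. This is only an illustration under some assumptions: the helper columns `is_order`, `window`, and `visit_date` are names made up for this sketch, and a window of visits that is not followed by an order is reported with Ordered = 0 and an empty MAX_DATE, mirroring the expected output in the question.

```python
import pandas as pd

# Sample rows copied from the question.
data = pd.DataFrame({
    "USER ID": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    "TYPE": ["Visited", "Visited", "Visited", "Ordered", "Visited", "Visited",
             "Visited", "Visited", "Visited", "Visited", "Ordered",
             "Visited", "Visited", "Visited", "Ordered"],
    "DATE": ["September 14, 2020", "October 4, 2020", "October 24, 2020",
             "November 1, 2020", "September 14, 2020", "October 1, 2020",
             "September 1, 2020", "October 4, 2020", "October 4, 2020",
             "October 19, 2020", "January 1, 2021", "February 11, 2021",
             "February 24, 2021", "March 1, 2021", "April 21, 2021"],
})
data["DATE"] = pd.to_datetime(data["DATE"])
data = data.sort_values(["USER ID", "DATE"]).reset_index(drop=True)

# 1 on order rows, 0 on visit rows (hypothetical helper column).
data["is_order"] = data["TYPE"].eq("Ordered").astype(int)
# Orders placed by the same user strictly before each row, so an order row
# lands in the same window as the visits leading up to it.
data["window"] = data.groupby("USER ID")["is_order"].cumsum() - data["is_order"]
# Keep dates for visits only so order dates do not affect min/max.
data["visit_date"] = data["DATE"].where(data["TYPE"].eq("Visited"))

summary = (
    data.groupby(["USER ID", "window"])
        .agg(has_order=("is_order", "max"),
             MIN_DATE=("visit_date", "min"),
             MAX_DATE=("visit_date", "max"))
        .reset_index()
)

# Assumption: windows without an order get Ordered = 0 and no MAX_DATE.
has_order = summary["has_order"].eq(1)
summary["Ordered"] = (summary["window"] + 1).where(has_order, 0)
summary["MAX_DATE"] = summary["MAX_DATE"].where(has_order)
summary = summary[["USER ID", "Ordered", "MIN_DATE", "MAX_DATE"]]
print(summary)
```

Subtracting `is_order` from the per-user cumulative sum is what keeps each order in the same window as the visits before it; without that step the order row would open the next window instead.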


0 votes

Try the following, where df is the DataFrame from the question:

df['DATE'] = pd.to_datetime(df['DATE'])

df_out = df.assign(grp=(df['TYPE'] == 'Ordered')[::-1].cumsum())\
           .set_index(['USER ID', 'grp', 'TYPE'], append=True)['DATE']\
           .unstack('TYPE')\
           .groupby(['USER ID', 'grp'], sort=False)\
           .agg(Ordered=('Ordered','count'), 
                MIN_DATE=('Visited','first'), 
                MAX_DATE=('Visited','last'))\
           .reset_index('grp', drop=True)\
           .reset_index()

df_out['MAX_DATE'] = df_out['MAX_DATE'].mask(df_out['Ordered'] == 0)
df_out['Ordered'] = df_out['Ordered'].groupby(df_out['USER ID']).cumsum()

df_out['MIN_DATE'] = df_out['MIN_DATE'].dt.strftime('%B %d, %Y')
df_out['MAX_DATE'] = df_out['MAX_DATE'].dt.strftime('%B %d, %Y')

Output:

   USER ID  Ordered            MIN_DATE          MAX_DATE
0        1        1  September 14, 2020  October 24, 2020
1        2        0  September 14, 2020               NaN
2        3        1  September 01, 2020  October 19, 2020
3        3        2   February 11, 2021    March 01, 2021
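To see why the reversed cumulative sum works as a grouping key, it may help to look at the `grp` column on its own. The sketch below uses a small made-up frame (not the question's data): every row is labelled with the number of orders at or after it in the whole frame, so all visits leading up to an order share that order's label.

```python
import pandas as pd

# Hypothetical miniature data set, for illustration only.
df = pd.DataFrame({
    "USER ID": [1, 1, 2, 3, 3, 3],
    "TYPE": ["Visited", "Ordered", "Visited", "Visited", "Ordered", "Visited"],
})

is_order = df["TYPE"] == "Ordered"
# Reverse the rows, take the running count of orders, and let assign()
# align the result back to the original rows by index.
df = df.assign(grp=is_order[::-1].cumsum())
print(df)
```

Here `grp` comes out as 2, 2, 1, 1, 1, 0. The count runs over the whole frame rather than per user, so user 2 and user 3 both contain rows labelled 1; this is why the answer groups by both 'USER ID' and 'grp' rather than by 'grp' alone.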