How can I calculate, for each user, the first and last visit dates before each order?
USER ID  TYPE     DATE
1        Visited  September 14, 2020
1        Visited  October 4, 2020
1        Visited  October 24, 2020
1        Ordered  November 1, 2020
2        Visited  September 14, 2020
2        Visited  October 1, 2020
3        Visited  September 1, 2020
3        Visited  October 4, 2020
3        Visited  October 4, 2020
3        Visited  October 19, 2020
3        Ordered  January 1, 2021
3        Visited  February 11, 2021
3        Visited  February 24, 2021
3        Visited  March 1, 2021
3        Ordered  April 21, 2021
Expected output:
USER ID  Ordered  MIN DATE            MAX DATE
1        1        September 14, 2020  October 24, 2020
2        0        September 14, 2020  NaT
3        1        September 1, 2020   October 19, 2020
3        2        February 11, 2021   March 1, 2021
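To calculate the first and last visit dates before each order, sort the data by user ID and date, then loop over the rows. For each user, track the first and last visit dates and update them as you go. When you reach an order for that user, output the first and last visit dates recorded before it.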
import pandas as pd

data = pd.DataFrame({
    'USER ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    'TYPE': ['Visited', 'Visited', 'Visited', 'Ordered', 'Visited', 'Visited',
             'Visited', 'Visited', 'Visited', 'Visited', 'Ordered', 'Visited',
             'Visited', 'Visited', 'Ordered'],
    'DATE': ['September 14, 2020', 'October 4, 2020', 'October 24, 2020', 'November 1, 2020',
             'September 14, 2020', 'October 1, 2020', 'September 1, 2020', 'October 4, 2020',
             'October 4, 2020', 'October 19, 2020', 'January 1, 2021', 'February 11, 2021',
             'February 24, 2021', 'March 1, 2021', 'April 21, 2021']})
data['DATE'] = pd.to_datetime(data['DATE'])

grouped_data = data.groupby('USER ID')
for name, group in grouped_data:
    first_visited_date = None
    last_visited_date = None
    for index, row in group.iterrows():
        if row['TYPE'] == 'Visited':
            # Remember the first visit of the current cycle and keep updating the last one.
            if first_visited_date is None:
                first_visited_date = row['DATE']
            last_visited_date = row['DATE']
        elif row['TYPE'] == 'Ordered':
            print(f"User {name}: First visited date = {first_visited_date}, Last visited date = {last_visited_date}")
            # Reset so the next order only reports visits made after this one.
            first_visited_date = None
            last_visited_date = None
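Because the rows are already in date order for each user, no explicit sort is needed. Note that this loop prints nothing for users who never placed an order, such as user 2.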
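As a sketch of how the same loop could return a table shaped like the expected output instead of printing, the following collects one row per order. The helper name `summarize_visits` and the ISO-formatted sample dates are illustrative assumptions. It also assumes that a user with no order gets a single row with Ordered = 0 and an empty MAX DATE (mirroring the expected output), and that visits after a user's final order are ignored.

```python
import pandas as pd


def summarize_visits(events: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: first/last visit date before each order, per user."""
    events = events.assign(DATE=pd.to_datetime(events["DATE"]))
    # Sort explicitly so the function does not rely on the input order.
    events = events.sort_values(["USER ID", "DATE"])

    rows = []
    for user_id, group in events.groupby("USER ID"):
        order_no = 0
        first_visit = last_visit = pd.NaT
        for event_type, date in zip(group["TYPE"], group["DATE"]):
            if event_type == "Visited":
                if pd.isna(first_visit):
                    first_visit = date
                last_visit = date
            elif event_type == "Ordered":
                order_no += 1
                rows.append((user_id, order_no, first_visit, last_visit))
                # Start a fresh cycle for the next order.
                first_visit = last_visit = pd.NaT
        # Assumption: users who never ordered still get a row, with MAX DATE left empty.
        if order_no == 0 and not pd.isna(first_visit):
            rows.append((user_id, 0, first_visit, pd.NaT))

    return pd.DataFrame(rows, columns=["USER ID", "Ordered", "MIN DATE", "MAX DATE"])


# Same events as in the question, written with ISO dates for brevity.
events = pd.DataFrame(
    [
        (1, "Visited", "2020-09-14"), (1, "Visited", "2020-10-04"),
        (1, "Visited", "2020-10-24"), (1, "Ordered", "2020-11-01"),
        (2, "Visited", "2020-09-14"), (2, "Visited", "2020-10-01"),
        (3, "Visited", "2020-09-01"), (3, "Visited", "2020-10-04"),
        (3, "Visited", "2020-10-04"), (3, "Visited", "2020-10-19"),
        (3, "Ordered", "2021-01-01"), (3, "Visited", "2021-02-11"),
        (3, "Visited", "2021-02-24"), (3, "Visited", "2021-03-01"),
        (3, "Ordered", "2021-04-21"),
    ],
    columns=["USER ID", "TYPE", "DATE"],
)
result = summarize_visits(events)
print(result)
```

Row-by-row iteration is easy to follow but slow on large tables; for bigger data a vectorized groupby is usually the better fit.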
Try:
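# df is the question's DataFrame with columns USER ID, TYPE and DATE
df['DATE'] = pd.to_datetime(df['DATE'])
# Reversed cumulative sum: each order shares a group label with the visits that precede it
df_out = df.assign(grp=(df['TYPE'] == 'Ordered')[::-1].cumsum())\
           .set_index(['USER ID', 'grp', 'TYPE'], append=True)['DATE']\
           .unstack('TYPE')\
           .groupby(['USER ID', 'grp'], sort=False)\
           .agg(Ordered=('Ordered','count'),
                MIN_DATE=('Visited','first'),
                MAX_DATE=('Visited','last'))\
           .reset_index('grp', drop=True)\
           .reset_index()
# Blank out MAX_DATE for groups that never reached an order
df_out['MAX_DATE'] = df_out['MAX_DATE'].mask(df_out['Ordered'] == 0)
# Turn the per-group order count into a running order number per user
df_out['Ordered'] = df_out['Ordered'].groupby(df_out['USER ID']).cumsum()
# Format the dates back into the original text style
df_out['MIN_DATE'] = df_out['MIN_DATE'].dt.strftime('%B %d, %Y')
df_out['MAX_DATE'] = df_out['MAX_DATE'].dt.strftime('%B %d, %Y')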
Output:
   USER ID  Ordered            MIN_DATE          MAX_DATE
0        1        1  September 14, 2020  October 24, 2020
1        2        0  September 14, 2020               NaN
2        3        1  September 01, 2020  October 19, 2020
3        3        2   February 11, 2021    March 01, 2021
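The grouping step in this answer relies on a reversed cumulative sum, which can be hard to read inside the chain. A minimal sketch on a made-up sequence of event types (the `types` series is an illustrative assumption, not data from the question) shows how it labels rows:

```python
import pandas as pd

# Made-up event sequence for a single user, purely for illustration.
types = pd.Series(["Visited", "Visited", "Ordered", "Visited", "Ordered", "Visited"])
is_order = types == "Ordered"

# Accumulate from the bottom up, then restore the original row order.
grp = is_order[::-1].cumsum().sort_index()
print(grp.tolist())  # [2, 2, 2, 1, 1, 0]
```

Counting from the end means every order carries the same label as the visits leading up to it, while trailing visits with no later order fall into group 0. In the answer, `assign` aligns the reversed series back to the rows by index, so the explicit `sort_index()` is only needed here for display.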