I have this dataset. When there is not enough data in the 3- or 6-month window, the service registers the entry as NaN.
Dataset:
import pandas as pd

df = pd.DataFrame({'customer':['C1000','C1000','C1000','C1000','C1000','C1000','C1000','C1000','C2000','C2000','C2000','C2000'],
'account': ['A1100','A1100','A1100','A1200','A1200','A1300','A1300','A1300','A2100','A2100','A2100','A2100'],
'month': ['2019-10-01','2019-11-01','2019-12-01','2019-10-01','2019-11-01','2019-10-01','2019-11-01','2019-12-01','2019-09-01','2019-10-01','2019-11-01','2019-12-01'],
'invoice': [34000,55000,80000,90000,55000,10000,10000,20000,45000,78000,55000,80000]
})
The expected result looks like this. Note the NaN values where we do not have 3 months of data.
+--------+-------+--------------------------+
|customer|account|avg_invoices_last_3_months|
+--------+-------+--------------------------+
|C1000   |A1100  |41,333                    |
|C1000   |A1200  |NaN                       |
...
I tried this transformation:
# Count how many monthly rows each (customer, account) pair has
df['cnt_month'] = df.groupby(['customer','account'])['month'].transform('count')
# At this point, both columns receive the NaN value, but the invoice column
# can't be changed
df.loc[df.cnt_month < 4] = 'NaN'
# Here, I need to group the invoice values by customer and account, keeping
# both NaN and numeric results, as in the expected result
df.groupby(['customer','account','month'])['invoice'].mean()
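But this does not produce the expected result.
I think you can assign a real NaN instead of the string 'NaN', and restrict the assignment to the 'invoice' column. Does that match the intent of the question?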
import numpy as np

# Blank out only the invoice column for groups with fewer than 4 rows
df.loc[df.cnt_month < 4, 'invoice'] = np.nan
df.groupby(['customer','account']).agg(avg_invoices_last_3_months=('invoice','mean')).reset_index()
  customer account  avg_invoices_last_3_months
0    C1000   A1100                         NaN
1    C1000   A1200                         NaN
2    C1000   A1300                         NaN
3    C2000   A2100                     64500.0
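If the goal is instead "average of the most recent N months, NaN when fewer than N months exist", a variant that leaves the original invoice column untouched might look like the sketch below. The helper name `avg_last_n_months` and the constant `WINDOW` are made up for illustration, and the sketch assumes one row per account per month with no gaps, so counting rows stands in for counting months. It also uses a threshold of 3 rather than the `< 4` used above, which is my reading of the question and not something the asker confirmed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer': ['C1000'] * 8 + ['C2000'] * 4,
    'account': ['A1100'] * 3 + ['A1200'] * 2 + ['A1300'] * 3 + ['A2100'] * 4,
    'month': ['2019-10-01', '2019-11-01', '2019-12-01',
              '2019-10-01', '2019-11-01',
              '2019-10-01', '2019-11-01', '2019-12-01',
              '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01'],
    'invoice': [34000, 55000, 80000, 90000, 55000, 10000,
                10000, 20000, 45000, 78000, 55000, 80000],
})
df['month'] = pd.to_datetime(df['month'])

WINDOW = 3  # hypothetical window size, in months


def avg_last_n_months(frame, n):
    """Mean invoice over the n most recent rows per account, NaN if fewer than n rows."""
    keys = ['customer', 'account']
    # Keep only the n most recent rows of each (customer, account) group
    recent = frame.sort_values('month').groupby(keys).tail(n)
    # Compute the mean and the number of rows that went into it
    stats = recent.groupby(keys)['invoice'].agg(['mean', 'count']).reset_index()
    col = f'avg_invoices_last_{n}_months'
    # Keep the mean only where the window is full; everything else becomes NaN
    stats[col] = stats['mean'].where(stats['count'] >= n)
    return stats[keys + [col]]


result = avg_last_n_months(df, WINDOW)
print(result)
```

With this data only A1200 comes out as NaN for a 3-month window, since it has two rows. Calling `avg_last_n_months(df, 6)` gives NaN for every account, because none of them has six months of data.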