Set NaN when there is not enough data for the mean

Question (votes: 0, answers: 1)

I have this dataset, and I need the entry to be recorded as NaN when there is not enough data in the 3 or 6 months window.

Dataset:

import pandas as pd

df = pd.DataFrame({'customer':['C1000','C1000','C1000','C1000','C1000','C1000','C1000','C1000','C2000','C2000','C2000','C2000'],
                    'account': ['A1100','A1100','A1100','A1200','A1200','A1300','A1300','A1300','A2100','A2100','A2100','A2100'],
                    'month':   ['2019-10-01','2019-11-01','2019-12-01','2019-10-01','2019-11-01','2019-10-01','2019-11-01','2019-12-01','2019-09-01','2019-10-01','2019-11-01','2019-12-01'],
                    'invoice': [34000,55000,80000,90000,55000,10000,10000,20000,45000,78000,55000,80000]
                  })

The expected result looks like this. Note the NaN value where we do not have 3 months of data.

+--------+-------+--------------------------+
|customer|account|avg_invoices_last_3_months|
+--------+-------+--------------------------+
|C1000   |A1100  |41,333                    |
|C1000   |A1200  |NaN                       |
...

I tried this transformation:

# Count how many monthly rows each customer/account has
df['cnt_month'] = df.groupby(['customer','account']).transform('count')
# At this point, both columns receive the NaN value, but the invoice column
# can't be changed
df.loc[df.cnt_month < 4] = 'NaN'
# Here, I need to group by customer and account, with invoice values that are
# NaN or numeric, as in the expected result
df.groupby(['customer','account','month'])['invoice'].mean()

But this does not produce the expected result.
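For reference, the per-account row count can be computed by selecting a single column before calling `transform`, so that the result is one Series that fits into one new column. This is a minimal sketch on a reduced copy of the data above; the name `small` is only illustrative and not part of the original question.

```python
import pandas as pd

# Reduced, illustrative copy of the dataset: one account with 3 months, one with 2
small = pd.DataFrame({
    'customer': ['C1000', 'C1000', 'C1000', 'C1000', 'C1000'],
    'account':  ['A1100', 'A1100', 'A1100', 'A1200', 'A1200'],
    'month':    ['2019-10-01', '2019-11-01', '2019-12-01', '2019-10-01', '2019-11-01'],
    'invoice':  [34000, 55000, 80000, 90000, 55000],
})

# Selecting one column first makes transform return a single Series,
# aligned row by row with the original frame
small['cnt_month'] = small.groupby(['customer', 'account'])['month'].transform('count')
print(small[['account', 'cnt_month']])
```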

python pandas pyspark window-functions moving-average
1 answer

Votes: 1

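I think you can assign np.nan instead of the string 'NaN', and restrict the assignment to the 'invoice' column. Does that match the intent of the question?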

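import numpy as np
df.loc[df.cnt_month < 4, 'invoice'] = np.nan
# Named aggregation is applied to the grouped DataFrame, not to the selected 'invoice' column
df.groupby(['customer','account']).agg(avg_invoices_last_3_months=('invoice','mean')).reset_index()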

  customer account  avg_invoices_last_3_months
0    C1000   A1100                         NaN
1    C1000   A1200                         NaN
2    C1000   A1300                         NaN
3    C2000   A2100                     64500.0
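
Note that the output above uses a threshold of 4 rows, so A1100, which has 3 months of data, also becomes NaN, whereas the expected result in the question shows a value for it. A minimal sketch of one alternative follows, assuming the rule is "at least `min_months` rows per account". The helper `avg_invoice_if_enough` and the column names `avg_invoice` and `n_months` are hypothetical. It aggregates the mean and the count together and then masks the mean with `where`, without modifying the original frame. Like the answer above, it averages all rows of an account rather than restricting them to a date window.

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['C1000'] * 8 + ['C2000'] * 4,
    'account':  ['A1100', 'A1100', 'A1100', 'A1200', 'A1200',
                 'A1300', 'A1300', 'A1300', 'A2100', 'A2100', 'A2100', 'A2100'],
    'month':    ['2019-10-01', '2019-11-01', '2019-12-01', '2019-10-01', '2019-11-01',
                 '2019-10-01', '2019-11-01', '2019-12-01',
                 '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01'],
    'invoice':  [34000, 55000, 80000, 90000, 55000, 10000, 10000, 20000,
                 45000, 78000, 55000, 80000],
})

def avg_invoice_if_enough(frame, min_months=3):
    """Mean invoice per (customer, account); NaN when fewer than min_months rows.

    Assumption: every row of an account counts, with no date-window filtering.
    """
    out = (frame.groupby(['customer', 'account'])
                .agg(avg_invoice=('invoice', 'mean'),
                     n_months=('month', 'count'))
                .reset_index())
    # Keep the mean only where the account has enough rows; otherwise NaN
    out['avg_invoice'] = out['avg_invoice'].where(out['n_months'] >= min_months)
    return out

result = avg_invoice_if_enough(df, min_months=3)
print(result)
```

Calling the helper with `min_months=6` would apply the same rule to the 6-month case mentioned in the question.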