I have a DataFrame generated as follows:
df = pd.DataFrame({'date' : [*['2020-01-01']*3, *['2020-01-02']*3, *['2020-01-03']*3],
'id' : ['A1', 'A2', 'A3']*3,
'qty' : [50, 10, 20, 40, 10, 20, 40, 15, 25]
}).sort_values('date')
I want to add a new column, "delta", holding the difference in qty for each id between consecutive dates. I thought
df['delta'] = df.groupby(['date', 'id'])['qty'].transform(lambda x: x.diff()).sort_index()
would work, but I get:
date id qty delta
0 2020-01-01 A1 50 NaN
1 2020-01-01 A2 10 NaN
2 2020-01-01 A3 20 NaN
3 2020-01-02 A1 40 NaN
4 2020-01-02 A2 10 NaN
5 2020-01-02 A3 20 NaN
6 2020-01-03 A1 40 NaN
7 2020-01-03 A2 15 NaN
8 2020-01-03 A3 25 NaN
whereas I expected:
date id qty delta
0 2020-01-01 A1 50 NaN
1 2020-01-01 A2 10 NaN
2 2020-01-01 A3 20 NaN
3 2020-01-02 A1 40 -10
4 2020-01-02 A2 10 0
5 2020-01-02 A3 20 0
6 2020-01-03 A1 40 0
7 2020-01-03 A2 15 5
8 2020-01-03 A3 25 5
Any suggestions?
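The problem with your approach is the grouping key, not transform itself. Every date/id pair occurs exactly once in your data, so grouping by ['date', 'id'] puts each row in its own group. diff then has no previous row to subtract from inside any group and returns NaN everywhere. To get the result you want, group by id only and call diff, which gives the change from the previous date for each id. Note that the fillna(0) at the end replaces the NaN on each id's first date with 0; drop it if you want NaN there, as in your expected output.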
import pandas as pd
df = pd.DataFrame({
'date': [*['2020-01-01']*3, *['2020-01-02']*3, *['2020-01-03']*3],
'id': ['A1', 'A2', 'A3']*3,
'qty': [50, 10, 20, 40, 10, 20, 40, 15, 25]
}).sort_values('date')
df['variation'] = df.groupby('id')['qty'].diff().fillna(0)
which gives:
date id qty variation
0 2020-01-01 A1 50 0.0
1 2020-01-01 A2 10 0.0
2 2020-01-01 A3 20 0.0
3 2020-01-02 A1 40 -10.0
4 2020-01-02 A2 10 0.0
5 2020-01-02 A3 20 0.0
6 2020-01-03 A1 40 0.0
7 2020-01-03 A2 15 5.0
8 2020-01-03 A3 25 5.0
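If it helps, here is a small sketch that checks why the original grouping produced only NaN, and a variant that keeps NaN on each id's first date to match the expected output in the question. The names sizes and all_singletons are just illustrative, and sorting by ['id', 'date'] assumes the dates are ISO-formatted strings (or real datetimes) so that they sort chronologically.

```python
import pandas as pd

df = pd.DataFrame({
    'date': [*['2020-01-01']*3, *['2020-01-02']*3, *['2020-01-03']*3],
    'id': ['A1', 'A2', 'A3']*3,
    'qty': [50, 10, 20, 40, 10, 20, 40, 15, 25],
})

# Each (date, id) pair occurs once, so every group holds a single row
# and diff() has no previous row to subtract from.
sizes = df.groupby(['date', 'id']).size()
all_singletons = bool((sizes == 1).all())

# Put each id's rows in date order, diff within each id,
# then restore the original row order via the index.
df = df.sort_values(['id', 'date'])
df['variation'] = df.groupby('id')['qty'].diff()
df = df.sort_index()

print(all_singletons)
print(df)
```

Sorting by id and date before the diff is a precaution: diff works on row order within each group, so the result is only a day-over-day change if each id's rows are already in chronological order.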