Say I have a pandas DataFrame like the following:
import pandas as pd

df = pd.DataFrame({'val': [30, 40, 50, 60, 70, 80, 90], 'idx': [9, 8, 7, 6, 5, 4, 3],
                   'category': ['a', 'a', 'b', 'b', 'c', 'c', 'c']}).set_index('idx')
Output:
     val category
idx
9     30        a
8     40        a
7     50        b
6     60        b
5     70        c
4     80        c
3     90        c
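I want to add a new column holding the difference between each 'val' and the last 'val' of the previous category. Rows in the first category have no previous category, so they get NaN. The result should look like this: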
    category  diff  val
idx
9          a   NaN   30
8          a   NaN   40
7          b    10   50
6          b    20   60
5          c    10   70
4          c    20   80
3          c    30   90
This is what I currently do:
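# Last 'val' per category, shifted down so each category sees the previous category's value
temp_df = df.groupby('category')['val'].agg('last').rename('lastVal').shift()
# Join on the 'category' column against temp_df's index
df = df.merge(temp_df, left_on='category', right_index=True, how='outer')
df['diff'] = df['val'] - df['lastVal']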
But it is slow. Is there a better way?
This is roughly twice as fast:
%%timeit
maxdf = df.groupby('category')['val'].last().shift()
df['diff'] = df['val'] - df['category'].map(maxdf.to_dict())
1.33 ms ± 20.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vs. your version:
%%timeit
temp_df = df.groupby('category')['val'].agg('last').rename('lastVal').shift()
df2 = df.merge(temp_df, left_on='category', how='outer', right_index=True)
df2['diff'] = df2['val'] - df2['lastVal']
2.79 ms ± 83.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
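For anyone who wants to run the faster approach outside IPython, here is a minimal self-contained sketch. It makes two small changes that are my own assumptions rather than part of the timed code above: the name prev_last is just illustrative, and sort=False is passed to groupby so that "previous category" means the one that appears earlier in the frame rather than the one that sorts earlier alphabetically (the two coincide for this sample data). The Series is also passed to map directly, without to_dict().

```python
import pandas as pd

df = pd.DataFrame({'val': [30, 40, 50, 60, 70, 80, 90],
                   'idx': [9, 8, 7, 6, 5, 4, 3],
                   'category': ['a', 'a', 'b', 'b', 'c', 'c', 'c']}).set_index('idx')

# Last 'val' of each category, in order of first appearance (sort=False),
# then shifted by one so every category is paired with the previous one.
# The first category has no predecessor and gets NaN.
prev_last = df.groupby('category', sort=False)['val'].last().shift()

# Series.map looks each row's category up in prev_last's index.
df['diff'] = df['val'] - df['category'].map(prev_last)

print(df)
```

The 'diff' column comes out as float because of the NaN entries for the first category. If the categories in the real data are not laid out in contiguous blocks, "last value of the previous category" is ambiguous and this sketch would need rethinking.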