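Is there a way to do a pandas groupby aggregation on a DataFrame and also return a string from one of the columns? I have a DataFrame like this: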
import numpy as np
import pandas as pd

lst = [[100, 'buicks', .001, np.nan, np.nan], [101, 'chevy', .002, np.nan, np.nan],
       [102, 'olds', .003, .006, np.nan], [100, 'buick', np.nan, .004, np.nan],
       [103, 'nissan', np.nan, np.nan, .1], [103, 'nissans', np.nan, .14, np.nan]]
df = pd.DataFrame(lst, columns=['car_id', 'name', 'aa', 'bb', 'cc'])
   car_id     name     aa     bb   cc
0     100   buicks  0.001    NaN  NaN
1     101    chevy  0.002    NaN  NaN
2     102     olds  0.003  0.006  NaN
3     100    buick    NaN  0.004  NaN
4     103   nissan    NaN    NaN  0.1
5     103  nissans    NaN  0.140  NaN
I need this:
0     100   buicks  0.001  0.004  NaN
1     101    chevy  0.002    NaN  NaN
2     102     olds  0.003  0.006  NaN
4     103  nissans    NaN  0.140  0.1
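What I want to do is group by the car_id column and then sum the aa, bb, and cc columns. The values in the name column may differ within a group, but I need to keep one of them, and I don't care which. I was looking at "Pandas sum by groupby, but exclude certain columns" and came up with this: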
df.groupby('car_id').agg({'aa': np.sum, 'bb': np.sum, 'cc':np.sum})
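But this drops the name column. I assume I could add the name column to the statement above and apply some operation there that returns a string.
Thanks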
Yes, you can use 'first' on the name column:
df.groupby('car_id').agg({'name':'first',
'aa':'sum',
'bb':'sum',
'cc':'sum'})
Output:
          name     aa     bb   cc
car_id
100     buicks  0.001  0.004  0.0
101      chevy  0.002  0.000  0.0
102       olds  0.003  0.006  0.0
103     nissan  0.000  0.140  0.1
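Note that the all-NaN groups come out as 0.0 here rather than NaN. A minimal sketch of why, using a throwaway Series (the variable names are only illustrative): by default sum skips NaN values, so a group with no valid values sums to zero, while the min_count parameter asks for NaN when fewer than that many valid values are present.

```python
import numpy as np
import pandas as pd

# A column slice where every value is missing, like 'cc' for car_id 100.
all_nan = pd.Series([np.nan, np.nan])

# Default behaviour: NaN values are skipped, so the sum of nothing is 0.0.
default_total = all_nan.sum()

# With min_count=1 the result is NaN unless at least one valid value exists.
strict_total = all_nan.sum(min_count=1)

print(default_total, strict_total)
```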
To match your expected output (NaN is kept where a group has no values):
s = df.groupby(['car_id'])[['aa', 'bb', 'cc']].sum(min_count=1)
s['name'] = df.drop_duplicates('car_id').set_index('car_id').name
s
Out[185]:
           aa     bb   cc    name
car_id
100     0.001  0.004  NaN  buicks
101     0.002    NaN  NaN   chevy
102     0.003  0.006  NaN    olds
103       NaN  0.140  0.1  nissan
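The two ideas above can probably be combined into a single agg call. This is only a sketch, not something taken from either answer: it assumes a pandas version with named aggregation (0.25 or later), and the helper name nan_sum is made up for illustration.

```python
import numpy as np
import pandas as pd

lst = [[100, 'buicks', .001, np.nan, np.nan], [101, 'chevy', .002, np.nan, np.nan],
       [102, 'olds', .003, .006, np.nan], [100, 'buick', np.nan, .004, np.nan],
       [103, 'nissan', np.nan, np.nan, .1], [103, 'nissans', np.nan, .14, np.nan]]
df = pd.DataFrame(lst, columns=['car_id', 'name', 'aa', 'bb', 'cc'])


def nan_sum(values):
    """Hypothetical helper: sum a group, returning NaN if it has no valid values."""
    return values.sum(min_count=1)


# Keep the first name seen in each group and sum the numeric columns,
# leaving NaN where a group has no values at all.
result = df.groupby('car_id').agg(
    name=('name', 'first'),
    aa=('aa', nan_sum),
    bb=('bb', nan_sum),
    cc=('cc', nan_sum),
)
print(result)
```

Passing a Python function is likely slower than the built-in 'sum' on large frames, so the two-step version above may still be preferable when performance matters.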