How to group and filter within groups in pandas

Question

I would like to add an aggregated column to a DataFrame like this one:

import pandas as pd

df = pd.DataFrame({"col1" : [1, 2, 3, 4, 5], "col2" : ["b", "b", "b", "a", "b"], "col3": [6, 2, 11, 1, 3]})

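While aggregating, I want to filter on column values. For each row of the example DataFrame above, I want the mean of the "col1" values taken from the rows that share that row's "col2" value and have a "col3" value lower than that row's "col3".

So for row 0, I need the mean of (2, 5), which is 3.5.

My expected result looks like this: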

pd.DataFrame({"col1" : [1, 2, 3, 4, 5], "col2" : ["b", "b", "b", "a", "b"], "col3": [6, 2, 11, 1, 3], "mean_col" : [3.5, None, 2.666, None, 2.0]})

I can do something like this, but I would rather not iterate over df:

mean_vals = []
for index, row in df.iterrows():
    # ensure same group
    grouped_df = df[df.col2 == row["col2"]]

    # apply condition
    grouped_and_filtered_df = grouped_df[grouped_df.col3 < row["col3"]]

    # aggregation (NaN when no row passes the filter)
    row_mean = grouped_and_filtered_df.col1.mean()

    mean_vals.append(row_mean)

# attach the per-row means as a new column without modifying df
out = df.assign(mean_col=mean_vals)

Answer 1

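You can pivot_table your data, aggregating with sum and count, sort by "col3" (the index after pivoting, hence sort_index), then apply cumsum and shift to get the sum and count of all values with a lower "col3". You can then compute the mean by dividing sum by count, stack the result, and merge it back onto the original data: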

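# Per-"col3" sum and count of "col1" for each "col2" group, then running
# totals shifted down one row so each row only sees lower "col3" values
tmp = (df.pivot_table(index='col3', columns='col2',
                      values='col1', aggfunc=['sum', 'count'])
         .sort_index().cumsum().shift()
      )

# mean = sum / count; stack to long format and merge on the shared
# columns ("col2" and "col3")
out = df.merge(tmp['sum'].div(tmp['count']).stack()
                         .reset_index(name='mean_col'),
               how='left')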

Output:

   col1 col2  col3  mean_col
0     1    b     6  3.500000
1     2    b     2       NaN
2     3    b    11  2.666667
3     4    a     1       NaN
4     5    b     3  2.000000
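
As a cross-check on the output above, here is a minimal sketch of a different technique (not the pivot approach from this answer): a self-merge on "col2" that pairs every row with the rows of its own group, keeps the pairs where the other row's "col3" is strictly lower, and averages per original row. The names pairs and lower and the "_other" suffix are my own choices for illustration. It builds every within-group pair, so memory grows quadratically with group size, which probably makes it sensible only for small groups.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5],
                   "col2": ["b", "b", "b", "a", "b"],
                   "col3": [6, 2, 11, 1, 3]})

# Pair each row with every row of the same "col2" group.
# reset_index() keeps the original row label in a column named "index".
pairs = df.reset_index().merge(df, on="col2", suffixes=("", "_other"))

# Keep only the pairs where the other row has a strictly lower "col3".
lower = pairs[pairs["col3_other"] < pairs["col3"]]

# Average the other rows' "col1" per original row label.
mean_col = lower.groupby("index")["col1_other"].mean()

# assign() aligns on the index, so rows with no lower neighbour stay NaN.
out = df.assign(mean_col=mean_col)
print(out)
```

On the sample data this should agree with the table above, including NaN for rows 1 and 3.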

Intermediate tmp:

      sum      count     
col2    a    b     a    b
col3                     
1     NaN  NaN   NaN  NaN
2     4.0  NaN   1.0  NaN
3     NaN  2.0   NaN  1.0
6     NaN  7.0   NaN  2.0
11    NaN  8.0   NaN  3.0
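
One thing worth checking on your own data: in tmp, cells where a group has no row at a given "col3" are NaN, and cumsum leaves those cells as NaN. After shift, a row therefore only sees the running totals if the immediately preceding "col3" value also occurs in its own group. That happens to hold for the sample data. The sketch below is my own variation, not part of the original answer: it adds ffill() between cumsum() and shift() so the running totals carry across the gaps. The helper name lower_mean, the fill_gaps flag, and the df2 sample are made up for illustration.

```python
import pandas as pd

def lower_mean(df, fill_gaps):
    """Mean of "col1" over same-"col2" rows with a strictly lower "col3"."""
    tmp = (df.pivot_table(index="col3", columns="col2",
                          values="col1", aggfunc=["sum", "count"])
             .sort_index().cumsum())
    if fill_gaps:
        # Carry running totals across "col3" values missing from a group.
        tmp = tmp.ffill()
    tmp = tmp.shift()
    means = (tmp["sum"].div(tmp["count"]).stack()
                       .reset_index(name="mean_col"))
    return df.merge(means, how="left")

# Hypothetical data where groups "a" and "b" interleave along "col3".
df2 = pd.DataFrame({"col1": [1, 2, 3, 4],
                    "col2": ["a", "b", "b", "a"],
                    "col3": [1, 2, 3, 4]})

plain = lower_mean(df2, fill_gaps=False)
filled = lower_mean(df2, fill_gaps=True)
print(plain)
print(filled)
```

For the last row (group "a", "col3" 4) the only lower row in its group has "col1" equal to 1, so the expected mean is 1.0. The version without ffill() returns NaN there because the preceding "col3" value, 3, has no "a" row.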
