A simple way to group by with multiple outputs in pandas

I'm a long-time SAS/SQL user and have always defaulted to SQL for my group-bys.

For example, to do something like this:

select region
,case when age < 5 then 'Low'
when age >= 5 and age <= 10 then 'Middle'
else 'High' end as duration
,sum(1) as total
,sum(profit) as profit
,sum(profit)/sum(1) as avg_profit
,max(revenue) as max_revenue
from table
where region not in ('A')
group by
region,(case when age < 5 then 'Low'
when age >= 5 and age <= 10 then 'Middle'
else 'High' end)

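I'm trying to recreate the above in pandas, but I can't work out how to write it with as little code as possible. Can anyone suggest an efficient way to write this in pandas that doesn't involve five merges and creating new columns beforehand?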

python pandas pyspark
1 Answer

You can use aggregation functions together with groupby and np.where. Try this:

import pandas as pd
import numpy as np

# Assuming you have a DataFrame named 'df' with columns 'region', 'age', 'profit', and 'revenue'
condition1 = df['age'] < 5
condition2 = (df['age'] >= 5) & (df['age'] <= 10)

# Create the 'duration' column using np.where
df['duration'] = np.where(condition1, 'Low',
                          np.where(condition2, 'Middle', 'High'))

# Filter out the 'A' region
df = df[df['region'] != 'A']

# Group by 'region' and 'duration', and apply aggregations
result = df.groupby(['region', 'duration']).agg(
    total=('age', 'size'),          # row count, like SQL sum(1); 'count' would skip null ages
    profit=('profit', 'sum'),
    avg_profit=('profit', 'mean'),  # divides by non-null profits, not by the row count
    max_revenue=('revenue', 'max')
)

print(result)
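
If you would rather not add a column to df or overwrite it, the same steps can probably be chained into one expression. The following is only a sketch under assumptions: the sample DataFrame is made up for illustration, np.select stands in for the nested np.where, and avg_profit is computed as profit / total so that it mirrors sum(profit)/sum(1) from the SQL even when some profit values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "region":  ["A", "B", "B", "B", "C", "C"],
    "age":     [3, 2, 7, 12, 4, 4],
    "profit":  [10.0, 20.0, 30.0, 40.0, 50.0, 70.0],
    "revenue": [100, 200, 300, 400, 500, 700],
})

result = (
    df[df["region"] != "A"]                        # WHERE region not in ('A')
    .assign(duration=lambda d: np.select(          # CASE WHEN ... END
        [d["age"] < 5, d["age"].between(5, 10)],   # between() is inclusive on both ends
        ["Low", "Middle"],
        default="High",
    ))
    .groupby(["region", "duration"])
    .agg(
        total=("profit", "size"),                  # row count per group, like sum(1)
        profit=("profit", "sum"),
        max_revenue=("revenue", "max"),
    )
    .reset_index()                                 # turn the group keys back into columns
    .assign(avg_profit=lambda d: d["profit"] / d["total"])
)

print(result)
```

Because assign returns a new DataFrame, the original df keeps all of its rows and gets no duration column; the bucket only exists inside the chain.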