I'm a long-time SAS/SQL user and have always defaulted to SQL for my group-bys.
For example, to do:
select region
      ,case when age < 5 then 'Low'
            when age >= 5 and age <= 10 then 'Middle'
            else 'High' end as duration
      ,sum(1) as total
      ,sum(profit) as profit
      ,sum(profit)/sum(1) as avg_profit
      ,max(revenue) as max_revenue
from table
where region not in ('A')
group by
    region
   ,(case when age < 5 then 'Low'
          when age >= 5 and age <= 10 then 'Middle'
          else 'High' end)
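I'm trying to recreate the above in Pandas, but I can't work out how to write it with as little code as possible. Can anyone suggest an efficient way to write this in Pandas that doesn't involve 5 merges and creating new columns beforehand?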
You can use aggregation functions together with groupby and np.where. Try this:
import pandas as pd
import numpy as np

# Assuming you have a DataFrame named 'df' with columns 'region', 'age', 'profit', and 'revenue'
condition1 = df['age'] < 5
condition2 = (df['age'] >= 5) & (df['age'] <= 10)

# Create the 'duration' column using np.where (the SQL CASE expression)
df['duration'] = np.where(condition1, 'Low',
                          np.where(condition2, 'Middle', 'High'))

# Filter out the 'A' region (the SQL WHERE clause)
df = df[df['region'] != 'A']

# Group by 'region' and 'duration', and apply aggregations
result = df.groupby(['region', 'duration']).agg(
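    total=('age', 'size'),          # row count like sum(1); 'count' would skip null ages
    profit=('profit', 'sum'),
    avg_profit=('profit', 'mean'),  # mean ignores null profits, unlike sum(profit)/sum(1)
    max_revenue=('revenue', 'max'),
)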
print(result)
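Since the question asked to avoid creating a new column first, here is a minimal sketch of one possible variation: the CASE expression is built as a standalone Series and passed to groupby directly as a key, so df itself is never modified. The sample data is made up for illustration, and np.select is swapped in for the nested np.where calls as a matter of taste, not because the approach above is wrong.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "region": ["A", "B", "B", "B", "C", "C"],
    "age": [3, 2, 7, 4, 12, 15],
    "profit": [10.0, 20.0, 30.0, 40.0, 50.0, 70.0],
    "revenue": [100, 200, 300, 250, 500, 400],
})

# WHERE region NOT IN ('A')
kept = df[~df["region"].isin(["A"])]

# The CASE expression as a standalone Series; it is never stored in df.
duration = pd.Series(
    np.select(
        [kept["age"] < 5, kept["age"].between(5, 10)],  # between is inclusive
        ["Low", "Middle"],
        default="High",
    ),
    index=kept.index,  # keep the index aligned with the filtered rows
    name="duration",   # becomes the name of the grouping level
)

result = (
    kept.groupby(["region", duration])
    .agg(
        total=("profit", "size"),       # row count, like sum(1)
        profit=("profit", "sum"),
        avg_profit=("profit", "mean"),
        max_revenue=("revenue", "max"),
    )
    .reset_index()  # flat columns, like a SQL result set
)
print(result)
```

The groupby accepts a mix of column names and Series of the same length, which is what lets the derived key stay out of the DataFrame. The reset_index call is optional; without it, region and duration form a MultiIndex.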