Please bear with my silly, convoluted toy example; I really did try to boil a very complicated problem down to a simple version.
Suppose I have a dataframe data and a function arbFun, given by:
import pandas as pd

data = pd.DataFrame({'Label1': [1, 2, 2, 1, 1, 2],
                     'Label2': ['north', 'north', 'north', 'south', 'south', 'south'],
                     'A': [2, 4, 6, 8, 10, 12],
                     'B': [4, 1, 37, 1, 1, 1]})
def arbFun(col1, col2):
    """
    Calculates the average value of col1 for all values of (col2 == 1).

    Args:
        col1: A pandas Series containing the values to be averaged.
        col2: A pandas Series containing the filter conditions.

    Returns:
        The average value of col1 for all values of (col2 == 1).
    """
    df = pd.DataFrame({'Col1': col1, 'Col2': col2})
    # Filter the data based on the condition
    filtered_data = df[df['Col2'] == 1]
    # Calculate the average
    if len(filtered_data) > 0:
        average = filtered_data['Col1'].mean()
    else:
        average = None  # Return None if no data meets the condition
    return average
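If I run arbFun on the whole dataset, I get 8.5. So far, so good. But now I want to group by the two label columns and output the result of arbFun along with some additional information.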
output = data.groupby(['Label1', 'Label2']).agg(
    Column_A=('A', 'sum'),
    Filtered_Mean=lambda x: arbFun(x['A'], x['B'])
)
This should output something like:
Label1,Label2,Column_A,Filtered_Mean
1,'north',2,None
2,'north',10,4
1,'south',18,9
2,'south',12,12
However, I get a TypeError:
TypeError: Must provide 'func' or tuples of '(column, aggfunc).
I have tried to work out where this comes from, but so far without success. What am I doing wrong?
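For context on the message itself: as far as I understand pandas' named aggregation, every keyword passed to .agg has to be a (column, aggfunc) tuple (or a pd.NamedAgg), and each aggfunc only ever receives a single column as a Series. A bare lambda as a keyword value fails that check before it is ever called. A minimal sketch reproducing this on the same data (the names grouped, message and ok are mine, purely for illustration):

```python
import pandas as pd

data = pd.DataFrame({
    'Label1': [1, 2, 2, 1, 1, 2],
    'Label2': ['north', 'north', 'north', 'south', 'south', 'south'],
    'A': [2, 4, 6, 8, 10, 12],
    'B': [4, 1, 37, 1, 1, 1],
})
grouped = data.groupby(['Label1', 'Label2'])

# Mixing a (column, aggfunc) tuple with a bare callable is rejected up front.
message = None
try:
    grouped.agg(
        Column_A=('A', 'sum'),
        Filtered_Mean=lambda x: x['A'].mean(),  # not a tuple, so agg refuses it
    )
except TypeError as err:
    message = str(err)

print(message)

# The tuple form is accepted, but the function sees only column 'A', never 'B'.
ok = grouped.agg(Column_A=('A', 'sum'), Mean_A=('A', 'mean'))
print(ok)
```

So even with the tuple syntax, a function that needs both A and B cannot be expressed as a single named aggregation.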
Code
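# Sum of A per group
g1 = data.groupby(['Label1', 'Label2'])['A']
# Same grouping, restricted to rows where B == 1
g2 = data[data['B'].eq(1)].groupby(['Label1', 'Label2'])['A']
# assign aligns on the group index; groups with no B == 1 rows get NaN
out = g1.sum().to_frame(name='Column_A').assign(Filtered_Mean=g2.mean()).reset_index()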
Output
   Label1 Label2  Column_A  Filtered_Mean
0       1  north         2            NaN
1       1  south        18            9.0
2       2  north        10            4.0
3       2  south        12           12.0
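If the real arbFun is too involved to rewrite as a filter plus a built-in aggregation, another option is groupby(...).apply, which hands each group to the function as a DataFrame so that both columns are available. A minimal sketch under that assumption; the helper summarize and the result name out_apply are hypothetical names of mine, not from the question:

```python
import pandas as pd

data = pd.DataFrame({
    'Label1': [1, 2, 2, 1, 1, 2],
    'Label2': ['north', 'north', 'north', 'south', 'south', 'south'],
    'A': [2, 4, 6, 8, 10, 12],
    'B': [4, 1, 37, 1, 1, 1],
})


def arbFun(col1, col2):
    """Average of col1 over the rows where col2 == 1, or None if there are none."""
    df = pd.DataFrame({'Col1': col1, 'Col2': col2})
    filtered_data = df[df['Col2'] == 1]
    if len(filtered_data) > 0:
        return filtered_data['Col1'].mean()
    return None


def summarize(group):
    # Hypothetical helper: builds one output row per group.
    # `group` is a DataFrame holding the selected columns A and B.
    return pd.Series({
        'Column_A': group['A'].sum(),
        'Filtered_Mean': arbFun(group['A'], group['B']),
    })


out_apply = (
    data.groupby(['Label1', 'Label2'])[['A', 'B']]  # select only the columns needed
    .apply(summarize)
    .reset_index()
)
print(out_apply)
```

This calls Python code once per group, so it will likely be slower than the filter-then-group version above on large data; it mainly helps when the function cannot be vectorised.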