I have 2 separate dataframes, df1 and df_attr. df1 holds hierarchical data, where df1.manager_01_email is at the top of the hierarchy and df1.manager_04_email is the lowest level.
import pandas as pd
import numpy as np
data1 = [["quarter","employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email"],
[["y2022q1",1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"],
["y2022q1",1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]"],
["y2022q2",1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]"],
["y2022q2",1011, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]"],
["y2022q2",1012, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"]]]
data_attr = [["quarter","supervisor_email", "attrition_rate"],
[ ["y2022q1","[email protected]", 0.3], ["y2022q2","[email protected]", 0.6],["y2022q1","[email protected]", 0.25],
["y2022q2","[email protected]", 0.1],["y2022q1","[email protected]", 0.7],["y2022q2","[email protected]", 0.35],
["y2022q2","[email protected]", np.nan],["y2022q1","[email protected]", 0.1],["y2022q2","[email protected]", 0.8]]]
df1 = pd.DataFrame(data=data1[1], columns=data1[0])
df_attr = pd.DataFrame(data=data_attr[1], columns=data_attr[0])
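I need the "attrition_rate" column from df_attr, but because it contains NaN values and some supervisors have no attrition_rate at all, I need to write a function that handles those cases.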
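For example, for employee_id = 1012, his supervisor_email "[email protected]" has a NaN value in df_attr.attrition_rate for y2022q2. I need to fill it with the attrition_rate of a manager higher up in the hierarchy, but because "[email protected]" appears in both manager_04_email and manager_03_email, we have to move up to the next level, which is manager_02_email, "[email protected]", for y2022q2.
Likewise, for employee_id = 1012, his supervisor_email "[email protected]" cannot be found in df_attr, so his attrition_rate is NaN after the merge. I need to fill it with the attrition_rate of a manager higher up in the hierarchy, in this case "[email protected]", for y2022q1.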
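The fallback rule described above can be sketched for a single row in plain Python. This is only an illustration under stated assumptions: the addresses, the `rates` dictionary, and the helper name `first_known_rate` are hypothetical and do not come from the data in the question.

```python
import math

# Hypothetical rates keyed by (quarter, email); NaN marks a missing rate.
rates = {
    ("q1", "sup@example.com"): float("nan"),
    ("q1", "m3@example.com"): 0.25,
    ("q1", "m1@example.com"): 0.10,
}

# Lookup order: the supervisor first, then manager_04 (lowest) up to manager_01 (top).
LEVELS = ["supervisor_email", "manager_04_email", "manager_03_email",
          "manager_02_email", "manager_01_email"]

def first_known_rate(row, rates):
    """Return the first non-NaN rate found while walking up the hierarchy."""
    for col in LEVELS:
        rate = rates.get((row["quarter"], row[col]))
        # Treat "no entry" and "entry is NaN" the same way: keep climbing.
        if rate is not None and not math.isnan(rate):
            return rate
    return float("nan")

# The supervisor also appears as manager_04, so the walk reaches manager_03.
row = {"quarter": "q1", "supervisor_email": "sup@example.com",
       "manager_04_email": "sup@example.com", "manager_03_email": "m3@example.com",
       "manager_02_email": "m2@example.com", "manager_01_email": "m1@example.com"}
print(first_known_rate(row, rates))  # 0.25
```

Keying the lookup on both quarter and email keeps a rate from one quarter from being used for another.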
The result I want to achieve is:
outcome = [["quarter","employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email", "mgr_attrition_rate"],
[["y2022q1",1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.1],
["y2022q1",1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]", 0.25],
["y2022q2",1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.1],
["y2022q2",1011, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.35],
["y2022q2",1012, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.6]]]
df_outcome = pd.DataFrame(data=outcome[1], columns=outcome[0])
I have tried the following based on suggestions but still cannot get it to work. Any help is appreciated, thank you.
# Merge dataframes based on quarter and supervisor_email
df_outcome = pd.merge(df1, df_attr, how='left',left_on=['quarter','supervisor_email'],right_on=['quarter','supervisor_email'])
#create function to handle NaNs
def fill_nan_with_higher_manager(row):
    if pd.isna(row['attrition_rate']):
        for i in range(4, 0, -1):
            higher_manager_email = row[f'manager_0{i}_email']
            if pd.notna(higher_manager_email):
                higher_manager_attrition = df_attr.loc[(df_attr["supervisor_email"] == higher_manager_email) & (df_attr["quarter"] == higher_manager_quarter), 'attrition_rate']
                if not higher_manager_attrition.empty:
                    return higher_manager_attrition.values[0]
    return row['attrition_rate']
df_outcome['attrition_rate'] = df_outcome.apply(fill_nan_with_higher_manager, axis=1)
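You're almost there. You need to handle the cases where NaN can occur, so I created a function that deals with them. If you prefer, you can of course move it outside the merge_attrition_rates function.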
import pandas as pd
import numpy as np
data1 = [["employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email"],
[[1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"],
[1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]"],
[1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]"],
[1014, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]"],
[1015, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"]]]
data_attr = [["supervisor_email", "attrition_rate"],
[ ["[email protected]", 0.3], ["[email protected]", 0.25],["[email protected]", 0.35],
["[email protected]", np.nan],["[email protected]", 0.1]]]
df1 = pd.DataFrame(data=data1[1], columns=data1[0])
df_attr = pd.DataFrame(data=data_attr[1], columns=data_attr[0])
def merge_attrition_rates(df1, df_attr):
    """
    Merges the 'attrition_rate' column from df_attr into df1 based on supervisor_email.
    Handles NaN values and missing supervisor emails.

    Args:
        df1 (pd.DataFrame): Hierarchy dataframe with columns 'employee_id', 'supervisor_email', and others.
        df_attr (pd.DataFrame): Attributes dataframe with columns 'supervisor_email' and 'attrition_rate'.

    Returns:
        pd.DataFrame: Merged dataframe with the 'attrition_rate' column added to df1.
    """
    # Left merge keeps every row of df1; unmatched supervisors get NaN.
    df_outcome = pd.merge(df1, df_attr, on='supervisor_email', how='left', suffixes=('', '_supervisor'))

    def fill_nan_with_higher_manager(row):
        # Only rows without a rate need a fallback.
        if pd.isna(row['attrition_rate']):
            # Walk from manager_04 (lowest) up to manager_01 (top).
            for i in range(4, 0, -1):
                manager_email = row[f'manager_0{i}_email']
                if pd.notna(manager_email):
                    manager_attrition = df_attr.loc[df_attr['supervisor_email'] == manager_email, 'attrition_rate']
                    if not manager_attrition.empty:
                        return manager_attrition.values[0]
        return row['attrition_rate']

    df_outcome['attrition_rate'] = df_outcome.apply(fill_nan_with_higher_manager, axis=1)
    # For supervisors that still have no rate, fall back to the manager_01 rate.
    missing_supervisors = df_outcome[df_outcome['attrition_rate'].isna()]['supervisor_email']
    df_outcome.loc[df_outcome['supervisor_email'].isin(missing_supervisors), 'attrition_rate'] = df_outcome.loc[df_outcome['supervisor_email'].isin(missing_supervisors), 'manager_01_email'].map(df_attr.set_index('supervisor_email')['attrition_rate'])
    return df_outcome

df_outcome = merge_attrition_rates(df1, df_attr)
print(df_outcome)
This returns:
employee_id supervisor_email manager_01_email manager_02_email \
0 1011 [email protected] [email protected] [email protected]
1 1012 [email protected] [email protected] [email protected]
2 1013 [email protected] [email protected] [email protected]
3 1014 [email protected] [email protected] [email protected]
4 1015 [email protected] [email protected] [email protected]
manager_03_email manager_04_email attrition_rate
0 [email protected] [email protected] 0.10
1 [email protected] [email protected] 0.25
2 [email protected] [email protected] 0.25
3 [email protected] [email protected] 0.35
4 [email protected] [email protected] NaN
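The example above drops the quarter column, and its last row still ends up as NaN. As a possible variation, here is a minimal sketch that matches on both quarter and email and skips NaN rates while moving up the hierarchy. It is not a definitive implementation: the function name `add_mgr_attrition_rate` and all addresses are hypothetical placeholders, and it assumes each (quarter, supervisor_email) pair occurs at most once in df_attr.

```python
import numpy as np
import pandas as pd

# Hypothetical data: these addresses are placeholders, not the ones from the question.
df1 = pd.DataFrame({
    "quarter": ["q1", "q1", "q2"],
    "employee_id": [1, 2, 3],
    "supervisor_email": ["s1@example.com", "s2@example.com", "s3@example.com"],
    "manager_01_email": ["top@example.com"] * 3,
    "manager_02_email": ["m2@example.com"] * 3,
    "manager_03_email": ["m3@example.com", "m3@example.com", "s3@example.com"],
    "manager_04_email": ["s1@example.com", "m4@example.com", "s3@example.com"],
})
df_attr = pd.DataFrame({
    "quarter": ["q1", "q1", "q2", "q2", "q1"],
    "supervisor_email": ["s1@example.com", "m4@example.com", "s3@example.com",
                         "m2@example.com", "top@example.com"],
    "attrition_rate": [0.3, 0.25, np.nan, 0.6, 0.1],
})

# Lookup order: the supervisor first, then manager_04 (lowest) up to manager_01 (top).
LEVELS = ["supervisor_email", "manager_04_email", "manager_03_email",
          "manager_02_email", "manager_01_email"]

def add_mgr_attrition_rate(df, attr):
    # Drop NaN rates so a NaN rate and a missing row are treated alike.
    known = attr.dropna(subset=["attrition_rate"])
    per_level = {}
    for col in LEVELS:
        # Rename the key so the same (quarter, email) join works for every level.
        right = known.rename(columns={"supervisor_email": col})
        merged = df[["quarter", col]].merge(
            right, on=["quarter", col], how="left", validate="m:1")
        per_level[col] = merged["attrition_rate"].to_numpy()
    rates = pd.DataFrame(per_level, index=df.index)
    # bfill along columns pulls the first non-NaN rate into the first column.
    return df.assign(mgr_attrition_rate=rates.bfill(axis=1).iloc[:, 0])

out = add_mgr_attrition_rate(df1, df_attr)
print(out[["quarter", "employee_id", "mgr_attrition_rate"]])
```

With this placeholder data the three rows get 0.3, 0.25 and 0.6. The third row climbs to manager_02 because its supervisor has a NaN rate and is repeated at manager_04 and manager_03. The `validate="m:1"` argument makes the merge raise an error if df_attr contains duplicate (quarter, email) pairs, rather than silently duplicating rows.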