I have 2 separate dataframes, df1 and df_attr. df1 contains hierarchical data, where df1.manager_01_email is at the top of the hierarchy and df1.manager_04_email is the lowest level.
import pandas as pd
import numpy as np
data1 = [["employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email"],
[[1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"],
[1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]"],
[1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]"],
[1014, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]"],
[1015, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"]]]
data_attr = [["supervisor_email", "attrition_rate"],
[ ["[email protected]", 0.3], ["[email protected]", 0.25],["[email protected]", 0.35],
["[email protected]", np.nan],["[email protected]", 0.1]]]
df1 = pd.DataFrame(data=data1[1], columns=data1[0])
df_attr = pd.DataFrame(data=data_attr[1], columns=data_attr[0])
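I need the "attrition_rate" column from the df_attr dataframe, but because it contains NaN values and some supervisors are missing from it altogether, I need to create a function that handles 2 conditions.
Condition 1 (NaN value): for employee_id = 1015, his supervisor_email "[email protected]" has a NaN value in the attrition_rate column. I need to fill it with the attrition_rate of the manager higher up in the hierarchy, which is "[email protected]".
Condition 2 (missing value): for employee_id = 1012, his supervisor_email "[email protected]" is not found in df_attr. I need to fill it with the attrition_rate of the manager higher up in the hierarchy, which in this case is "[email protected]".
The result I want to achieve is: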
outcome = [["employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email", "mgr_attrition_rate"],
[[1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.1],
[1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]", 0.3],
[1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.25],
[1014, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.35],
[1015, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.3]]]
df_outcome = pd.DataFrame(data=outcome[1], columns=outcome[0])
I have tried the code below, but it does not seem to work. Any help or assistance is appreciated, thank you!
def merge_attrition_rates(df1, df_attr):
    """
    Merges the 'attrition_rate' column from df_attr into df1 based on supervisor_email.
    Handles NaN values and missing supervisor emails.
    Args:
        df1 (pd.DataFrame): Hierarchy dataframe with columns 'employee_id', 'supervisor_email', and others.
        df_attr (pd.DataFrame): Attributes dataframe with columns 'supervisor_email' and 'attrition_rate'.
    Returns:
        pd.DataFrame: Merged dataframe with the 'attrition_rate' column added to df1.
    """
    # Merge based on supervisor_email
    df_outcome = pd.merge(df1, df_attr, on='supervisor_email', how='left')
    # Condition 1: Fill NaN values with attrition_rate of higher manager
    df_outcome['attrition_rate'].fillna(df_outcome["manager_01_email"].map(df_attr.set_index("supervisor_email")["attrition_rate"]), inplace=True)
    # Condition 2: Fill missing supervisor_email with manager's attrition_rate
    missing_supervisors = df_outcome[df_outcome["attrition_rate"].isna()]["supervisor_email"]
    df_outcome.loc[df_outcome["supervisor_email"].isin(missing_supervisors), "attrition_rate"] = df_outcome.loc[df_outcome["supervisor_email"].isin(missing_supervisors), "manager_01_email"].map(df_attr.set_index("supervisor_email")["attrition_rate"])
    return df_outcome
# apply function to create new df:
df_outcome = merge_attrition_rates(df1, df_attr)
print(df_outcome)
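You're almost there. You need to handle the cases where NaN can occur. To do that, I created a helper function that deals with those cases. If you prefer, you can of course move it outside the merge_attrition_rates function.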
import pandas as pd
import numpy as np
data1 = [["employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email"],
[[1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"],
[1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]"],
[1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]"],
[1014, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]"],
[1015, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"]]]
data_attr = [["supervisor_email", "attrition_rate"],
[ ["[email protected]", 0.3], ["[email protected]", 0.25],["[email protected]", 0.35],
["[email protected]", np.nan],["[email protected]", 0.1]]]
df1 = pd.DataFrame(data=data1[1], columns=data1[0])
df_attr = pd.DataFrame(data=data_attr[1], columns=data_attr[0])
def merge_attrition_rates(df1, df_attr):
    """
    Merges the 'attrition_rate' column from df_attr into df1 based on supervisor_email.
    Handles NaN values and missing supervisor emails.
    Args:
        df1 (pd.DataFrame): Hierarchy dataframe with columns 'employee_id', 'supervisor_email', and others.
        df_attr (pd.DataFrame): Attributes dataframe with columns 'supervisor_email' and 'attrition_rate'.
    Returns:
        pd.DataFrame: Merged dataframe with the 'attrition_rate' column added to df1.
    """
    # Left merge keeps every employee; unmatched supervisors get NaN
    df_outcome = pd.merge(df1, df_attr, on='supervisor_email', how='left', suffixes=('', '_supervisor'))
    def fill_nan_with_higher_manager(row):
        if pd.isna(row['attrition_rate']):
            # Walk from manager_04 (lowest level) up to manager_01 (top)
            for i in range(4, 0, -1):
                manager_email = row[f'manager_0{i}_email']
                if pd.notna(manager_email):
                    manager_attrition = df_attr.loc[df_attr['supervisor_email'] == manager_email, 'attrition_rate']
                    # Return the first manager found in df_attr (its rate may itself be NaN)
                    if not manager_attrition.empty:
                        return manager_attrition.values[0]
        return row['attrition_rate']
    df_outcome['attrition_rate'] = df_outcome.apply(fill_nan_with_higher_manager, axis=1)
    # Rows that are still NaN fall back to the rate of manager_01_email
    missing_supervisors = df_outcome[df_outcome['attrition_rate'].isna()]['supervisor_email']
    df_outcome.loc[df_outcome['supervisor_email'].isin(missing_supervisors), 'attrition_rate'] = df_outcome.loc[df_outcome['supervisor_email'].isin(missing_supervisors), 'manager_01_email'].map(df_attr.set_index('supervisor_email')['attrition_rate'])
    return df_outcome
df_outcome = merge_attrition_rates(df1, df_attr)
print(df_outcome)
This returns:
employee_id supervisor_email manager_01_email manager_02_email \
0 1011 [email protected] [email protected] [email protected]
1 1012 [email protected] [email protected] [email protected]
2 1013 [email protected] [email protected] [email protected]
3 1014 [email protected] [email protected] [email protected]
4 1015 [email protected] [email protected] [email protected]
manager_03_email manager_04_email attrition_rate
0 [email protected] [email protected] 0.10
1 [email protected] [email protected] 0.25
2 [email protected] [email protected] 0.25
3 [email protected] [email protected] 0.35
4 [email protected] [email protected] NaN
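Note that the helper above returns the rate of the first manager it finds in df_attr, even when that manager's own rate is NaN, which may be why a NaN can survive in the result. A possible variant, offered only as a sketch, treats a NaN rate exactly like a missing entry and keeps walking up the hierarchy until it finds a usable value. The function name add_mgr_attrition_rate, the demo_* variables, and the example.com addresses are all hypothetical and invented for this illustration, and the sketch assumes each email appears at most once in df_attr.

```python
import numpy as np
import pandas as pd


def add_mgr_attrition_rate(df1, df_attr):
    """Hypothetical helper: first non-NaN rate found from supervisor up to manager_01."""
    # Lookup table email -> rate. Dropping NaN rates makes a NaN entry
    # behave exactly like an email that is absent from df_attr.
    # Assumption: emails in df_attr are unique.
    rates = (
        df_attr.dropna(subset=["attrition_rate"])
        .set_index("supervisor_email")["attrition_rate"]
    )
    # Search order: the supervisor first, then manager_04 (lowest level)
    # up to manager_01 (top of the hierarchy).
    levels = ["supervisor_email"] + [f"manager_0{i}_email" for i in range(4, 0, -1)]
    # One column of looked-up rates per level, in search order.
    looked_up = pd.concat([df1[col].map(rates) for col in levels], axis=1)
    out = df1.copy()
    # Back-filling along each row pulls the first non-NaN rate into column 0.
    out["mgr_attrition_rate"] = looked_up.bfill(axis=1).iloc[:, 0]
    return out


# Made-up demo data (placeholder addresses, not the ones from the question).
demo_df1 = pd.DataFrame(
    {
        "employee_id": [1, 2, 3],
        "supervisor_email": ["sup_a@example.com", "sup_b@example.com", "sup_c@example.com"],
        "manager_01_email": ["top@example.com", "top@example.com", "top@example.com"],
        "manager_02_email": ["m2@example.com", "m2@example.com", "m2@example.com"],
        "manager_03_email": ["m3@example.com", "m3@example.com", "m3x@example.com"],
        "manager_04_email": ["m4@example.com", "m4@example.com", "m4x@example.com"],
    }
)
demo_attr = pd.DataFrame(
    {
        "supervisor_email": [
            "sup_a@example.com",  # has a rate
            "sup_b@example.com",  # present, but rate is NaN
            "m4@example.com",     # present, but rate is NaN
            "m3@example.com",
            "top@example.com",
        ],
        "attrition_rate": [0.1, np.nan, np.nan, 0.25, 0.3],
    }
)

demo_result = add_mgr_attrition_rate(demo_df1, demo_attr)
print(demo_result[["employee_id", "mgr_attrition_rate"]])
```

In this made-up data, employee 1 takes the supervisor's own rate (0.1), employee 2 skips two NaN entries and lands on manager_03 (0.25), and employee 3 finds nothing until manager_01 (0.3). Whether skipping NaN managers is the right business rule is a judgment call; the original helper stops at the first manager present in df_attr instead.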