Create a function that replaces NaN values by looking up hierarchy columns


I have two separate DataFrames, df1 and df_attr. df1 holds hierarchy data, where df1.manager_01_email is the top of the hierarchy and df1.manager_04_email is the lowest level.

import numpy as np
import pandas as pd

data1 = [["employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email"],
         [[1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"], 
          [1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]"],
          [1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          [1014, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          [1015, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"]]]


data_attr = [["supervisor_email", "attrition_rate"],
         [ ["[email protected]", 0.3], ["[email protected]", 0.25],["[email protected]", 0.35],
          ["[email protected]", np.nan],["[email protected]", 0.1]]]  # np.nan: the np.NaN alias was removed in NumPy 2.0


df1 = pd.DataFrame(data=data1[1], columns=data1[0])
df_attr = pd.DataFrame(data=data_attr[1], columns=data_attr[0])

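I need the "attrition_rate" column from df_attr, but because of NaN values and missing entries for attrition_rate, I need to create a function that handles two conditions.

Condition 1 (NaN values): for "employee_id" = 1015, his supervisor_email "[email protected]" has a NaN value in the attrition_rate column. I need to fill it with the attrition_rate of the manager higher up in the hierarchy, which is "[email protected]".

Condition 2 (missing values): for "employee_id" = 1012, his supervisor_email "[email protected]" is not found in df_attr. I need to fill it with the attrition_rate of the manager higher up in the hierarchy, which in this case is "[email protected]".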

The result I want to achieve is:

outcome = [["employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email", "mgr_attrition_rate"],
         [[1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.1], 
          [1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]", 0.3],
          [1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.25],
          [1014, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.35],
          [1015, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.3]]]

df_outcome = pd.DataFrame(data=outcome[1], columns=outcome[0])

I have tried the code below, but it does not seem to work. Any help or assistance is appreciated, thank you!

def merge_attrition_rates(df1, df_attr):
    """
    Merges the 'attrition_rate' column from df_attr into df1 based on supervisor_email.
    Handles NaN values and missing supervisor emails.

    Args:
        df1 (pd.DataFrame): Hierarchy dataframe with columns 'employee_id', 'supervisor_email', and others.
        df_attr (pd.DataFrame): Attributes dataframe with columns 'supervisor_email' and 'attrition_rate'.

    Returns:
        pd.DataFrame: Merged dataframe with the 'attrition_rate' column added to df1.
    """
    # Merge based on supervisor_email
    df_outcome = pd.merge(df1, df_attr, on='supervisor_email', how='left')

    # Condition 1: Fill NaN values with attrition_rate of higher manager
    df_outcome['attrition_rate'].fillna(df_outcome["manager_01_email"].map(df_attr.set_index("supervisor_email")["attrition_rate"]), inplace=True)

    # Condition 2: Fill missing supervisor_email with manager's attrition_rate
    missing_supervisors = df_outcome[df_outcome["attrition_rate"].isna()]["supervisor_email"]
    df_outcome.loc[df_outcome["supervisor_email"].isin(missing_supervisors), "attrition_rate"] = df_outcome.loc[df_outcome["supervisor_email"].isin(missing_supervisors), "manager_01_email"].map(df_attr.set_index("supervisor_email")["attrition_rate"])

    return df_outcome

# apply function to create new df:
df_outcome = merge_attrition_rates(df1, df_attr)
print(df_outcome)
python-3.x pandas dataframe function hierarchy
1 Answer

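You're almost there. You need to handle the cases where NaN can occur. For that I created a helper function that deals with those cases. You can of course move it outside the merge_attrition_rates function if you prefer.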

import pandas as pd
import numpy as np

data1 = [["employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email"],
         [[1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"], 
          [1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]"],
          [1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          [1014, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          [1015, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"]]]

data_attr = [["supervisor_email", "attrition_rate"],
         [ ["[email protected]", 0.3], ["[email protected]", 0.25],["[email protected]", 0.35],
          ["[email protected]", np.nan],["[email protected]", 0.1]]]  # np.nan: the np.NaN alias was removed in NumPy 2.0

df1 = pd.DataFrame(data=data1[1], columns=data1[0])
df_attr = pd.DataFrame(data=data_attr[1], columns=data_attr[0])

def merge_attrition_rates(df1, df_attr):
    """
    Merges the 'attrition_rate' column from df_attr into df1 based on supervisor_email.
    Handles NaN values and missing supervisor emails.

    Args:
        df1 (pd.DataFrame): Hierarchy dataframe with columns 'employee_id', 'supervisor_email', and others.
        df_attr (pd.DataFrame): Attributes dataframe with columns 'supervisor_email' and 'attrition_rate'.

    Returns:
        pd.DataFrame: Merged dataframe with the 'attrition_rate' column added to df1.
    """
    df_outcome = pd.merge(df1, df_attr, on='supervisor_email', how='left', suffixes=('', '_supervisor'))

    def fill_nan_with_higher_manager(row):
        if pd.isna(row['attrition_rate']):
            for i in range(4, 0, -1):
                manager_email = row[f'manager_0{i}_email']
                if pd.notna(manager_email):
                    manager_attrition = df_attr.loc[df_attr['supervisor_email'] == manager_email, 'attrition_rate']
                    if not manager_attrition.empty:
                        return manager_attrition.values[0]
        return row['attrition_rate']

    df_outcome['attrition_rate'] = df_outcome.apply(fill_nan_with_higher_manager, axis=1)

    missing_supervisors = df_outcome[df_outcome['attrition_rate'].isna()]['supervisor_email']
    df_outcome.loc[df_outcome['supervisor_email'].isin(missing_supervisors), 'attrition_rate'] = df_outcome.loc[df_outcome['supervisor_email'].isin(missing_supervisors), 'manager_01_email'].map(df_attr.set_index('supervisor_email')['attrition_rate'])

    return df_outcome

df_outcome = merge_attrition_rates(df1, df_attr)
print(df_outcome)

This returns:

 employee_id    supervisor_email manager_01_email    manager_02_email  \
0         1011    [email protected]     [email protected]  [email protected]   
1         1012   [email protected]     [email protected]  [email protected]   
2         1013    [email protected]     [email protected]  [email protected]   
3         1014  [email protected]     [email protected]  [email protected]   
4         1015     [email protected]     [email protected]  [email protected]   

     manager_03_email    manager_04_email  attrition_rate  
0     [email protected]    [email protected]            0.10  
1    [email protected]   [email protected]            0.25  
2    [email protected]    [email protected]            0.25  
3  [email protected]  [email protected]            0.35  
4     [email protected]     [email protected]             NaN  