Create a function to replace NaN values by traversing up the hierarchy columns


I have two separate DataFrames, df1 and df_attr. df1 holds hierarchy data, where df1.manager_01_email is at the top of the hierarchy and df1.manager_04_email is the lowest level.

data1 = [["quarter","employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email"],
         [["y2022q1",1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"], 
          ["y2022q1",1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]"],
          ["y2022q2",1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          ["y2022q2",1011, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          ["y2022q2",1012, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"]]]


data_attr = [["quarter","supervisor_email", "attrition_rate"],
         [ ["y2022q1","[email protected]", 0.3], ["y2022q2","[email protected]", 0.6],["y2022q1","[email protected]", 0.25],["y2022q2","[email protected]", 0.1],["y2022q1","[email protected]", 0.7],["y2022q2","[email protected]", 0.35],["y2022q2","[email protected]", np.nan],["y2022q1","[email protected]", 0.1],["y2022q2","[email protected]", 0.8]]]


df1 = pd.DataFrame(data=data1[1], columns=data1[0])
df_attr = pd.DataFrame(data=data_attr[1], columns=data_attr[0])

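I need the "attrition_rate" column from df_attr, but because that column contains NaN values and some supervisors have no attrition_rate entry at all, I need a function that handles the missing values.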

For example, for "employee_id" = 1012, his supervisor_email "[email protected]" has a NaN value in df_attr.attrition_rate for y2022q2. I need to fill it with the attrition_rate of a manager higher in the hierarchy, but because "[email protected]" appears in both manager_04_email and manager_03_email, we have to traverse up to the next level, i.e. the manager_02_email "[email protected]" for y2022q2.

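For "employee_id" = 1012, his supervisor_email "[email protected]" cannot be found in df_attr at all, so his attrition_rate is NaN after the merge. I need to fill it with the attrition_rate of a manager higher in the hierarchy, in this case "[email protected]" for y2022q1.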

The result I want to achieve is:

outcome = [["quarter","employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email", "mgr_attrition_rate"],
         [["y2022q1",1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.1], 
          ["y2022q1",1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]", 0.25],
          ["y2022q2",1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.1],
          ["y2022q2",1011, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.35],
          ["y2022q2",1012, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.6]]]

df_outcome = pd.DataFrame(data=outcome[1], columns=outcome[0])

I have tried the following based on suggestions, but I still cannot get it to work. Any help is appreciated, thank you.

# Merge dataframes based on quarter and supervisor_email
df_outcome = pd.merge(df1, df_attr, how='left',left_on=['quarter','supervisor_email'],right_on=['quarter','supervisor_email'])

#create function to handle NaNs
def fill_nan_with_higher_manager(row):
        if pd.isna(row['attrition_rate']):
            for i in range(4, 0, -1):
                higher_manager_email = row[f'manager_0{i}_email']
                if pd.notna(higher_manager_email):
                    higher_manager_attrition = df_attr.loc[(df_attr["supervisor_email"] == higher_manager_email) & (df_attr["quarter"] == higher_manager_quarter), 'attrition_rate']
                    if not higher_manager_attrition.empty:
                        return higher_manager_attrition.values[0]
        return row['attrition_rate']

df_outcome['attrition_rate'] = df_outcome.apply(fill_nan_with_higher_manager, axis=1)

python-3.x pandas dataframe function hierarchy
1 Answer

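You were almost there. You need to handle the cases where NaN values can occur. For this I created a helper function that handles those cases. You can of course move it outside the merge_attrition_rates function if you like.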

import pandas as pd
import numpy as np

data1 = [["employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email"],
         [[1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"], 
          [1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]"],
          [1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          [1014, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          [1015, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"]]]

data_attr = [["supervisor_email", "attrition_rate"],
         [ ["[email protected]", 0.3], ["[email protected]", 0.25],["[email protected]", 0.35],
          ["[email protected]", np.nan],["[email protected]", 0.1]]]

df1 = pd.DataFrame(data=data1[1], columns=data1[0])
df_attr = pd.DataFrame(data=data_attr[1], columns=data_attr[0])

def merge_attrition_rates(df1, df_attr):
    """
    Merges the 'attrition_rate' column from df_attr into df1 based on supervisor_email.
    Handles NaN values and missing supervisor emails.

    Args:
        df1 (pd.DataFrame): Hierarchy dataframe with columns 'employee_id', 'supervisor_email', and others.
        df_attr (pd.DataFrame): Attributes dataframe with columns 'supervisor_email' and 'attrition_rate'.

    Returns:
        pd.DataFrame: Merged dataframe with the 'attrition_rate' column added to df1.
    """
    df_outcome = pd.merge(df1, df_attr, on='supervisor_email', how='left', suffixes=('', '_supervisor'))

    def fill_nan_with_higher_manager(row):
        if pd.isna(row['attrition_rate']):
            for i in range(4, 0, -1):
                manager_email = row[f'manager_0{i}_email']
                if pd.notna(manager_email):
                    manager_attrition = df_attr.loc[df_attr['supervisor_email'] == manager_email, 'attrition_rate']
                    if not manager_attrition.empty:
                        return manager_attrition.values[0]
        return row['attrition_rate']

    df_outcome['attrition_rate'] = df_outcome.apply(fill_nan_with_higher_manager, axis=1)

    missing_supervisors = df_outcome[df_outcome['attrition_rate'].isna()]['supervisor_email']
    df_outcome.loc[df_outcome['supervisor_email'].isin(missing_supervisors), 'attrition_rate'] = df_outcome.loc[df_outcome['supervisor_email'].isin(missing_supervisors), 'manager_01_email'].map(df_attr.set_index('supervisor_email')['attrition_rate'])

    return df_outcome

df_outcome = merge_attrition_rates(df1, df_attr)
print(df_outcome)

This returns:

 employee_id    supervisor_email manager_01_email    manager_02_email  \
0         1011    [email protected]     [email protected]  [email protected]   
1         1012   [email protected]     [email protected]  [email protected]   
2         1013    [email protected]     [email protected]  [email protected]   
3         1014  [email protected]     [email protected]  [email protected]   
4         1015     [email protected]     [email protected]  [email protected]   

     manager_03_email    manager_04_email  attrition_rate  
0     [email protected]    [email protected]            0.10  
1    [email protected]   [email protected]            0.25  
2    [email protected]    [email protected]            0.25  
3  [email protected]  [email protected]            0.35  
4     [email protected]     [email protected]             NaN  