How to fill null values in columns after an outer join in Python Pandas


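My goal is to join two dataframes from different sources in Python using Pandas, and then fill the null values in each column with the corresponding values from that same column.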

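The dataframes have similar columns, but because the data sources differ, some text/object columns may hold different values. For example, the "Name" column may contain "Nick M." in one dataframe and "Nick Mason" in the other. However, some columns, such as "Date" (formatted YYYY-MM-DD), "Order Id" (numeric), and "Employee Id" (numeric), have consistent values in both dataframes, and these are the columns we join on. It is worth mentioning that some rows may not exist in one dataframe or the other at all, but their columns should still be filled.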

import pandas as pd

# Create DataFrame df1
df1_data = {
    'Date (df1)': ['2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', '2024-03-19', '2024-03-19'],
    'Order Id (df1)': [1, 2, 3, 4, 5, 1, 2],
    'Employee Id (df1)': [825, 825, 825, 825, 825, 825, 825],
    'Name (df1)': ['Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.'],
    'Region (df1)': ['SD', 'SD', 'SD', 'SD', 'SD', 'SD', 'SD'],
    'Value (df1)': [25, 37, 18, 24, 56, 77, 25]
}
df1 = pd.DataFrame(df1_data)

# Create DataFrame df2
df2_data = {
    'Date (df2)': ['2024-03-18', '2024-03-18', '2024-03-18', '2024-03-19', '2024-03-19', '2024-03-19', '2024-03-19'],
    'Order Id (df2)': [1, 2, 3, 1, 2, 3, 4],
    'Employee Id (df2)': [825, 825, 825, 825, 825, 825, 825],
    'Name (df2)': ['Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason'],
    'Region (df2)': ['San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego'],
    'Value (df2)': [25, 37, 19, 22, 17, 9, 76]
}
df2 = pd.DataFrame(df2_data)

# Combine DataFrames with an outer join on date, employee and order
outer_joined_df = pd.merge(
    df1,
    df2,
    how='outer',
    left_on=['Date (df1)', 'Employee Id (df1)', 'Order Id (df1)'],
    right_on=['Date (df2)', 'Employee Id (df2)', 'Order Id (df2)']
)

# Display the result
outer_joined_df

This is the output of the joined dataframe. The null values highlighted in yellow should be filled.
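The rows that need filling can also be located programmatically rather than by eye. The following is a minimal sketch, assuming the `indicator` argument of `pd.merge` is acceptable here; the frames `left` and `right` are small hypothetical stand-ins with the same column naming scheme, not the data from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the two sources (not the question's data)
left = pd.DataFrame({
    "Date (df1)": ["2024-03-18", "2024-03-18"],
    "Order Id (df1)": [1, 4],
    "Name (df1)": ["Nick M.", "Nick M."],
})
right = pd.DataFrame({
    "Date (df2)": ["2024-03-18", "2024-03-19"],
    "Order Id (df2)": [1, 3],
    "Name (df2)": ["Nick Mason", "Nick Mason"],
})

merged = pd.merge(
    left,
    right,
    how="outer",
    left_on=["Date (df1)", "Order Id (df1)"],
    right_on=["Date (df2)", "Order Id (df2)"],
    indicator=True,  # adds a "_merge" column: left_only / right_only / both
)

# Rows present in only one source are the ones that contain nulls to fill
needs_fill = merged[merged["_merge"] != "both"]
print(needs_fill[["Name (df1)", "Name (df2)", "_merge"]])
```

Rows marked `left_only` have nulls in every `(df2)` column, and rows marked `right_only` have nulls in every `(df1)` column.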

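I tried the code below. It works as expected for the Date, Order Id, and Employee Id columns (because they are identical in both dataframes and we join on them), but not for the other columns, since those may hold different values. The logic is simple: if a value is null, fill it with the value from the same row of the other specified column. However, because the values can differ between sources, the filled column becomes messy: it ends up containing multiple variants of the same value.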

outer_joined_df['Date (df1)'] = outer_joined_df['Date (df1)'].combine_first(outer_joined_df['Date (df2)'])
outer_joined_df['Date (df2)'] = outer_joined_df['Date (df2)'].combine_first(outer_joined_df['Date (df1)'])

outer_joined_df['Order Id (df1)'] = outer_joined_df['Order Id (df1)'].combine_first(outer_joined_df['Order Id (df2)'])
outer_joined_df['Order Id (df2)'] = outer_joined_df['Order Id (df2)'].combine_first(outer_joined_df['Order Id (df1)'])

outer_joined_df['Employee Id (df1)'] = outer_joined_df['Employee Id (df1)'].combine_first(outer_joined_df['Employee Id (df2)'])
outer_joined_df['Employee Id (df2)'] = outer_joined_df['Employee Id (df2)'].combine_first(outer_joined_df['Employee Id (df1)'])

outer_joined_df['Name (df1)'] = outer_joined_df['Name (df1)'].combine_first(outer_joined_df['Name (df2)'])
outer_joined_df['Name (df2)'] = outer_joined_df['Name (df2)'].combine_first(outer_joined_df['Name (df1)'])

outer_joined_df['Region (df1)'] = outer_joined_df['Region (df1)'].combine_first(outer_joined_df['Region (df2)'])
outer_joined_df['Region (df2)'] = outer_joined_df['Region (df2)'].combine_first(outer_joined_df['Region (df1)'])

Here is the output:

As you can see, it fills the data, but not the way I want.
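The effect can be reproduced in isolation. This is a small sketch using two hypothetical Series, `name_df1` and `name_df2`, that stand in for the `Name (df1)` and `Name (df2)` columns after the join:

```python
import pandas as pd

# Hypothetical stand-ins for the two Name columns after the outer join
name_df1 = pd.Series(["Nick M.", "Nick M.", None])
name_df2 = pd.Series(["Nick Mason", None, "Nick Mason"])

# combine_first copies the other column's spelling into the gaps
filled_df1 = name_df1.combine_first(name_df2)
print(filled_df1.tolist())
# ['Nick M.', 'Nick M.', 'Nick Mason']
```

The last row receives the df2 spelling, so the df1 column now holds two variants of the same name.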

The output I need:

python pandas join outer-join
1 Answer
# a list with all column names, minus `(dfx)`
columns = ["Date", "Order Id", "Employee Id", "Name", "Region", "Value"]

# create a dict with a relation between values in df1 and df2, both ways
value_relations = {}
for col in columns:
    relations = (
        outer_joined_df[[f"{col} (df1)", f"{col} (df2)"]]
        .drop_duplicates()
        .dropna()
        .to_dict("tight")
        .get("data")
    )
    value_relations[col] = {k: v for k, v in relations}
    value_relations[col].update({v: k for k, v in relations})

for col in columns:
    # fill values of df1 with the related value of df2
    outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
        outer_joined_df[f"{col} (df2)"].map(value_relations[col])
    )
    # fill values of df2 with the related value of df1
    outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
        outer_joined_df[f"{col} (df1)"].map(value_relations[col])
    )
    # fill remaining null values of df1
    outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
        outer_joined_df[f"{col} (df2)"]
    )
    # fill remaining null values of df2
    outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
        outer_joined_df[f"{col} (df1)"]
    )
   Date (df1)  Order Id (df1)  Employee Id (df1) Name (df1) Region (df1)  ...  Order Id (df2) Employee Id (df2)  Name (df2)  Region (df2) Value (df2)
0  2024-03-18             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        25.0
1  2024-03-18             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        37.0
2  2024-03-18             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego        19.0
3  2024-03-18             4.0              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        24.0
4  2024-03-18             5.0              825.0    Nick M.           SD  ...             5.0             825.0  Nick Mason     San Diego        56.0
5  2024-03-19             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        22.0
6  2024-03-19             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        17.0
7  2024-03-19             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego         9.0
8  2024-03-19             4.0              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        76.0

[9 rows x 12 columns]
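One caveat: a value-to-value dictionary only behaves predictably when each value in one source pairs with exactly one value in the other. That holds for Name and Region in the sample data, but not for Value, where 25 in df1 is paired with both 25 and 17 in df2. The sketch below is one possible guard, not part of the answer above: the helper `one_to_one_mapping` is a hypothetical name, and the small frame is made-up data that keeps only unambiguous pairs.

```python
import pandas as pd

def one_to_one_mapping(df, src, dst):
    """Map src values to dst values, keeping only unambiguous pairs (hypothetical helper)."""
    pairs = df[[src, dst]].dropna().drop_duplicates()
    # Discard any value that is paired with more than one partner
    ambiguous = pairs[src].duplicated(keep=False) | pairs[dst].duplicated(keep=False)
    pairs = pairs[~ambiguous]
    return dict(zip(pairs[src], pairs[dst]))

# Made-up joined frame: rows 0-1 matched, row 2 is df1-only, row 3 is df2-only
df = pd.DataFrame({
    "Name (df1)": ["Nick M.", "Nick M.", "Nick M.", None],
    "Name (df2)": ["Nick Mason", "Nick Mason", None, "Nick Mason"],
    "Value (df1)": [25.0, 25.0, 24.0, None],
    "Value (df2)": [25.0, 17.0, None, 9.0],
})

name_2_to_1 = one_to_one_mapping(df, "Name (df2)", "Name (df1)")
name_1_to_2 = one_to_one_mapping(df, "Name (df1)", "Name (df2)")
value_1_to_2 = one_to_one_mapping(df, "Value (df1)", "Value (df2)")

# Fill each Name column with the spelling used by its own source
df["Name (df1)"] = df["Name (df1)"].fillna(df["Name (df2)"].map(name_2_to_1))
df["Name (df2)"] = df["Name (df2)"].fillna(df["Name (df1)"].map(name_1_to_2))

print(name_2_to_1)   # {'Nick Mason': 'Nick M.'}
print(value_1_to_2)  # {} because 25 pairs with both 25 and 17
```

With an empty mapping for Value, nothing is translated for that column, and a plain `fillna` from the other side (as in the last two steps of the loop above) can be applied instead.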