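My goal is to join two DataFrames from different sources in Python using Pandas, and then fill the null values in each column with the corresponding values from the same column.
The DataFrames have similar columns, but because the data sources differ, some text/object columns may hold different values. For example, the "Name" column in one DataFrame may contain "Nick M" and in the other "Nick Maison". However, certain columns, such as "Date" (formatted YYYY-MM-DD), "Order Id" (numeric) and "Employee Id" (numeric), have consistent values in both DataFrames (we join the DataFrames on them). It is also worth mentioning that some columns may not even exist in one or the other DataFrame, but they should be filled as well.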
import pandas as pd

# Create DataFrame df1
df1_data = {
    'Date (df1)': ['2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19"],
    'Order Id (df1)': [1, 2, 3, 4, 5, 1, 2],
    'Employee Id (df1)': [825, 825, 825, 825, 825, 825, 825],
    'Name (df1)': ['Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.'],
    'Region (df1)': ['SD', 'SD', 'SD', 'SD', 'SD', 'SD', 'SD'],
    'Value (df1)': [25, 37, 18, 24, 56, 77, 25]
}
df1 = pd.DataFrame(df1_data)

# Create DataFrame df2
df2_data = {
    'Date (df2)': ['2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19", "2024-03-19", "2024-03-19"],
    'Order Id (df2)': [1, 2, 3, 1, 2, 3, 4],
    'Employee Id (df2)': [825, 825, 825, 825, 825, 825, 825],
    'Name (df2)': ['Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason'],
    'Region (df2)': ['San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego'],
    'Value (df2)': [25, 37, 19, 22, 17, 9, 76]
}
df2 = pd.DataFrame(df2_data)

# Combine DataFrames with an outer join on the three key columns
outer_joined_df = pd.merge(
    df1,
    df2,
    how='outer',
    left_on=['Date (df1)', 'Employee Id (df1)', "Order Id (df1)"],
    right_on=['Date (df2)', 'Employee Id (df2)', "Order Id (df2)"]
)

# Display the result
outer_joined_df
This is the output of the joined DataFrame. The null values highlighted in yellow should be filled.
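The nulls in an outer join come from rows that exist in only one of the two frames. As a minimal sketch (using small hypothetical frames `left` and `right`, not the data above), passing `indicator=True` to `pd.merge` adds a `_merge` column that labels each row, which can help to see exactly which rows need filling:

```python
import pandas as pd

# Tiny stand-ins for df1 and df2 (hypothetical data for illustration only)
left = pd.DataFrame({"Order Id (df1)": [1, 2, 3], "Name (df1)": ["Nick M."] * 3})
right = pd.DataFrame({"Order Id (df2)": [2, 3, 4], "Name (df2)": ["Nick Mason"] * 3})

merged = pd.merge(
    left,
    right,
    how="outer",
    left_on="Order Id (df1)",
    right_on="Order Id (df2)",
    indicator=True,  # adds a "_merge" column: left_only / right_only / both
)

# Rows marked left_only or right_only are the ones holding nulls after the join
print(merged[["Order Id (df1)", "Order Id (df2)", "_merge"]])
```

Rows marked `both` already have values on each side; only the `left_only` and `right_only` rows are candidates for filling.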
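I tried the code below. It works as expected for the Date, Order Id and Employee Id columns (they are identical in both DataFrames, and we join on them), but not for the other columns, because those may hold different values. Basically, the logic in this code is: if a value is null, fill it with the value from the same row of the specified column. However, since the values can differ, the filled column becomes messy because it ends up with multiple variants of the same value.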
outer_joined_df['Date (df1)'] = outer_joined_df['Date (df1)'].combine_first(outer_joined_df['Date (df2)'])
outer_joined_df['Date (df2)'] = outer_joined_df['Date (df2)'].combine_first(outer_joined_df['Date (df1)'])
outer_joined_df['Order Id (df1)'] = outer_joined_df['Order Id (df1)'].combine_first(outer_joined_df['Order Id (df2)'])
outer_joined_df['Order Id (df2)'] = outer_joined_df['Order Id (df2)'].combine_first(outer_joined_df['Order Id (df1)'])
outer_joined_df['Employee Id (df1)'] = outer_joined_df['Employee Id (df1)'].combine_first(outer_joined_df['Employee Id (df2)'])
outer_joined_df['Employee Id (df2)'] = outer_joined_df['Employee Id (df2)'].combine_first(outer_joined_df['Employee Id (df1)'])
outer_joined_df['Name (df1)'] = outer_joined_df['Name (df1)'].combine_first(outer_joined_df['Name (df2)'])
outer_joined_df['Name (df2)'] = outer_joined_df['Name (df2)'].combine_first(outer_joined_df['Name (df1)'])
outer_joined_df['Region (df1)'] = outer_joined_df['Region (df1)'].combine_first(outer_joined_df['Region (df2)'])
outer_joined_df['Region (df2)'] = outer_joined_df['Region (df2)'].combine_first(outer_joined_df['Region (df1)'])
This is the output:
As you can see, it fills the data, but not in the way I want.
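The mixing effect can be reproduced on a small scale. The sketch below uses two hypothetical Series, `name_df1` and `name_df2`, standing in for the two Name columns; `combine_first` copies the other column's spelling into the gaps, so the filled column ends up with both variants:

```python
import pandas as pd

# Hypothetical stand-ins for 'Name (df1)' and 'Name (df2)' after an outer join
name_df1 = pd.Series(["Nick M.", None, "Nick M."])
name_df2 = pd.Series(["Nick Mason", "Nick Mason", None])

# Nulls in name_df1 are filled with whatever name_df2 holds in the same row
filled_df1 = name_df1.combine_first(name_df2)

print(filled_df1.tolist())  # ['Nick M.', 'Nick Mason', 'Nick M.']
```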
The output I need:
# a list with all column names, minus `(dfx)`
columns = ["Date", "Order Id", "Employee Id", "Name", "Region", "Value"]

# create a dict with a relation between values in df1 and df2, both ways
value_relations = {}
for col in columns:
    relations = (
        outer_joined_df[[f"{col} (df1)", f"{col} (df2)"]]
        .drop_duplicates()
        .dropna()
        .to_dict("tight")
        .get("data")
    )
    value_relations[col] = {k: v for k, v in relations}
    value_relations[col].update({v: k for k, v in relations})

for col in columns:
    # fill values of df1 with the related value of df2
    outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
        outer_joined_df[f"{col} (df2)"].map(value_relations[col])
    )
    # fill values of df2 with the related value of df1
    outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
        outer_joined_df[f"{col} (df1)"].map(value_relations[col])
    )
    # fill remaining null values of df1
    outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
        outer_joined_df[f"{col} (df2)"]
    )
    # fill remaining null values of df2
    outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
        outer_joined_df[f"{col} (df1)"]
    )
   Date (df1)  Order Id (df1)  Employee Id (df1)  Name (df1)  Region (df1)  ...  Order Id (df2)  Employee Id (df2)  Name (df2)  Region (df2)  Value (df2)
0  2024-03-18             1.0              825.0     Nick M.            SD  ...             1.0              825.0  Nick Mason     San Diego         25.0
1  2024-03-18             2.0              825.0     Nick M.            SD  ...             2.0              825.0  Nick Mason     San Diego         37.0
2  2024-03-18             3.0              825.0     Nick M.            SD  ...             3.0              825.0  Nick Mason     San Diego         19.0
3  2024-03-18             4.0              825.0     Nick M.            SD  ...             4.0              825.0  Nick Mason     San Diego         24.0
4  2024-03-18             5.0              825.0     Nick M.            SD  ...             5.0              825.0  Nick Mason     San Diego         56.0
5  2024-03-19             1.0              825.0     Nick M.            SD  ...             1.0              825.0  Nick Mason     San Diego         22.0
6  2024-03-19             2.0              825.0     Nick M.            SD  ...             2.0              825.0  Nick Mason     San Diego         17.0
7  2024-03-19             3.0              825.0     Nick M.            SD  ...             3.0              825.0  Nick Mason     San Diego          9.0
8  2024-03-19             4.0              825.0     Nick M.            SD  ...             4.0              825.0  Nick Mason     San Diego         76.0

[9 rows x 12 columns]
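One caveat worth checking: a plain dictionary keeps only one partner per value, so the relation mapping is unambiguous only when values pair up one-to-one (as Name and Region do here). For a column like Value, where the same number can be paired with different numbers in different rows, later pairs silently overwrite earlier ones. The sketch below uses a hypothetical helper, `ambiguous_pairs`, and a made-up miniature frame `demo` to list the pairs that are not one-to-one before relying on the mapping:

```python
import pandas as pd


def ambiguous_pairs(frame: pd.DataFrame, left_col: str, right_col: str) -> pd.DataFrame:
    """Return the distinct (left, right) pairs in which a value has more than one partner."""
    pairs = frame[[left_col, right_col]].dropna().drop_duplicates()
    # A value is ambiguous if it shows up in more than one distinct pair
    left_dup = pairs[left_col].duplicated(keep=False)
    right_dup = pairs[right_col].duplicated(keep=False)
    return pairs[left_dup | right_dup]


# Hypothetical miniature of a merged frame (not the real output above)
demo = pd.DataFrame({
    "Name (df1)": ["Nick M.", "Nick M.", None],
    "Name (df2)": ["Nick Mason", "Nick Mason", "Nick Mason"],
    "Value (df1)": [25, 25, 18],
    "Value (df2)": [25, 17, 19],
})

# Name pairs are one-to-one, so nothing is reported
print(ambiguous_pairs(demo, "Name (df1)", "Name (df2)"))
# Value 25 on the df1 side is paired with both 25 and 17 on the df2 side
print(ambiguous_pairs(demo, "Value (df1)", "Value (df2)"))
```

If a column shows up as ambiguous, it may be safer to leave it out of the `columns` list for the mapping step and fill it only with the plain same-row fallback.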