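I have two tables, A and B, each with three columns: Name, Date and Quantity. I added a column D that concatenates Name, Date and Quantity. I want to look up column D of table A in table B. If there is a match, column E should say "Yes"; if there is no match, "No".
If column E says "No", I want to determine, given column D (the concatenation), whether Name, Date and/or Quantity is the reason for the mismatch. For example, if the name does not match, column F should return "Not match name"; otherwise it should return "Match name".
The problem I have now is that the output for Name (match or not match) is correct, but the output for Date and Quantity is not. I think the main cause is the one-to-many and many-to-many relationships, because values in Name, Date and Quantity repeat many times.
My code is inconsistent because I keep modifying it whenever the output is wrong, especially for Date and Quantity. This is what I have tried so far: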
# Concatenate the 3 columns
df2_A = df1_A.copy()
df2_A.loc[:, 'A_Concate'] = df2_A['Name'].astype(str) + df2_A['Date'].astype(str) + df2_A['Quantity'].astype(str)
df2_B = df1_B.copy()
df2_B.loc[:, 'B_Concate'] = df2_B['Name'].astype(str) + df2_B['Date'].astype(str) + df2_B['Quantity'].astype(int).astype(str)
# Look up the concatenated column of table A in table B (VLOOKUP-style)
df2_A['Match with B?'] = df2_A['A_Concate'].isin(df2_B['B_Concate']).map({True: 'Yes', False: 'No'})
# Find the reason for a mismatch
df2_A['Match name?'] = df2_A.apply(lambda row: 'Not match name' if row['Match with B?'] == 'No' and row['Name'] not in df2_B['Name'].unique() else 'Match name', axis=1)
df2_A['Match date?'] = df2_A.apply(lambda row: 'Match date' if row['Match with B?'] == 'Yes' else ('Not match date' if row['Date'] not in df2_B.loc[df2_B['B_Concate'] == row['A_Concate'], 'Date'].values else 'Match date'), axis=1)
df2_A['Match quantity?'] = df2_A.apply(lambda row: 'Not match quantity' if row['Match with B?'] == 'No' and row['Match name?'] == 'Not match name' else ('Not match quantity' if row['Match with B?'] == 'No' and row['Quantity'] not in df2_B['Quantity'].unique() else 'Match quantity'), axis=1)
Which part can be improved so that the output is returned based on the concatenated row?
A possible solution with merge:
import numpy as np
import pandas as pd

# Row-level check: a left merge on all columns marks rows of df2_A that also exist in df2_B
out = (pd.merge(df2_A, df2_B, on=list(df2_A.columns), how="left", indicator="Match with B?")
         .replace({"Match with B?": {"both": "Yes", "left_only": "No"}}))

# Column-level check: list the columns whose value is not found in the same column of df2_B
out["Why ?"] = (pd.concat([pd.merge(df2_A[[col]].drop_duplicates(), df2_B[[col]].drop_duplicates(),
                                    on=col, how="left", indicator=f"check_{i}")
                           for i, col in enumerate(df2_A.columns)], axis=1).filter(like="check")
                  .set_axis(df2_A.columns, axis=1).replace({"both": True, "left_only": False})
                  .apply(lambda x: np.where(x.eq(False), x.name, None)).stack().groupby(level=0).agg(list)
               )
Output:
print(out)
  Name        Date  Quantity Match with B?                   Why ?
0  foo  2023-02-11         1            No        [Date, Quantity]
1  bar  2023-03-22         2           Yes                     NaN
2  baz  2023-01-05         3            No  [Name, Date, Quantity]
3  qux  2023-04-18         4            No            [Name, Date]
4  bar  2023-05-01         5            No                  [Date]
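If df2_B contains repeated rows, a left merge on all columns returns one output row per matching pair, so rows of df2_A can be multiplied. The following is a minimal sketch, not the method above, that deduplicates the key combinations of df2_B first so that the result has exactly one flag per row of df2_A. The helper name flag_full_row_matches and the extra duplicated "bar" row in df2_B are assumptions made for illustration.

```python
import pandas as pd


def flag_full_row_matches(a, b, cols):
    """Return 'Yes'/'No' per row of `a`: does the same combination of `cols` exist in `b`?"""
    # Deduplicate b's key combinations so the left merge cannot multiply rows of a.
    b_keys = b[cols].drop_duplicates()
    merged = a[cols].merge(b_keys, on=cols, how="left", indicator=True)
    # With unique right-hand keys, merged has one row per row of a, in a's order.
    flags = (merged["_merge"] == "both").map({True: "Yes", False: "No"})
    flags.index = a.index
    return flags


df2_A = pd.DataFrame({
    "Name": ["foo", "bar", "baz", "qux", "bar"],
    "Date": ["2023-02-11", "2023-03-22", "2023-01-05", "2023-04-18", "2023-05-01"],
    "Quantity": [1, 2, 3, 4, 5],
})
# Same sample as in this answer, plus one repeated row (hypothetical) to show the effect.
df2_B = pd.DataFrame({
    "Name": ["foo", "xyz", "foo", "bar", "bar"],
    "Date": ["2023-02-30", "2023-02-25", "2023-03-10", "2023-03-22", "2023-03-22"],
    "Quantity": [5, 4, 6, 2, 2],
})

flags = flag_full_row_matches(df2_A, df2_B, ["Name", "Date", "Quantity"])
print(df2_A.assign(**{"Match with B?": flags}))
```

Merging on the real columns instead of a concatenated string also avoids accidental collisions between different value combinations that happen to produce the same concatenation.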
If you want a separate match check for each column, use:
tmp = (pd.merge(df2_A, df2_B, on=list(df2_A.columns), how="left", indicator="Match with B?")
         .replace({"Match with B?": {"both": "Yes", "left_only": "No"}}))

out = tmp.join(pd.concat([pd.merge(df2_A[[col]].drop_duplicates(), df2_B[[col]].drop_duplicates(),
                                   on=col, how="left", indicator=f"check_{i}")
                          for i, col in enumerate(df2_A.columns)], axis=1).filter(like="check")
                 .set_axis(df2_A.columns, axis=1).replace({"both": True, "left_only": False})
                 .add_prefix("Match ").add_suffix(" ?").replace({True: "Yes", False: "No"}).fillna("Yes")
              )
Output:
print(out)
  Name        Date  Quantity Match with B? Match Name ? Match Date ? Match Quantity ?
0  foo  2023-02-11         1            No          Yes           No               No
1  bar  2023-03-22         2           Yes          Yes          Yes              Yes
2  baz  2023-01-05         3            No           No           No               No
3  qux  2023-04-18         4            No           No           No              Yes
4  bar  2023-05-01         5            No          Yes           No              Yes
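Since the underlying issue is repeated values, it may help to compute the per-column checks with Series.isin, which returns one boolean per original row and therefore stays aligned with the index of df2_A regardless of duplicates. This is a sketch under that assumption, reusing the sample data from this answer; the names cols, per_col and why are illustrative. Like the code above, it only checks whether each value occurs anywhere in the corresponding column of df2_B.

```python
import pandas as pd

df2_A = pd.DataFrame({
    "Name": ["foo", "bar", "baz", "qux", "bar"],
    "Date": ["2023-02-11", "2023-03-22", "2023-01-05", "2023-04-18", "2023-05-01"],
    "Quantity": [1, 2, 3, 4, 5],
})
df2_B = pd.DataFrame({
    "Name": ["foo", "xyz", "foo", "bar"],
    "Date": ["2023-02-30", "2023-02-25", "2023-03-10", "2023-03-22"],
    "Quantity": [5, 4, 6, 2],
})

cols = ["Name", "Date", "Quantity"]

# One boolean per row and column: is this value present in the same column of df2_B?
per_col = pd.DataFrame({col: df2_A[col].isin(df2_B[col]) for col in cols})

# Yes/No columns named like the ones in the answer above.
labels = (per_col.apply(lambda s: s.map({True: "Yes", False: "No"}))
                 .add_prefix("Match ").add_suffix(" ?"))
out = df2_A.join(labels)

# For each row, collect the columns whose value was not found in df2_B.
why = [[c for c, found in zip(cols, row) if not found] for row in per_col.to_numpy()]
out["Why ?"] = pd.Series(why, index=out.index)

print(out)
```

Rows whose values are all found get an empty list in "Why ?" rather than NaN.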
Input used:
df2_A = pd.DataFrame({
    "Name": ["foo", "bar", "baz", "qux", "bar"],
    "Date": ["2023-02-11", "2023-03-22", "2023-01-05", "2023-04-18", "2023-05-01"],
    "Quantity": [1, 2, 3, 4, 5]
})
df2_B = pd.DataFrame({
    "Name": ["foo", "xyz", "foo", "bar"],
    "Date": ["2023-02-30", "2023-02-25", "2023-03-10", "2023-03-22"],
    "Quantity": [5, 4, 6, 2]
})