I'm very new to Python and am working through a problem.
# First Dataframe
df1 = pd.DataFrame({
    'name': ['Adam', 'Ashley', 'Adam', 'Don'],
    'items': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Quantity': [10, 15, 20, 25]
})
# second dataframe
df2 = pd.DataFrame({
    'name': ['Adam', 'Ashley', 'Adam', 'Sunny'],
    'items': ['Apple', 'Banana', 'Scale', 'Pickle'],
    'Quantity': [11, 10, 15, 20]
})
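The two dataframes above have some matching values and some that differ.
I want to combine the two into a single dataframe, as follows:
1. Loop over both dataframes and collect the matching values for each name in one dataframe. For example, Adam has two entries.
2. Then loop through to see what the problem is, e.g. an item mismatch or a quantity mismatch, and fill a new column "Reason" with it. For the leftovers, I just need to add "not available in df1/df2".
3. I want to leave one empty row after each iteration of loop (1), i.e. after each name is done.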
# Result
df3 = pd.DataFrame({
    'name': ['Adam', 'Adam', None, 'Ashley', None, 'Don', None, None],
    'items': ['Apple', 'Cherry', None, 'Banana', None, 'Date', None, None],
    'Quantity': [10, 15, None, 20, None, 25, None, None],
    'name_2': ['Adam', 'Adam', None, 'Ashley', None, None, None, 'Sunny'],
    'items_2': ['Apple', 'Scale', None, 'Banana', None, None, None, 'Pickle'],
    'Quantity_2': [11, 10, None, 15, None, None, None, 20],
    'Reason': ['Quantity mismatch', 'Item mismatch', None, 'Quantity mismatch', None, 'Does not exist in df2', None, 'does not exist in df1']
})
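I would greatly appreciate any help with this. Thanks in advance!
I pieced these lines together from multiple sources and, of course, it doesn't work and throws all sorts of errors.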
new_df = pd.DataFrame()
for item in df1["name"]:
    idx = df2[df2["name"].eq(item)].min()
    idx2 = df1[df1["name"].eq(item)].min()
    new_df = new_df.append(df1[idx2])
    new_df = new_df.append(df2[idx])
    for i in idx():
        if df2["name"][i] in df1["name"]:
            if df2["item"][i] in df1[item]:
                new_df["Reason"][i] = "Quantity Mismatch"
            else:
                new_df["Reason"][i] = "Item Mismatch"
        else:
            new_df["Reason"][i] = "Does not exist in df1"
This is definitely not the cleanest code I've ever written, but maybe it helps, so I'm sharing it:
import pandas as pd
# First Dataframe
df1 = pd.DataFrame({
    'name': ['Adam', 'Ashley', 'Adam', 'Don'],
    'items': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Quantity': [10, 15, 20, 25]
})
# second dataframe
df2 = pd.DataFrame({
    'name': ['Adam', 'Ashley', 'Adam', 'Sunny'],
    'items': ['Apple', 'Banana', 'Scale', 'Pickle'],
    'Quantity': [11, 10, 15, 20]
})
def create_result(
    df1: pd.DataFrame,
    df2: pd.DataFrame,
    df2_columns: list
) -> pd.DataFrame:
    new_df = pd.DataFrame(columns=[*df1.columns, *df2_columns, "Reason"])
    # reset_index() puts an 'index' column first; row.values[1:] below skips it
    df1 = df1.reset_index()
    df1 = df1.sort_values(by=['name'])
    prev_name = None
    for idx, row in df1.iterrows():
        # Add an empty separator row whenever the name changes
        if prev_name is not None and prev_name != row['name']:
            new_df.loc[len(new_df)] = pd.Series(dtype='float64')
        prev_name = row['name']
        # All rows of df2 with the same name
        rwsn = df2.loc[df2['name'] == row['name']]
        if rwsn.empty:
            new_df.loc[len(new_df)] = [*row.values[1:], None, None, None, "Name not found in the second dataframe."]
        else:
            # Of those, the rows with the same item
            rwsi = rwsn.loc[rwsn['items'] == row['items']]
            if rwsi.empty:
                # Here comes the problem because of the unclear identification, which value should I put into the items_2 column?
                # .iloc[0] takes the first match by position; the label 0 may not be present in rwsn
                new_df.loc[len(new_df)] = [*row.values[1:], rwsn['name'].iloc[0], None, None, "Item mismatch."]
            elif row['Quantity'] != rwsi['Quantity'].values[0]:
                new_df.loc[len(new_df)] = [*row.values[1:], *rwsi.values.tolist()[0], "Quantity Mismatch."]
            else:
                pass  # What should happen if they are identical?
    return new_df
print(create_result(
    df1=df1,
    df2=df2,
    df2_columns=['name_2', 'items_2', 'Quantity_2']
))
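Please note that this code only takes values from the first table (like a left join), which means that the row for Sunny in the given example will not appear in the output. Then again, I'm not sure whether showing it would really be useful.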
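If those rows are wanted after all, one possible way to get them is sketched below. It is separate from the function above and assumes that "missing" simply means the name never occurs in df1; the variable names `only_in_df2` and `extra` are made up for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['Adam', 'Ashley', 'Adam', 'Don'],
    'items': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Quantity': [10, 15, 20, 25]
})
df2 = pd.DataFrame({
    'name': ['Adam', 'Ashley', 'Adam', 'Sunny'],
    'items': ['Apple', 'Banana', 'Scale', 'Pickle'],
    'Quantity': [11, 10, 15, 20]
})

# Rows of df2 whose name never occurs in df1 (an anti-join on 'name')
only_in_df2 = df2[~df2['name'].isin(df1['name'])]

# Rename to the *_2 column names used in the question and attach the reason
extra = only_in_df2.rename(columns={
    'name': 'name_2',
    'items': 'items_2',
    'Quantity': 'Quantity_2',
}).assign(Reason='Does not exist in df1')

print(extra)
```

These rows could then be appended to the result with `pd.concat`, which fills the df1 columns with NaN for them.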
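Given the constraints you've described, I'm not sure this problem can be solved in a general way, as @Celius Stingher mentioned.
You would have to change the data structure so that the records to be compared are clearly identified. My solution also assumes that the columns are fairly rigid and follow the naming convention you proposed, and I can imagine this becoming a problem if the columns change.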
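To illustrate what such an identifier could look like, here is a rough sketch. It assumes that the n-th row for a name in df1 is meant to be compared with the n-th row for that name in df2, which the question does not confirm. The helper column `pos` and the function `reason` are hypothetical names introduced only for this example, and the empty separator rows are left out.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['Adam', 'Ashley', 'Adam', 'Don'],
    'items': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Quantity': [10, 15, 20, 25]
})
df2 = pd.DataFrame({
    'name': ['Adam', 'Ashley', 'Adam', 'Sunny'],
    'items': ['Apple', 'Banana', 'Scale', 'Pickle'],
    'Quantity': [11, 10, 15, 20]
})

# Assumed identifier: position of each row within its name group (0, 1, ...)
left = df1.assign(pos=df1.groupby('name').cumcount())
right = df2.assign(pos=df2.groupby('name').cumcount())

# Outer merge keeps rows from both sides; '_merge' records where each row came from
merged = left.merge(right, on=['name', 'pos'], how='outer',
                    suffixes=('', '_2'), indicator=True)

def reason(row):
    # Check for missing partners first, so NaN values are never compared
    if row['_merge'] == 'left_only':
        return 'Does not exist in df2'
    if row['_merge'] == 'right_only':
        return 'Does not exist in df1'
    if row['items'] != row['items_2']:
        return 'Item mismatch'
    if row['Quantity'] != row['Quantity_2']:
        return 'Quantity mismatch'
    return 'Match'

merged['Reason'] = merged.apply(reason, axis=1)
result = (merged.sort_values(['name', 'pos'])
                .drop(columns=['pos', '_merge'])
                .reset_index(drop=True))
print(result)
```

Because `name` is a merge key here, there is a single `name` column rather than separate `name` and `name_2` columns. If the pairing by position is not what you want, the key columns passed to `on` are the place to change it.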