Is there an efficient way to compare the values of every column between two DataFrames or Series?

Question (votes: 0, answers: 4)

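I have two DataFrames, and I'm trying to find the best way to iterate over each row of df_a and check whether any value differs from the corresponding row in df_b. If even one value differs, I want to treat the rows as different.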

Example:

df_a

import pandas as pd

df_a = pd.DataFrame({'ID':['E1', 'E2', 'E3'], 
                     'NAME': ['John', 'Jane', 'Steve'], 
                     'ROLE': ['Analyst', 'Manager', 'Intern'], 
                     'LOCATION': ['San Francisco', 'New York City', 'Houston']})

    ID  NAME    ROLE      LOCATION
0   E1  John    Analyst   San Francisco
1   E2  Jane    Manager   New York City
2   E3  Steve   Intern    Houston

df_b

df_b = pd.DataFrame({'ID':['E1', 'E2', 'E3'], 
                     'NAME': ['John', 'Jane', 'Steve'], 
                     'ROLE': ['Analyst', 'Manager', 'Analyst'], 
                     'LOCATION': ['San Francisco', 'Chicago', 'Houston']})

    ID  NAME    ROLE      LOCATION
0   E1  John    Analyst   San Francisco
1   E2  Jane    Manager   Chicago
2   E3  Steve   Analyst   Houston

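In the two DataFrames above, I want to capture the fact that E2 and E3 have changed, so that I can pass them on to the rest of my code as "updated" rows.

My current approach is brute force, and it is slow on larger datasets. I'm curious whether there is a more efficient or more elegant way than explicitly iterating over every row and column. I should note that my actual data contains several columns with free-text fields, so I'm not sure whether that might be the source of the slowness.

Current approach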

updated_rows = []

for ix, a_row in df_a.iterrows():
    # get the matching row from df_b
    b_row = df_b[df_b['ID'] == a_row['ID']].iloc[0]
    for column in a_row.index:
        # check the column exists in df_b
        if column in b_row.index:
            # check if the values are the same
            if a_row[column] != b_row[column]:
                # if anything is different, capture the row
                updated_rows.append(a_row)
                break  # stop checking this row: we already confirmed that something has changed
        else:
            # if the column does not exist in df_b, then it must be a new field
            updated_rows.append(a_row)
            break

# build the result once at the end (DataFrame.append was removed in pandas 2.0)
df_updates = pd.DataFrame(updated_rows, columns=df_a.columns).reset_index(drop=True)

This code produces the following result:

    ID  NAME    ROLE      LOCATION
0   E2  Jane    Manager   New York City
1   E3  Steve   Intern    Houston

python pandas dataframe for-loop series
4 Answers

Votes: 1

You can use pandas.DataFrame.merge:

df_merge = df_a.merge(df_b, on=df_a.columns.tolist(), how='left', indicator=True)
df_merge[df_merge['_merge'] == 'left_only'].drop(columns=["_merge"])

    ID  NAME    ROLE      LOCATION
1   E2  Jane    Manager   New York City
2   E3  Steve   Intern    Houston
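The merge-with-indicator idea can be wrapped into a small self-contained sketch. The helper name `rows_without_exact_match` is hypothetical, and the `drop_duplicates()` call is an added assumption (it keeps duplicate rows in the right-hand frame from multiplying rows in the result); neither comes from the answer itself.

```python
import pandas as pd

df_a = pd.DataFrame({'ID': ['E1', 'E2', 'E3'],
                     'NAME': ['John', 'Jane', 'Steve'],
                     'ROLE': ['Analyst', 'Manager', 'Intern'],
                     'LOCATION': ['San Francisco', 'New York City', 'Houston']})
df_b = pd.DataFrame({'ID': ['E1', 'E2', 'E3'],
                     'NAME': ['John', 'Jane', 'Steve'],
                     'ROLE': ['Analyst', 'Manager', 'Analyst'],
                     'LOCATION': ['San Francisco', 'Chicago', 'Houston']})


def rows_without_exact_match(left, right):
    """Hypothetical helper: rows of `left` with no identical row in `right`."""
    cols = left.columns.tolist()
    # Left-merge on every column; `indicator=True` adds a `_merge` column
    # whose value is 'left_only' for rows that found no match in `right`.
    merged = left.merge(right[cols].drop_duplicates(),
                        on=cols, how='left', indicator=True)
    return merged.loc[merged['_merge'] == 'left_only', cols]


updated = rows_without_exact_match(df_a, df_b)
print(updated)
```

Note that this matches a row against any identical row in the other frame, not only the row with the same position, so it relies on a key such as ID being among the compared columns.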


Votes: 0

Concatenate the values in each row of df_a and df_b into a single string, then compare the strings:

rows_a = df_a.iloc[:,1].str.cat(df_a.iloc[:,2:], sep=',')
rows_b = df_b.iloc[:,1].str.cat(df_b.iloc[:,2:], sep=',')
result = df_a.loc[rows_a != rows_b]
print(result)

    ID  NAME    ROLE      LOCATION
1   E2  Jane    Manager   New York City
2   E3  Steve   Intern    Houston
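One caveat may matter for the free-text columns mentioned in the question: if a field can contain the separator, two different rows can join to the same string. The sketch below uses made-up data (the `left` and `right` frames are illustrative, not from the thread) to show the collision, alongside an element-wise comparison that is not affected.

```python
import pandas as pd

# Illustrative data only: NAME and ROLE differ between the two frames,
# but joining them with ',' yields the same string in both.
left = pd.DataFrame({'ID': ['E1'], 'NAME': ['Doe,John'], 'ROLE': ['Analyst']})
right = pd.DataFrame({'ID': ['E1'], 'NAME': ['Doe'], 'ROLE': ['John,Analyst']})

rows_left = left.iloc[:, 1].str.cat(left.iloc[:, 2:], sep=',')
rows_right = right.iloc[:, 1].str.cat(right.iloc[:, 2:], sep=',')

# Comparing the joined strings misses the change ...
joined_differs = bool((rows_left != rows_right).any())

# ... while comparing the columns element by element catches it.
elementwise_differs = bool(left.iloc[:, 1:].ne(right.iloc[:, 1:]).any(axis=1).any())

print(joined_differs, elementwise_differs)
```

Choosing a separator that cannot occur in the data would avoid the collision, at the cost of having to know the data well enough to pick one.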


Votes: 0

Use pandas.DataFrame.set_index and ne:

df_a = df_a.set_index("ID")
df_b = df_b.set_index("ID")
# pass the axis by keyword; the positional form any(1) no longer works in pandas 2.0
print(df_a[df_a.ne(df_b).any(axis=1)].reset_index())

Output:

    ID  LOCATION       NAME    ROLE
0   E2  New York City  Jane    Manager
1   E3  Houston        Steve   Intern
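If the two frames do not list the IDs in the same order, the same set_index and ne approach can still be used by aligning df_b to df_a first. The following is a minimal sketch under that assumption; the shuffled `df_b` and the explicit `reindex` step are additions for illustration and are not part of the answer above.

```python
import pandas as pd

df_a = pd.DataFrame({'ID': ['E1', 'E2', 'E3'],
                     'NAME': ['John', 'Jane', 'Steve'],
                     'ROLE': ['Analyst', 'Manager', 'Intern'],
                     'LOCATION': ['San Francisco', 'New York City', 'Houston']})
# Same content as df_b in the question, but with the rows in a different order.
df_b = pd.DataFrame({'ID': ['E3', 'E1', 'E2'],
                     'NAME': ['Steve', 'John', 'Jane'],
                     'ROLE': ['Analyst', 'Analyst', 'Manager'],
                     'LOCATION': ['Houston', 'San Francisco', 'Chicago']})

a = df_a.set_index('ID')
# Reorder b's rows to follow a's IDs; an ID missing from df_b becomes a
# row of NaN, which compares as "not equal" and is therefore flagged.
b = df_b.set_index('ID').reindex(a.index)

changed_mask = a.ne(b).any(axis=1)   # True where at least one column differs
updated = a[changed_mask].reset_index()
print(updated)
```

Because NaN never compares equal to NaN, rows that hold missing values in both frames would also be flagged by this comparison; whether that is acceptable depends on the data.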


Votes: 0

Use .duplicated and filter for the df_b rows in a new DataFrame called df_c:

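df_a['df_name'], df_b['df_name'] = 'df_a', 'df_b'
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames instead
df_c = pd.concat([df_a, df_b])
# keep rows that appear only once across both frames, then keep the df_b side
df_c = df_c[(~df_c.duplicated(['ID', 'NAME', 'ROLE', 'LOCATION'], keep=False))
            & (df_c['df_name'] == 'df_b')].drop('df_name', axis=1)
df_c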

Output:

    ID  NAME    ROLE      LOCATION
1   E2  Jane    Manager   Chicago
2   E3  Steve   Analyst   Houston
