Removing duplicates in pandas based on a condition


I'm trying to remove duplicates from a DataFrame using the following logic:

df = pd.DataFrame({'id_SAP_transaction': [1, 2, 2, 3, 3], 'checkout_security': ['2023-12-15', pd.NaT, '2023-12-01', pd.NaT, '2023-11-30'], 'nopol': ['AA123', 'BB456', 'CC789', 'DD101', 'EE234']})

processed_df = process_dataframe(df.copy())

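What I want: if an id_SAP_transaction is duplicated and its checkout_security values include both NaT and a date, keep the row with the date; if its checkout_security values are all NaT, keep any one of those rows; if its checkout_security values are all dates, keep the row with the latest date. Here is my code: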

def process_dataframe(df, no_date_value=pd.NaT):
    """Processes the DataFrame for duplicate removal and checkout_security handling.

    Args:
        df (pandas.DataFrame): The DataFrame to process.
        no_date_value (pd.NaT): Value representing a missing checkout date.

    Returns:
        pandas.DataFrame: The processed DataFrame.
    """

    try:
        # Ensure 'checkout_security' is datetime
        df['checkout_security'] = pd.to_datetime(
            df['checkout_security'], errors='coerce')

        # Sort by checkout_security (descending, NaT last)
        df = df.sort_values(by='checkout_security',
                            ascending=False, na_position='last')

        def handle_duplicates(group):
            # If there's at least one valid date
            if not group['checkout_security'] is pd.NaT:
                print(group)
                idx = group.index[0]
                return group.loc[idx]  # Return the row using the index

            else:
                first_row = group.iloc[0]
                return first_row
        # Remove duplicates based on 'id_SAP_transaction'
        df = df.drop_duplicates(subset='id_SAP_transaction', keep='last').apply(
            handle_duplicates, axis=1)

        return df

    except (ValueError, KeyError) as e:
        print(f"Error during DataFrame processing: {e}")
        return df  # Optionally, return partially processed DF
    except Exception as e:
        print(f"An error occurred: {type(e).__name__} - {e}")
        return df  # Optionally, return partially processed DF

Instead of getting the expected result below:

0  1  2023-12-15  AA123
2  2  2023-12-01  CC789
4  3  2023-11-30  EE234

I get this:

0  1  2023-12-15  AA123
2  2  2023-12-01  CC789
4  3  2023-11-30  EE234
1  2  NaT         BB456
3  3  NaT         DD101

Any help with this?

python pandas dataframe
1 Answer

Try this:

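# Step 0. Make sure checkout_security is a datetime column (missing dates become NaT)
df['checkout_security'] = pd.to_datetime(df['checkout_security'], errors='coerce')

# Step 1. Sort by id_SAP_transaction, then by checkout_security with the latest date first.
# NaT rows are placed last within each id (na_position='last').
df = df.sort_values(['id_SAP_transaction', 'checkout_security'], ascending=[True, False], na_position='last')

# Step 2. Drop duplicates, keeping the first row per id (the latest date, or a NaT row if there is no date)
df_temp = df.drop_duplicates('id_SAP_transaction', keep='first')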
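
For reference, here is a minimal self-contained sketch of the sort-then-drop approach, run on the sample data from the question. The helper name keep_best_row_per_id is hypothetical and used only for illustration. It assumes "latest date wins, NaT only if nothing else exists" is the intended rule, so it sorts dates in descending order within each id and relies on na_position='last' to push NaT rows to the end of each group.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id_SAP_transaction': [1, 2, 2, 3, 3],
    'checkout_security': ['2023-12-15', pd.NaT, '2023-12-01', pd.NaT, '2023-11-30'],
    'nopol': ['AA123', 'BB456', 'CC789', 'DD101', 'EE234'],
})


def keep_best_row_per_id(frame):
    """Hypothetical helper: keep one row per id_SAP_transaction.

    A row with a date is preferred over a NaT row, and the latest date
    is preferred among dated rows.
    """
    out = frame.copy()
    # Convert to datetime so dates compare chronologically; bad values become NaT
    out['checkout_security'] = pd.to_datetime(out['checkout_security'], errors='coerce')
    # Latest date first within each id; NaT rows go to the end of each id group
    out = out.sort_values(
        ['id_SAP_transaction', 'checkout_security'],
        ascending=[True, False],
        na_position='last',
    )
    # The first row of each id is now the one to keep
    return out.drop_duplicates('id_SAP_transaction', keep='first')


result = keep_best_row_per_id(df)
print(result)
```

On the sample data this should keep the rows with index 0, 2 and 4, which matches the expected result in the question. Note that the original attempt used apply(..., axis=1), which passes one row at a time to handle_duplicates rather than one group per id, so it cannot choose between rows that share the same id.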