I'm trying to remove duplicates from a DataFrame using the following logic:
import pandas as pd
df = pd.DataFrame({
    'id_SAP_transaction': [1, 2, 2, 3, 3],
    'checkout_security': ['2023-12-15', pd.NaT, '2023-12-01', pd.NaT, '2023-11-30'],
    'nopol': ['AA123', 'BB456', 'CC789', 'DD101', 'EE234'],
})
processed_df = process_dataframe(df.copy())
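What I want: if an id_SAP_transaction is duplicated and its checkout_security values contain both NaT and a date, keep the row with the date; if checkout_security contains only NaT, pick any one of those rows; if checkout_security contains only dates, pick the latest date. Here is my code: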
def process_dataframe(df, no_date_value=pd.NaT):
    """Processes the DataFrame for duplicate removal and checkout_security handling.
    Args:
        df (pandas.DataFrame): The DataFrame to process.
        no_date_value (pd.NaT): Value representing a missing checkout date.
    Returns:
        pandas.DataFrame: The processed DataFrame.
    """
    try:
        # Ensure 'checkout_security' is datetime
        df['checkout_security'] = pd.to_datetime(
            df['checkout_security'], errors='coerce')
        # Sort by checkout_security (descending, NaT last)
        df = df.sort_values(by='checkout_security',
                            ascending=False, na_position='last')
        def handle_duplicates(group):
            # If there's at least one valid date
            if not group['checkout_security'] is pd.NaT:
                print(group)
                idx = group.index[0]
                return group.loc[idx]  # Return the row using the index
            else:
                first_row = group.iloc[0]
                return first_row
        # Remove duplicates based on 'id_SAP_transaction'
        df = df.drop_duplicates(subset='id_SAP_transaction', keep='last').apply(
            handle_duplicates, axis=1)
        return df
    except (ValueError, KeyError) as e:
        print(f"Error during DataFrame processing: {e}")
        return df  # Optionally, return partially processed DF
    except Exception as e:
        print(f"An error occurred: {type(e).__name__} - {e}")
        return df  # Optionally, return partially processed DF
Instead of getting the expected result below:
0  1  2023-12-15  AA123
2  2  2023-12-01  CC789
4  3  2023-11-30  EE234
I got this:
0  1  2023-12-15  AA123
2  2  2023-12-01  CC789
4  3  2023-11-30  EE234
1  2  NaT         BB456
3  3  NaT         DD101
Can anyone help with this?
Please try this:
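# Assumes checkout_security has already been converted with pd.to_datetime
# Step 1. Sort by id_SAP_transaction (ascending) and checkout_security (descending),
# so the latest date comes first within each id; NaT is placed last by default
df = df.sort_values(['id_SAP_transaction', 'checkout_security'], ascending=[True, False])
# Step 2. Drop duplicates, keeping the first row of each id_SAP_transaction
df_temp = df.drop_duplicates('id_SAP_transaction')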
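For reference, here is a minimal end-to-end sketch of the sort-then-drop approach, run against the sample data from the question. The function name `dedupe_latest_checkout` is hypothetical. The sketch assumes that "latest date" means the dates should be sorted in descending order within each id, and it relies on `na_position='last'` to keep NaT rows behind any real date.

```python
import pandas as pd


def dedupe_latest_checkout(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per id_SAP_transaction, preferring the latest checkout date.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.copy()
    # Parse to datetime so values compare chronologically;
    # anything unparseable becomes NaT.
    out["checkout_security"] = pd.to_datetime(
        out["checkout_security"], errors="coerce"
    )
    # Ascending ids, descending dates. NaT is placed last within each id.
    out = out.sort_values(
        ["id_SAP_transaction", "checkout_security"],
        ascending=[True, False],
        na_position="last",
    )
    # The first row of each id is now its latest date,
    # or a NaT row if that id has no dates at all.
    return out.drop_duplicates("id_SAP_transaction", keep="first")


df = pd.DataFrame({
    "id_SAP_transaction": [1, 2, 2, 3, 3],
    "checkout_security": ["2023-12-15", pd.NaT, "2023-12-01", pd.NaT, "2023-11-30"],
    "nopol": ["AA123", "BB456", "CC789", "DD101", "EE234"],
})
result = dedupe_latest_checkout(df)
print(result)
```

On the sample data this should leave the rows with original index 0, 2 and 4 (AA123, CC789, EE234), which matches the expected result in the question. An id whose rows are all NaT still yields exactly one row.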