I have a dataframe and want to write a function that keeps rows or drops duplicates depending on certain conditions.
Original dataframe
year year_month manager_movement email_address
2022 2022_jun transfer_in [email protected]
2022 2022_jun no_change [email protected]
2022 2022_jul no_change [email protected]
2022 2022_jul no_change [email protected]
2022 2022_aug no_change [email protected]
2022 2022_aug no_change [email protected]
2022 2022_sep transfer_out [email protected]
2022 2022_sep no_change [email protected]
2022 2022_oct transfer_in [email protected]
2022 2022_oct no_change [email protected]
2023 2023_jan no_change [email protected]
2023 2023_feb no_change [email protected]
Expected dataframe
year year_month manager_movement email_address
2022 2022_jun transfer_in [email protected]
2022 2022_oct transfer_in [email protected]
2022 2022_oct no_change [email protected]
2023 2023_feb no_change [email protected]
The logic for producing the expected dataframe is as follows:
First: if df['manager_movement'] == 'transfer_out', drop the row.
Second: else if df['manager_movement'] == 'transfer_in', keep all rows.
Third: else if df['manager_movement'] == 'no_change', group by 'year' and 'email_address', drop duplicates, and keep the last row.
This is my attempt, but I can't seem to get the output I want. Any help or comments are appreciated, thanks.
def get_required_rows(x):
    if x['manager_movement'] == 'transfer_out':
        return x.loc[x['manager_movement']!='transfer_out']
    elif x['manager_movement'] == 'transfer_in':
        return x
    elif x['manager_movement'] == 'No Change':
        return x.drop_duplicates(['year','email_address'], keep='last')
end
df_filtered = df.apply(get_required_rows, axis=1)
How about filtering each case separately and concatenating the results:
pd.concat([
df[df["manager_movement"] == "transfer_in"],
df[df["manager_movement"] == "no_change"].drop_duplicates(["year", "email_address"], keep='last')
])
Output:
year year_month manager_movement email_address
0 2022 2022_jun transfer_in [email protected]
8 2022 2022_oct transfer_in [email protected]
4 2022 2022_aug no_change [email protected]
9 2022 2022_oct no_change [email protected]
11 2023 2023_feb no_change [email protected]
(By the way, your desired output doesn't seem to match your stated requirements: it is missing one row with [email protected] and no_change.)
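For anyone who wants to run this end to end, here is a minimal, self-contained sketch of the same filter-and-concatenate idea wrapped in a function. The email addresses are redacted in the post, so the three addresses below are hypothetical placeholders, assigned only so that the result keeps the same row indices as the output above. The trailing sort_index() is an optional addition that puts the rows back in their original order.

```python
import pandas as pd

# Hypothetical placeholder addresses: the real ones are redacted in the post.
A, B, C = "a@example.com", "b@example.com", "c@example.com"

df = pd.DataFrame({
    "year": [2022] * 10 + [2023] * 2,
    "year_month": [
        "2022_jun", "2022_jun", "2022_jul", "2022_jul", "2022_aug", "2022_aug",
        "2022_sep", "2022_sep", "2022_oct", "2022_oct", "2023_jan", "2023_feb",
    ],
    "manager_movement": [
        "transfer_in", "no_change", "no_change", "no_change", "no_change", "no_change",
        "transfer_out", "no_change", "transfer_in", "no_change", "no_change", "no_change",
    ],
    # Assumed assignment of addresses to rows (not taken from the post).
    "email_address": [A, B, A, B, A, B, A, B, C, B, B, B],
})


def get_required_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the three rules to the whole frame at once."""
    # Rule 2: keep every transfer_in row.
    transfer_in = frame[frame["manager_movement"] == "transfer_in"]
    # Rule 3: for no_change rows, keep the last row per (year, email_address).
    no_change = frame[frame["manager_movement"] == "no_change"].drop_duplicates(
        ["year", "email_address"], keep="last"
    )
    # Rule 1: transfer_out rows are in neither piece, so they are dropped.
    # sort_index() restores the original row order (optional).
    return pd.concat([transfer_in, no_change]).sort_index()


result = get_required_rows(df)
print(result)
```

The attempt in the question cannot work as written because df.apply(..., axis=1) passes one row at a time as a Series, so drop_duplicates never sees the other rows it would need to compare against. Filtering the whole dataframe with boolean masks avoids that.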