Complete newbie here. I'll try to explain this as clearly as I can :)
So, I've been given this log .csv file, see below.
created_at user_id user_email article_id key value
2023-12-05T20:04:45.088Z 111111 [email protected] 1 included -1
2023-12-05T20:05:32.964Z 111111 [email protected] 2 included -1
2023-12-05T20:06:31.980Z 111111 [email protected] 3 included -1
2023-12-05T20:06:33.730Z 111111 [email protected] 3 included -1
2023-12-05T20:06:36.387Z 111111 [email protected] 3 included 1
2023-12-05T20:06:38.621Z 111111 [email protected] 3 included 1
2023-12-05T20:06:56.200Z 111111 [email protected] 3 included -1
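There are a few things I'd like to solve:
data_last = df.drop_duplicates(subset=['article_id'], keep='last')
But no luck (the code just leaves me with seemingly random rows, e.g. around 28-29). This needs to be done across the whole file (about 18,000 rows).
Thanks in advance, everyone!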
If you want to keep the last row in order of creation, sort by created_at and then call drop_duplicates with keep="last":
# Parse the timestamps so that sorting and the .dt accessor work
df["created_at"] = pd.to_datetime(df["created_at"])
# Sort by date and keep the last row for each article_id
df = df.sort_values("created_at")
df = df.drop_duplicates("article_id", keep="last")
# Convert dates to format YYYY-MM-DD (%m is month; %M would be minutes)
df["created_at"] = df["created_at"].dt.strftime("%Y-%m-%d")
Result:
created_at user_id user_email article_id key value
0 2023-12-05 111111 [email protected] 1 included -1
1 2023-12-05 111111 [email protected] 2 included -1
6 2023-12-05 111111 [email protected] 3 included -1
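For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same approach. It assumes a comma-separated layout, and it leaves out the user_email column because the addresses are redacted in the question. The helper name keep_latest_per_article and the inline SAMPLE string are only for illustration; with the real file you would load it with pd.read_csv instead.

```python
import io

import pandas as pd

# Sample rows taken from the question (user_email omitted, commas assumed).
SAMPLE = """created_at,user_id,article_id,key,value
2023-12-05T20:04:45.088Z,111111,1,included,-1
2023-12-05T20:05:32.964Z,111111,2,included,-1
2023-12-05T20:06:31.980Z,111111,3,included,-1
2023-12-05T20:06:33.730Z,111111,3,included,-1
2023-12-05T20:06:36.387Z,111111,3,included,1
2023-12-05T20:06:38.621Z,111111,3,included,1
2023-12-05T20:06:56.200Z,111111,3,included,-1
"""


def keep_latest_per_article(df: pd.DataFrame) -> pd.DataFrame:
    """Return the most recent row for each article_id (hypothetical helper)."""
    out = df.copy()
    # Parse ISO 8601 strings into timezone-aware timestamps.
    out["created_at"] = pd.to_datetime(out["created_at"], utc=True)
    # Sort chronologically, then keep the last (newest) row per article.
    out = out.sort_values("created_at", kind="stable")
    out = out.drop_duplicates(subset="article_id", keep="last")
    # Format as YYYY-MM-DD; note the lowercase %m for month.
    out["created_at"] = out["created_at"].dt.strftime("%Y-%m-%d")
    return out


df = pd.read_csv(io.StringIO(SAMPLE))
latest = keep_latest_per_article(df)
print(latest)
```

Sorting first matters because drop_duplicates only looks at row position, so keep="last" means "last in the current row order", not "latest timestamp". If the file is not already in chronological order, skipping the sort can leave you with rows that look arbitrary.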