Removing duplicates from a Python DataFrame with timestamps

Question (votes: 0, answers: 1)

Complete newbie here. I'll try to explain this as clearly as I can :)

So, I've been given this log .csv file, see below.

created_at                  user_id user_email          article_id  key      value
2023-12-05T20:04:45.088Z    111111  [email protected]           1      included   -1
2023-12-05T20:05:32.964Z    111111  [email protected]           2      included   -1
2023-12-05T20:06:31.980Z    111111  [email protected]           3      included   -1
2023-12-05T20:06:33.730Z    111111  [email protected]           3      included   -1
2023-12-05T20:06:36.387Z    111111  [email protected]           3      included    1
2023-12-05T20:06:38.621Z    111111  [email protected]           3      included    1
2023-12-05T20:06:56.200Z    111111  [email protected]           3      included   -1

There are a couple of problems I want to solve:

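  1. See the entries for article_id #3. I need to delete every entry before the last one. I tried applying the following
data_last = df.drop_duplicates(subset=['article_id'], keep='last')

but no luck (the code just left me with seemingly random rows, e.g. around rows 28-29). This needs to be done across the whole file (about 18,000 rows).

  2. I also need to convert the time (keep the "created_at" time as year-month-day only, because I need to select the information from before a specific date).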

Thanks in advance, everyone!

python pandas dataframe timestamp
1 Answer (0 votes)

If you want to keep the last row in order of creation, sort by created_at and then call drop_duplicates with keep="last":

import pandas as pd

# Parse the ISO 8601 strings so the .dt accessor is available
df["created_at"] = pd.to_datetime(df["created_at"])
# Sort by date and keep the last row for each article_id
df = df.sort_values("created_at")
df = df.drop_duplicates("article_id", keep="last")
# Convert dates to YYYY-MM-DD (%m is the month; %M would be the minutes)
df["created_at"] = df["created_at"].dt.strftime("%Y-%m-%d")

Result:

   created_at  user_id     user_email  article_id       key  value
0  2023-12-05   111111  [email protected]           1  included     -1
1  2023-12-05   111111  [email protected]           2  included     -1
6  2023-12-05   111111  [email protected]           3  included     -1
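
The question also asks about selecting rows from before a specific date and mentions a file of about 18,000 rows, which presumably contains more than one user. Below is a minimal sketch of how both could be handled, not a definitive implementation. The helper name latest_entries, its before parameter, and the inline sample data (user_email column omitted) are assumptions made for illustration. It deduplicates on the (user_id, article_id) pair rather than on article_id alone, and applies the cutoff before deduplicating so that each pair keeps its latest state as of that date.

```python
import io

import pandas as pd

# Hypothetical sample mirroring the log in the question (user_email omitted).
CSV = """created_at,user_id,article_id,key,value
2023-12-05T20:04:45.088Z,111111,1,included,-1
2023-12-05T20:05:32.964Z,111111,2,included,-1
2023-12-05T20:06:31.980Z,111111,3,included,-1
2023-12-05T20:06:33.730Z,111111,3,included,-1
2023-12-05T20:06:36.387Z,111111,3,included,1
2023-12-05T20:06:38.621Z,111111,3,included,1
2023-12-05T20:06:56.200Z,111111,3,included,-1
"""


def latest_entries(df, before=None):
    """Keep the newest row per (user_id, article_id), optionally before a cutoff."""
    out = df.copy()
    # Parse the ISO 8601 strings into timezone-aware (UTC) timestamps.
    out["created_at"] = pd.to_datetime(out["created_at"], utc=True)
    if before is not None:
        # Filter first, so the "last" row is the last one before the cutoff.
        cutoff = pd.to_datetime(before, utc=True)
        out = out[out["created_at"] < cutoff]
    out = out.sort_values("created_at")
    out = out.drop_duplicates(subset=["user_id", "article_id"], keep="last")
    # Format as YYYY-MM-DD only at the end; the column becomes plain strings.
    out["created_at"] = out["created_at"].dt.strftime("%Y-%m-%d")
    return out.reset_index(drop=True)


df = pd.read_csv(io.StringIO(CSV))
result = latest_entries(df, before="2023-12-06")
print(result)
```

The date filter is done while created_at is still a datetime column, because after strftime the values are strings and can no longer be compared as timestamps. If the same user can never appear twice for an article, deduplicating on article_id alone, as in the answer, gives the same result.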