How to concatenate two identically structured DataFrames in pandas and drop duplicates by ID?

Question

In Python 3 with pandas, I have two DataFrames with the same structure:

df_posts_final_1.info()
<class 'pandas.core.frame.DataFrame'>                                           
RangeIndex: 32669 entries, 0 to 32668
Data columns (total 12 columns):
post_id        32479 non-null object
text           31632 non-null object
post_text      30826 non-null object
shared_text    3894 non-null object
time           32616 non-null object
image          24585 non-null object
likes          32669 non-null object
comments       32669 non-null object
shares         32669 non-null object
post_url       26157 non-null object
link           4343 non-null object
cpf            32669 non-null object
dtypes: object(12)
memory usage: 3.0+ MB

df_posts_final_2.info()
<class 'pandas.core.frame.DataFrame'>                                           
RangeIndex: 33883 entries, 0 to 33882
Data columns (total 12 columns):
post_id        33698 non-null object
text           32755 non-null object
post_text      31901 non-null object
shared_text    3986 non-null object
time           33829 non-null object
image          25570 non-null object
likes          33883 non-null object
comments       33883 non-null object
shares         33883 non-null object
post_url       27286 non-null object
link           4446 non-null object
cpf            33883 non-null object
dtypes: object(12)
memory usage: 3.1+ MB

I want to combine them, which I can do like this:

frames = [df_posts_final_1, df_posts_final_1]
result = pd.concat(frames)

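However, the "post_id" column holds unique identification codes. So when an ID "X" is present in df_posts_final_1, it should not appear twice in the final DataFrame result.

For example, if the code "FLK1989" appears in both df_posts_final_1 and df_posts_final_2, I want to keep only the last record, the one from df_posts_final_2.

Does anyone know the right strategy for doing this?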

Answer 1

Fix your code (the list should contain df_posts_final_2, not df_posts_final_1 twice) and add groupby + tail:

frames = [df_posts_final_1, df_posts_final_2]
result = pd.concat(frames).groupby('post_id').tail(1)
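
An alternative worth considering is drop_duplicates with keep='last', which states the "keep the record from df_posts_final_2" rule directly. The info() output in the question shows that post_id has some missing values, and those need care. As far as I know, groupby leaves out rows whose key is null by default, and drop_duplicates treats all missing values as equal to each other, so either approach could silently lose rows without an ID. The sketch below is only an illustration under those assumptions. The toy frames and their values are hypothetical stand-ins for the real data, and it assumes that rows without a post_id should all be kept.

```python
import pandas as pd

# Hypothetical toy frames standing in for df_posts_final_1 and df_posts_final_2
df_posts_final_1 = pd.DataFrame({
    "post_id": ["FLK1989", "ABC0001", None],
    "text": ["old version", "only in frame 1", "no id, frame 1"],
})
df_posts_final_2 = pd.DataFrame({
    "post_id": ["FLK1989", "XYZ0002", None],
    "text": ["new version", "only in frame 2", "no id, frame 2"],
})

# Stack the frames; ignore_index=True gives a fresh 0..n-1 index
combined = pd.concat([df_posts_final_1, df_posts_final_2], ignore_index=True)

# Rows with a missing post_id cannot be matched, so set them aside
has_id = combined["post_id"].notna()

# For each post_id keep only the last occurrence (the row from frame 2)
deduped = combined[has_id].drop_duplicates(subset="post_id", keep="last")

# Put the rows without an ID back, unchanged (assumption: they should be kept)
result = pd.concat([deduped, combined[~has_id]], ignore_index=True)
print(result)
```

Because df_posts_final_2 comes second in the concat, "last" means its record wins whenever an ID appears in both frames. If rows without a post_id are not wanted, the final concat can simply be dropped.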

Answer 2

Try using:
