Concatenating two DataFrames without any duplicate values

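I want to concatenate two DataFrames (df_1 and df_2) by date to form a new df_1_2 with the columns Date, A, B and C, in which every date appears with its respective values and no row is duplicated.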

Current code:

import pandas as pd

# create dictionary

dict1 = {"Date":["2000-01-01", "2000-01-04", "2000-01-05", "2000-01-07"], "A":[99, 93,100,97], "B": [106,107,109,105]}
dict2 = {"Date":["2000-01-01", "2000-01-03", "2000-01-05", "2000-01-07"], "A":[99, 96,100,97], "B": [106,100,109,105], "C":[2,5,8,4]}

# create dataframe using dict1

df_1 = pd.DataFrame(dict1)
df_1["Date"] = pd.to_datetime(df_1["Date"])
df_1.set_index("Date", inplace = True)

# create dataframe using dict2

df_2 = pd.DataFrame(dict2)
df_2["Date"] = pd.to_datetime(df_2["Date"])
df_2.set_index("Date", inplace = True)

# concat df_1 & df_2

df_1_2 = pd.concat([df_1, df_2])
print(df_1_2)

Expected output:

1 Answer

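Did I understand this correctly?

  • By duplicates, you mean rows that are identical regardless of 'C'
  • When such a duplicate occurs, you want to keep the row from df_2, because it carries the extra information ('C')

If that is correct, then, since these are not duplicates in the pandas sense (the 'C' values differ between the two rows):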

  • Concatenate as you already do
    df_1_2 = pd.concat([df_1, df_2])
  • Then handle the duplicates (mark the rows, then drop them)
# Make sure rows from df_2 (non-NaN 'C') are placed on top; NaN values sort last
df_1_2.sort_values(by='C', inplace=True)

# Mark duplicates with regards to 'A' and 'B'
df_1_2['duplicate'] = df_1_2.drop(['C'], axis=1).duplicated()

Intermediate output:

              A    B    C  duplicate
Date                                
2000-01-01   99  106  2.0      False
2000-01-07   97  105  4.0      False
2000-01-03   96  100  5.0      False
2000-01-05  100  109  8.0      False
2000-01-01   99  106  NaN       True
2000-01-04   93  107  NaN      False
2000-01-05  100  109  NaN       True
2000-01-07   97  105  NaN       True
# Remove duplicates, drop 'duplicate' temporary column, then restore order by date (index). 
df_1_2.loc[~df_1_2.duplicate].drop('duplicate', axis=1).sort_index()

Output:

              A    B    C
Date                     
2000-01-01   99  106  2.0
2000-01-03   96  100  5.0
2000-01-04   93  107  NaN
2000-01-05  100  109  8.0
2000-01-07   97  105  4.0
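
Note that the check above compares only 'A' and 'B' and ignores the date, so two different dates that happen to share the same values would also be flagged. As a possible alternative, here is a minimal sketch that detects duplicates on the Date index instead. It assumes you want exactly one row per date and that, when a date occurs in both frames, the df_2 row should win. The names combined, result and no_full_row_duplicates are only illustrative.

```python
import pandas as pd

# Same sample data as in the question
dict1 = {"Date": ["2000-01-01", "2000-01-04", "2000-01-05", "2000-01-07"],
         "A": [99, 93, 100, 97], "B": [106, 107, 109, 105]}
dict2 = {"Date": ["2000-01-01", "2000-01-03", "2000-01-05", "2000-01-07"],
         "A": [99, 96, 100, 97], "B": [106, 100, 109, 105], "C": [2, 5, 8, 4]}

df_1 = pd.DataFrame(dict1)
df_1["Date"] = pd.to_datetime(df_1["Date"])
df_1 = df_1.set_index("Date")

df_2 = pd.DataFrame(dict2)
df_2["Date"] = pd.to_datetime(df_2["Date"])
df_2 = df_2.set_index("Date")

# A whole-row comparison finds nothing: 'C' is NaN in the rows coming from
# df_1 but filled in the rows coming from df_2, so no two rows are identical.
no_full_row_duplicates = not pd.concat([df_1, df_2]).duplicated().any()

# Put df_2 first so that its rows (which carry 'C') are the ones kept
# when the same date occurs in both frames.
combined = pd.concat([df_2, df_1])

# Assumption: one row per date is wanted, so duplicates are detected on the
# index (Date) rather than on the 'A' and 'B' values.
result = combined[~combined.index.duplicated(keep="first")].sort_index()
print(result)
```

With the sample data this gives the same five rows as the output above. If a date could appear in both frames with different 'A' or 'B' values, this sketch silently keeps the df_2 row, which may or may not be what you want.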