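I want to join two DataFrames (df_1 and df_2) on their dates to form a new df_1_2 with the columns Date, A, B, C, in which every date appears with its respective values and there are no duplicated rows.
Current code: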
import pandas as pd
# create dictionary
dict1 = {"Date":["2000-01-01", "2000-01-04", "2000-01-05", "2000-01-07"], "A":[99, 93,100,97], "B": [106,107,109,105]}
dict2 = {"Date":["2000-01-01", "2000-01-03", "2000-01-05", "2000-01-07"], "A":[99, 96,100,97], "B": [106,100,109,105], "C":[2,5,8,4]}
# create dataframe using dict1
df_1 = pd.DataFrame(dict1)
df_1["Date"] = pd.to_datetime(df_1["Date"])
df_1.set_index("Date", inplace = True)
# create dataframe using dict2
df_2 = pd.DataFrame(dict2)
df_2["Date"] = pd.to_datetime(df_2["Date"])
df_2.set_index("Date", inplace = True)
# concat df_1 & df_2
df_1_2 = pd.concat([df_1, df_2])
print(df_1_2)
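Expected output:
Did I understand correctly?
If so, note that these rows are not duplicates in the pandas sense (they differ in C, which is NaN for the rows coming from df_1), so you have to mark them yourself: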
df_1_2 = pd.concat([df_1, df_2])
# Sort by 'C' so that rows from df_2 (non-NaN 'C') are placed on top; NaN sorts last
df_1_2.sort_values(by='C', inplace=True)
# Mark duplicates with regards to 'A' and 'B'
df_1_2['duplicate'] = df_1_2.drop(['C'], axis=1).duplicated()
Intermediate output:
              A    B    C  duplicate
Date
2000-01-01   99  106  2.0      False
2000-01-07   97  105  4.0      False
2000-01-03   96  100  5.0      False
2000-01-05  100  109  8.0      False
2000-01-01   99  106  NaN       True
2000-01-04   93  107  NaN      False
2000-01-05  100  109  NaN       True
2000-01-07   97  105  NaN       True
# Remove duplicates, drop 'duplicate' temporary column, then restore order by date (index).
df_1_2.loc[~df_1_2.duplicate].drop('duplicate', axis=1).sort_index()
Output:
              A    B    C
Date
2000-01-01   99  106  2.0
2000-01-03   96  100  5.0
2000-01-04   93  107  NaN
2000-01-05  100  109  8.0
2000-01-07   97  105  4.0
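One caveat with the approach above: duplicated() only compares the columns A and B, not the Date index, so two different dates that happen to share the same A and B values would be treated as duplicates. A minimal sketch of a variant that also takes the date into account is shown below. It assumes that the rows from df_2 should win whenever a date appears in both frames; the helper name make_frame and the variable names result and alt are only illustrative and not part of the question.

```python
import pandas as pd


def make_frame(data):
    """Build a DataFrame indexed by a parsed 'Date' column (illustrative helper)."""
    df = pd.DataFrame(data)
    df["Date"] = pd.to_datetime(df["Date"])
    return df.set_index("Date")


df_1 = make_frame({
    "Date": ["2000-01-01", "2000-01-04", "2000-01-05", "2000-01-07"],
    "A": [99, 93, 100, 97],
    "B": [106, 107, 109, 105],
})
df_2 = make_frame({
    "Date": ["2000-01-01", "2000-01-03", "2000-01-05", "2000-01-07"],
    "A": [99, 96, 100, 97],
    "B": [106, 100, 109, 105],
    "C": [2, 5, 8, 4],
})

result = (
    pd.concat([df_2, df_1])      # df_2 first, so its rows are the ones kept
    .reset_index()               # turn the Date index into a regular column
    .drop_duplicates(subset=["Date", "A", "B"], keep="first")
    .set_index("Date")
    .sort_index()                # restore chronological order
)
print(result)

# Alternative, assuming df_2 should take priority cell by cell:
# combine_first keeps df_2's values and fills the gaps from df_1.
alt = df_2.combine_first(df_1)
print(alt)
```

Both variants produce one row per date for this data. They differ if the two frames disagree on A or B for the same date: the drop_duplicates version keeps both rows, while combine_first silently prefers the value from df_2.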