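I want to do a conditional upsert between two pandas DataFrames, similar to the MERGE INTO statement in SQL. For each row in the source DataFrame, if its index does not exist in the target DataFrame, insert the row. If the index does exist, check a secondary condition and, if it is met, update the existing row.
Here is an example: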
import pandas as pd
df1 = pd.DataFrame([{'index': '1st', 'checkval': 2, 'storeval': 'elephant'},
                    {'index': '2nd', 'checkval': 7, 'storeval': 'giraffe'}]).set_index('index')
df2 = pd.DataFrame([{'index': '1st', 'checkval': 3, 'storeval': 'hippopotamus'},
                    {'index': '3rd', 'checkval': 4, 'storeval': 'seagull'}]).set_index('index')
This is what df1 looks like:
       checkval  storeval
index
1st           2  elephant
2nd           7   giraffe
This is what df2 looks like:
       checkval      storeval
index
1st           3  hippopotamus
3rd           4       seagull
This is the brute-force version of what I described:
for ind2, row2 in df2.iterrows():
    found = False
    for ind1, row1 in df1.iterrows():
        if ind2 == ind1:
            # Index matched
            found = True
            if row2['checkval'] > row1['checkval']:
                # Condition met, update the existing row
                df1.loc[ind1] = row2
    if not found:
        # Row not already in df1, insert it.
        # (Originally df1 = df1.append(row2); DataFrame.append was removed in pandas 2.0.)
        df1.loc[ind2] = row2
The output is:
       checkval      storeval
index
1st           3  hippopotamus
2nd           7       giraffe
3rd           4       seagull
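However, I would really like to find some kind of built-in function, for example
df1.merge(df2, how = 'left', conditions = lambda df1,df2: df2['checkval']>df1['checkval'])
or something similar. Does anyone have suggestions for improving on the brute-force approach?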
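Don't create unnecessary loops in pandas; they slow things down and clutter the code.
I think we can combine pd.concat (DataFrame.append in pandas before 2.0) with DataFrame.sort_values and groupby.last:
new_df = pd.concat([df1, df2]).sort_values('checkval').groupby(level=0).last()
#new_df = pd.concat([df1, df2]).sort_values('checkval').groupby(level='index').last()
Alternative, using Index.duplicated:
# Sort first so that the duplicate mask lines up with the rows it is applied to
new_df = pd.concat([df1, df2]).sort_values('checkval')
new_df = new_df.loc[~new_df.index.duplicated(keep='last'), :].sort_index()
print(new_df)
For the sample data, both versions give the same three rows as the expected output in the question.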
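One caveat with the sort-and-keep-last idea: it always keeps the row with the largest checkval, so when the two values are equal, which row survives depends on the sort order, whereas the loop in the question uses a strict `>` and keeps the df1 row on ties. If that distinction matters, the condition can be applied explicitly. The following is only a sketch under assumptions: `conditional_upsert` is a hypothetical helper name, not a pandas function, and it assumes both indexes are unique and both frames have the same columns.

```python
import pandas as pd


def conditional_upsert(target, source, col):
    """Hypothetical helper: upsert `source` into `target`, keyed on the index.

    Assumes both indexes are unique and both frames share the same columns.
    """
    # Index labels present in both frames (candidates for an update).
    common = source.index.intersection(target.index)
    # Strict comparison, as in the question: ties keep the target row.
    wins = (source.loc[common, col] > target.loc[common, col]).to_numpy()
    to_update = common[wins]

    result = target.copy()
    if len(to_update) > 0:
        # Assigning a DataFrame through .loc aligns on index and columns.
        result.loc[to_update] = source.loc[to_update]

    # Labels found only in `source` are appended as new rows.
    new_rows = source.loc[source.index.difference(target.index)]
    return pd.concat([result, new_rows])


# Sample data from the question.
df1 = pd.DataFrame([{'index': '1st', 'checkval': 2, 'storeval': 'elephant'},
                    {'index': '2nd', 'checkval': 7, 'storeval': 'giraffe'}]).set_index('index')
df2 = pd.DataFrame([{'index': '1st', 'checkval': 3, 'storeval': 'hippopotamus'},
                    {'index': '3rd', 'checkval': 4, 'storeval': 'seagull'}]).set_index('index')

merged = conditional_upsert(df1, df2, 'checkval')
print(merged)
```

Unlike the loop, this returns a new frame and leaves df1 untouched; swapping `>` for another comparison changes the update condition without touching the insert logic.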