I have two DataFrames that I want to merge.
import pandas as pd

df1 = pd.DataFrame.from_dict({
    'names': ['klas', 'erik', 'stefan'],
    'age': [6, 17, 28]
})
df2 = pd.DataFrame.from_dict({
    'salary': [10, 15, 45, 600],
    'names': ['klas', 'erik', 'stefan', 'stefan'],
    'age_range': pd.IntervalIndex.from_tuples([(0, 10), (10, 20), (20, 30), (0, 10)])
})
df1:
    names  age
0    klas    6
1    erik   17
2  stefan   28
df2:
   salary   names age_range
0      10    klas   (0, 10]
1      15    erik  (10, 20]
2      45  stefan  (20, 30]
3     600  stefan   (0, 10]
If I merge on the names column, the age in the last row does not fall inside the interval in the age_range column ((0, 10]):
m1 = df1.merge(df2, on='names', how='left')
print(m1)
    names  age  salary age_range
0    klas    6      10   (0, 10]
1    erik   17      15  (10, 20]
2  stefan   28      45  (20, 30]
3  stefan   28     600   (0, 10]
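What I really want is to merge on the key columns age and age_range. The problem is that age is a column of floats, whereas age_range is a column of pandas Intervals. Unfortunately, a merge like this does not work: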
df1.merge(df2,left_on=['age','names'],right_on=['age_range','names'],how='left')
This only produces a salary column filled with NaN. To get the result I want, I currently have to merge on names and then do something like:
def check_if_age_between(age, age_range):
    return age in age_range

f1 = lambda row: check_if_age_between(row['age'], row['age_range'])
m1 = m1[m1.apply(f1, axis=1)]
print(m1)
    names  age  salary age_range
0    klas    6      10   (0, 10]
1    erik   17      15  (10, 20]
2  stefan   28      45  (20, 30]
Is there a way to merge the two DataFrames directly, using a float column in the first DataFrame and a pandas Interval column in the second as the keys?
You can use pd.cut to apply the age ranges to the first DataFrame and then merge like this:
df1['age_range'] = pd.cut(df1['age'], bins=[0,10,20,30,40])
df_out = df1.merge(df2, on = ['names', 'age_range'])
df_out
Output:
    names  age age_range  salary
0    klas    6   (0, 10]      10
1    erik   17  (10, 20]      15
2  stefan   28  (20, 30]      45
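The bins above are typed in by hand. As a minimal sketch, assuming the intervals in df2 are contiguous and closed on the right (as they are in the question's data), the bin edges could instead be collected from df2 itself. The names `ranges` and `bin_edges` are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'names': ['klas', 'erik', 'stefan'], 'age': [6, 17, 28]})
df2 = pd.DataFrame({
    'salary': [10, 15, 45, 600],
    'names': ['klas', 'erik', 'stefan', 'stefan'],
    'age_range': pd.IntervalIndex.from_tuples([(0, 10), (10, 20), (20, 30), (0, 10)]),
})

# Gather every left and right endpoint found in df2, sorted and de-duplicated.
ranges = df2['age_range'].array
bin_edges = np.unique(
    np.concatenate([ranges.left.to_numpy(), ranges.right.to_numpy()])
)

# pd.cut builds right-closed bins by default, i.e. (0, 10], (10, 20], (20, 30].
# Any age that falls outside every bin would become NaN.
df1['age_range'] = pd.cut(df1['age'], bins=bin_edges)

print(bin_edges)
print(df1)
```

df1 can then be merged on ['names', 'age_range'] as shown above. If the intervals in df2 had gaps or overlapped, edges built this way would no longer reproduce them, so this only fits data shaped like the example.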
conditional_join from pyjanitor covers your use case. You create temporary start and end columns from the interval array:
# pip install pyjanitor
import janitor
import pandas as pd
(df1
 .conditional_join(
     df2.assign(start=df2.age_range.array.left,
                end=df2.age_range.array.right),
     # column from the left, column from the right, operator
     ('names', 'names', '=='),
     # the intervals are open on the left and closed on the right
     ('age', 'start', '>'),
     ('age', 'end', '<='),
     # columns to return from the right dataframe
     right_columns=['salary', 'age_range'],
     # more performance may be possible with numba
     # if you have many duplicated values in the equality join
     use_numba=False,
     # you may force the inequality join to execute first
     # if you know that it returns fewer rows
     # than the equality join
     force=False,
     how='inner')
)
    names  age  salary age_range
0    klas    6      10   (0, 10]
1    erik   17      15  (10, 20]
2  stefan   28      45  (20, 30]
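If installing pyjanitor is not an option, the same combination of an equality condition and a range condition can be approximated in plain pandas. This is a rough sketch, not a drop-in replacement: it merges on names first and then filters with vectorized comparisons against the interval bounds, so the intermediate frame may get large when names repeat often. The variable names `merged`, `intervals` and `inside` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'names': ['klas', 'erik', 'stefan'], 'age': [6, 17, 28]})
df2 = pd.DataFrame({
    'salary': [10, 15, 45, 600],
    'names': ['klas', 'erik', 'stefan', 'stefan'],
    'age_range': pd.IntervalIndex.from_tuples([(0, 10), (10, 20), (20, 30), (0, 10)]),
})

# Step 1: equality join on names only.
merged = df1.merge(df2, on='names', how='inner')

# Step 2: compare each age against the bounds of its row's interval.
# The intervals are open on the left and closed on the right: left < age <= right.
intervals = pd.arrays.IntervalArray(merged['age_range'])
age = merged['age'].to_numpy()
inside = (age > intervals.left.to_numpy()) & (age <= intervals.right.to_numpy())

result = merged[inside].reset_index(drop=True)
print(result)
```

This keeps the logic of the row-wise apply from the question but replaces the Python-level loop with array comparisons.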