Merging a DataFrame float column with a DataFrame pandas Interval column

Question  Votes: 0  Answers: 2

I have two DataFrames that I want to merge.

import pandas as pd

df1=pd.DataFrame.from_dict({
'names':['klas','erik','stefan'],
'age':[6,17,28] 
})

df2=pd.DataFrame.from_dict(
{'salary':[10,15,45,600],
 'names':['klas','erik','stefan','stefan'],
 'age_range':pd.IntervalIndex.from_tuples([(0,10),(10,20),(20,30),(0,10)])    
})

df1:

    names   age
0   klas    6
1   erik    17
2   stefan  28

df2:

   salary   names age_range
0      10    klas   (0, 10]
1      15    erik  (10, 20]
2      45  stefan  (20, 30]
3     600  stefan   (0, 10]

If I merge on the names column, the age in the last row does not fall within the interval in the age_range column ((0, 10]):

m1=df1.merge(df2,on='names',how='left')
print(m1)

    names  age salary age_range
0   klas    6   10    (0, 10]
1   erik    17  15    (10, 20]
2   stefan  28  45    (20, 30]
3   stefan  28  600   (0, 10]

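What I really want is to merge on the key columns age and age_range. The problem is that age is a column of floats, while age_range is a column of pandas Intervals. Unfortunately, a merge like this does not work: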

df1.merge(df2,left_on=['age','names'],right_on=['age_range','names'],how='left')

This only produces a salary column full of NaN. To get the result I want, I currently have to merge on names and then do something like:

def check_if_age_between(age,age_range):
    return age in age_range

f1=lambda row: check_if_age_between(row['age'],row['age_range'])

m1=m1[m1.apply(f1,axis=1)]

print(m1)


    names  age  salary age_range
0    klas    6      10   (0, 10]
1    erik   17      15  (10, 20]
2  stefan   28      45  (20, 30]
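
The row-wise filter above can also be written as a vectorized comparison against the interval bounds. The following is only a sketch: it assumes every interval is right-closed, as in the example data, and the names bounds, mask and filtered are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"names": ["klas", "erik", "stefan"], "age": [6, 17, 28]})
df2 = pd.DataFrame({
    "salary": [10, 15, 45, 600],
    "names": ["klas", "erik", "stefan", "stefan"],
    "age_range": pd.IntervalIndex.from_tuples([(0, 10), (10, 20), (20, 30), (0, 10)]),
})

# Merge on names first, exactly as in the question.
m1 = df1.merge(df2, on="names", how="left")

# The interval column is backed by an IntervalArray that exposes its bounds.
bounds = m1["age_range"].array
left = np.asarray(bounds.left)
right = np.asarray(bounds.right)
ages = m1["age"].to_numpy()

# Assumption: right-closed intervals, so the test is left < age <= right.
mask = (ages > left) & (ages <= right)
filtered = m1[mask]
print(filtered)
```

This avoids calling a Python function once per row, which may matter on larger frames. The intermediate merge on names still has to be materialized first.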

Is there a way to merge the two DataFrames using a float column in the first DataFrame and a pandas Interval column in the second as keys?

pandas dataframe merge intervals
2 Answers
1 vote

You can use pd.cut to apply the age ranges to the first DataFrame and then merge like this:

df1['age_range'] = pd.cut(df1['age'], bins=[0,10,20,30,40])
df_out = df1.merge(df2, on = ['names', 'age_range'])

df_out

Output:

    names  age age_range  salary
0    klas    6   (0, 10]      10
1    erik   17  (10, 20]      15
2  stefan   28  (20, 30]      45
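
The bin edges above are hard-coded. As a sketch of one possible variation, the bins can be taken from the intervals already present in df2. This assumes the distinct intervals do not overlap (pd.cut rejects overlapping IntervalIndex bins), and the extra age of 45 is made up here to show what happens outside every bin.

```python
import pandas as pd

df2 = pd.DataFrame({
    "salary": [10, 15, 45, 600],
    "names": ["klas", "erik", "stefan", "stefan"],
    "age_range": pd.IntervalIndex.from_tuples([(0, 10), (10, 20), (20, 30), (0, 10)]),
})

# Distinct intervals from df2, used directly as bins.
bins = pd.IntervalIndex(df2["age_range"].unique())

# 45 is a hypothetical age that lies outside every interval.
ages = pd.Series([6, 17, 28, 45])
binned = pd.cut(ages, bins=bins)
print(binned)
```

Ages that fall in no bin come back as NaN, so those rows would drop out of an inner merge on the binned column.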

1 vote

conditional_join (from pyjanitor) covers your use case: you create temporary start and end columns from the interval array:

# pip install pyjanitor
import janitor
import pandas as pd
(df1
.conditional_join(
    df2.assign(start=df2.age_range.array.left, 
               end=df2.age_range.array.right), 
    # column from the left, column from the right, operator
    ('names', 'names', '=='), 
    # strict '>' because the intervals are open on the left, e.g. (0, 10]
    ('age', 'start', '>'), 
    ('age', 'end', '<='), 
    # columns to return from the right dataframe
    right_columns=['salary','age_range'], 
    # more performance may be possible in numba
    # if you have many duplicated values in the equality join
    use_numba=False,
    # you may force the inequality join to execute first
    # if you know that it returns fewer rows
    # than the equality join
    force=False,
    how = 'inner')
)

Output:

    names  age  salary age_range
0    klas    6      10   (0, 10]
1    erik   17      15  (10, 20]
2  stefan   28      45  (20, 30]