Python pandas: match two tables to find the latest date

Question (votes: 0, answers: 3)

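I want to do a lookup in pandas, similar to VLOOKUP in Excel. For each threshold value in Table 1, I need to find the latest date in Table 2 on which that Name had that Value:

Table 1: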

Name  Threshold1   Threshold2
A     9            8
B     14           13

Table 2:

Date   Name   Value   
1/1    A      10
1/2    A      9
1/3    A      9
1/4    A      8
1/5    A      8
1/1    B      15
1/2    B      14
1/3    B      14
1/4    B      13
1/5    B      13

The desired table looks like this:

Name  Threshold1   Threshold1_Date   Threshold2   Threshold2_Date
A     9            1/3               8            1/5
B     14           1/3               13           1/5
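
For anyone who wants to run the answers below, here is a minimal sketch that builds the two tables as DataFrames. The names `df1` and `df2` are the ones the answers use; keeping `Date` as plain strings is an assumption based on how the dates are printed in the question.

```python
import pandas as pd

# Table 1: one row per Name, with the threshold values to look up.
df1 = pd.DataFrame({
    "Name": ["A", "B"],
    "Threshold1": [9, 14],
    "Threshold2": [8, 13],
})

# Table 2: dated observations per Name.
# Dates are kept as strings here (an assumption, see above).
df2 = pd.DataFrame({
    "Date": ["1/1", "1/2", "1/3", "1/4", "1/5"] * 2,
    "Name": ["A"] * 5 + ["B"] * 5,
    "Value": [10, 9, 9, 8, 8, 15, 14, 14, 13, 13],
})
```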

Thanks in advance!

python pandas dataframe match lookup
3 Answers
3 votes

Code

# assuming df2 is already sorted by `Date`
# drop the duplicates per Name and Value, keeping the last (latest) date
cols = ['Name', 'Value']
s = df2.drop_duplicates(cols, keep='last').set_index(cols)['Date']

# for each threshold column use MultiIndex.map to look up
# the date in df2 that matches on Name and threshold value
for c in df1.filter(like='Threshold'):
    df1[c + '_date'] = df1.set_index(['Name', c]).index.map(s)

Result

  Name  Threshold1  Threshold2 Threshold1_date Threshold2_date
0    A           9           8             1/3             1/5
1    B          14          13             1/3             1/5
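
The code above relies on `df2` already being in date order. If that is not guaranteed, one option is to sort on a parsed copy of the dates first. This is only a sketch: the function name `add_latest_dates` and the `"%m/%d"` format are my assumptions (the question does not say whether the dates are month/day), and the sample `df2` is deliberately reversed to show that row order no longer matters.

```python
import pandas as pd

def add_latest_dates(df1, df2, date_format="%m/%d"):
    """Hypothetical helper: same lookup as above, without assuming df2 is sorted."""
    # Parse the date strings so the sort is chronological, not lexicographic
    # (as strings, "1/10" would sort before "1/3").
    parsed = pd.to_datetime(df2["Date"], format=date_format)
    ordered = df2.assign(_parsed=parsed).sort_values("_parsed")

    # Keep the latest row per (Name, Value); values are the original date strings.
    cols = ["Name", "Value"]
    latest = ordered.drop_duplicates(cols, keep="last").set_index(cols)["Date"]

    out = df1.copy()
    for c in df1.filter(like="Threshold").columns:
        out[c + "_date"] = out.set_index(["Name", c]).index.map(latest)
    return out

df1 = pd.DataFrame({"Name": ["A", "B"], "Threshold1": [9, 14], "Threshold2": [8, 13]})
df2 = pd.DataFrame({
    "Date": ["1/1", "1/2", "1/3", "1/4", "1/5"] * 2,
    "Name": ["A"] * 5 + ["B"] * 5,
    "Value": [10, 9, 9, 8, 8, 15, 14, 14, 13, 13],
}).iloc[::-1].reset_index(drop=True)  # reversed on purpose: not sorted by date

result = add_latest_dates(df1, df2)
print(result)
```

The original date strings are kept in the output; the parsed column is only used for ordering.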

2 votes

Does this help?

df_out = (
    df1.melt('Name', value_name='Value')
       .merge(df2, on=['Name', 'Value'])
       .sort_values('Date')
       .drop_duplicates(['Name', 'variable'], keep='last')
       .set_index(['Name', 'variable'])
       .unstack()
       .sort_index(level=1, axis=1)
)
# flatten the two-level column index into names like 'Date_Threshold1'
df_out = df_out.set_axis(df_out.columns.map('_'.join), axis=1).reset_index()

Output:

  Name Date_Threshold1  Value_Threshold1 Date_Threshold2  Value_Threshold2
0    A             1/3                 9             1/5                 8
1    B             1/3                14             1/5                13
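
Note that this output uses `Date_Threshold1` style names rather than the `Threshold1_Date` layout asked for in the question, and that sorting `Date` as a string only works while all days are single digits. Below is a rough sketch of the same melt-and-merge idea that picks the latest row on a parsed date and then pivots into the requested layout. The `"%m/%d"` format and the helper column names `threshold` and `_parsed` are my assumptions, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["A", "B"], "Threshold1": [9, 14], "Threshold2": [8, 13]})
df2 = pd.DataFrame({
    "Date": ["1/1", "1/2", "1/3", "1/4", "1/5"] * 2,
    "Name": ["A"] * 5 + ["B"] * 5,
    "Value": [10, 9, 9, 8, 8, 15, 14, 14, 13, 13],
})

threshold_cols = [c for c in df1.columns if c != "Name"]

# Long form: one row per (Name, threshold column), joined to every matching date.
long = (
    df1.melt("Name", var_name="threshold", value_name="Value")
       .merge(df2, on=["Name", "Value"])
)

# Pick the chronologically latest match per (Name, threshold column).
long["_parsed"] = pd.to_datetime(long["Date"], format="%m/%d")
latest = long.loc[long.groupby(["Name", "threshold"])["_parsed"].idxmax()]

# Back to wide form, with one '<threshold>_Date' column per threshold.
wide = latest.pivot(index="Name", columns="threshold", values="Date").add_suffix("_Date")
result = df1.merge(wide, left_on="Name", right_index=True, how="left")

# Interleave each threshold with its date column, as in the question.
ordered = ["Name"] + [c for t in threshold_cols for c in (t, f"{t}_Date")]
result = result[ordered]
print(result)
```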

0 votes

Here is one way to solve your problem:

idxByThreshCol = ( df1.set_index('Name').pipe(lambda d:
    {col: d[[col]]
        .rename(columns={col:'Value'})
        .set_index('Value', append=True) for col in d.columns}) )
latestDtByNameVal = df2.groupby(['Name','Value']).last()
res = ( df1
    .assign(**{f'{k}_Date': latestDtByNameVal.loc[v.index,'Date'].to_numpy() 
        for k, v in idxByThreshCol.items()}) )

If you want the result columns in the same order as in your question, you can add the following:

from itertools import chain
res = res[['Name'] + list(chain.from_iterable([[col, f'{col}_Date'] for col in df1.drop(columns='Name').columns]))]

Output:

  Name  Threshold1 Threshold1_Date  Threshold2 Threshold2_Date
0    A           9             1/3           8             1/5
1    B          14             1/3          13             1/5
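
One caveat with the `.loc` lookup above: as far as I know it raises a `KeyError` when a (Name, threshold) pair never occurs in `df2`. A possible variant, sketched here under that assumption, uses `reindex` so that missing pairs come back as `NaN` instead. The threshold value 99 for B is hypothetical and was added only to show the missing case; like the answer above, this assumes `df2` is already in date order.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["A", "B"],
    "Threshold1": [9, 14],
    "Threshold2": [8, 99],  # 99 is hypothetical: B never has this value in df2
})
df2 = pd.DataFrame({
    "Date": ["1/1", "1/2", "1/3", "1/4", "1/5"] * 2,
    "Name": ["A"] * 5 + ["B"] * 5,
    "Value": [10, 9, 9, 8, 8, 15, 14, 14, 13, 13],
})

# Last date per (Name, Value); relies on df2 being in date order.
latest = df2.groupby(["Name", "Value"])["Date"].last()

res = df1.copy()
for col in ["Threshold1", "Threshold2"]:
    # Build (Name, threshold) keys and look them up; pairs without a match give NaN.
    keys = pd.MultiIndex.from_frame(df1[["Name", col]])
    res[f"{col}_Date"] = latest.reindex(keys).to_numpy()

print(res)
```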