Pandas conditional outer join based on timedelta (merge_asof)


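I have several dataframes that need to be merged into a single dataset, based on a unique identifier (uid) and on the timedelta between the dates in each dataframe.

Here is a simplified example of the dataframes: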

df1

   uid    tx_date last_name first_name  meas_1
0   60 2004-01-11      John      Smith     1.3
1   60 2016-12-24      John      Smith     2.4
2   61 1994-05-05     Betty      Jones     1.2
3   63 2006-07-19     James       Wood     NaN
4   63 2008-01-03     James       Wood     2.9
5   65 1998-10-08       Tom      Plant     4.2
6   66 2000-02-01     Helen       Kerr     1.1

df2

   uid    rx_date last_name first_name meas_2
0   60 2004-01-14      John      Smith      A
1   60 2017-01-05      John      Smith     AB
2   60 2017-03-31      John      Smith    NaN
3   63 2006-07-21     James       Wood      A
4   64 2002-04-18      Bill    Jackson      B
5   65 1998-10-08       Tom      Plant     AA
6   65 2005-12-01       Tom      Plant      B
7   66 2013-12-14     Helen       Kerr      C

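Essentially, I am trying to merge records for the same person from two different sources. The link between records for a unique individual is 'uid', and the link between rows for each individual (where one exists) is that 'tx_date' and 'rx_date' (usually) fall within a specific timedelta of each other. There will not always be an exact or fuzzy match between dates, data may be missing from any column except 'uid', and each dataframe will contain a different but intersecting subset of 'uid' values.

I need to join rows where the 'uid' columns match and where the absolute timedelta between 'tx_date' and 'rx_date' is within a given range (for example, a maximum delta of 14 days). Where the timedelta is outside that range, or either 'tx_date' or 'rx_date' is missing, or the 'uid' exists in only one of the dataframes, I still need to retain the data in that row. The end result should look something like this: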

    uid    tx_date    rx_date first_name last_name  meas_1 meas_2
0    60 2004-01-11 2004-01-14       John     Smith     1.3      A
1    60 2016-12-24 2017-01-05       John     Smith     2.4     AB
2    60        NaT 2017-03-31       John     Smith     NaN    NaN
3    61 1994-05-05        NaT      Betty     Jones     1.2    NaN
4    63 2006-07-19 2006-07-21      James      Wood     NaN      A
5    63 2008-01-03        NaT      James      Wood     2.9    NaN
6    64        NaT 2002-04-18       Bill   Jackson     NaN      B
7    65 1998-10-08 1998-10-08        Tom     Plant     4.2     AA
8    65        NaT 2005-12-01        Tom     Plant     NaN      B
9    66 2000-02-01        NaT      Helen      Kerr     1.1    NaN
10   66        NaT 2013-12-14      Helen      Kerr     NaN      C

It seems that pandas.merge_asof should be useful here, but I have not been able to get it to do what I need.
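One detail that matters for an absolute time delta is the `direction` argument of `merge_asof`. Its default, 'backward', only accepts right-hand keys at or before the left-hand key, so an 'rx_date' that falls after 'tx_date' is never matched. The sketch below uses a single made-up pair of rows (the frame names `left` and `right` are illustrative, not from the question) to show the difference:

```python
import pandas as pd

# One tx row and one rx row for the same uid, 3 days apart (rx is later).
left = pd.DataFrame({"uid": [60], "date": pd.to_datetime(["2004-01-11"])})
right = pd.DataFrame({"uid": [60],
                      "date": pd.to_datetime(["2004-01-14"]),
                      "meas_2": ["A"]})
tol = pd.Timedelta(days=14)

# Default direction='backward': the right key must be <= the left key, so no match.
backward = pd.merge_asof(left, right, on="date", by="uid", tolerance=tol)

# direction='nearest' looks both ways, which is what an absolute delta needs.
nearest = pd.merge_asof(left, right, on="date", by="uid",
                        tolerance=tol, direction="nearest")

print(backward["meas_2"].isna().all())  # no match was found
print(nearest["meas_2"].tolist())       # the rx row 3 days later is matched
```

Either way, `merge_asof` keeps every row of the left frame and at most one match per row, so it behaves like a left join regardless of `direction`.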

When I tried merge_asof on the two real dataframes, I got the error

ValueError: left keys must be sorted

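According to this question, the problem was actually caused by NaT values in the 'date' column of some rows. I dropped the rows with NaT values and sorted each dataframe by its 'date' column, but the result is still not quite what I need.

The code below shows the steps taken.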
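A point worth checking when removing NaT rows: assigning `Series.dropna()` back to a column realigns on the index, so the NaT row stays in the frame. A minimal sketch, using a small made-up frame, of that behaviour and of one way to set the NaT rows aside so they can be re-attached after the merge (`df_missing` and `df_sorted` are hypothetical names):

```python
import pandas as pd

df = pd.DataFrame({
    "uid": [60, 61, 63],
    "tx_date": pd.to_datetime(["2004-01-11", None, "1994-05-05"]),
})

# Column-level dropna: the result is realigned on assignment,
# so the row with NaT is still present.
df["date"] = df["tx_date"].dropna()
n_rows_after_column_dropna = len(df)

# Split on the key instead: rows without a date are kept aside,
# the remaining rows are sorted for merge_asof.
has_date = df["date"].notna()
df_missing = df[~has_date]
df_sorted = df[has_date].sort_values("date")

print(n_rows_after_column_dropna, len(df_missing), len(df_sorted))
```

The rows in `df_missing` can later be appended to the merged result with `pd.concat`, so that no record is lost.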

import pandas as pd

# merge_asof needs a shared key column that is sorted and free of NaT
df1['date'] = pd.to_datetime(df1['tx_date'])
df1 = df1.dropna(subset=['date']).sort_values('date')

df2['date'] = pd.to_datetime(df2['rx_date'])
df2 = df2.dropna(subset=['date']).sort_values('date')

# direction='nearest' matches rx_date on either side of tx_date;
# the default 'backward' would only accept rx_date <= tx_date
df_merged = pd.merge_asof(
    df1, df2, on='date', by='uid',
    tolerance=pd.Timedelta('14 days'), direction='nearest',
).sort_values('uid')

Result:

   uid    tx_date    rx_date last_name_x first_name_x  meas_1 meas_2
3   60 2004-01-11 2004-01-14        John        Smith     1.3      A
6   60 2016-12-24 2017-01-05        John        Smith     2.4     AB
0   61 1994-05-05        NaT       Betty        Jones     1.2    NaN
4   63 2006-07-19 2006-07-21       James         Wood     NaN      A
5   63 2008-01-03        NaT       James         Wood     2.9    NaN
1   65 1998-10-08 1998-10-08         Tom        Plant     4.2     AA
2   66 2000-02-01        NaT       Helen         Kerr     1.1    NaN   

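It looks like a left join rather than a full outer join, so any rows in df2 that do not match a 'uid' and 'date' in df1 are lost (it is not clear from this simplified example, but I also need to add back the rows where the date is NaT).

Is there some way to achieve a lossless merge, either by somehow doing an outer join with merge_asof or by using another approach?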

python pandas dataframe merge timedelta
1 Answer

conditional_join (from pyjanitor) handles inequality joins and should help with your use case:

# pip install pyjanitor
import pandas as pd
import janitor  # importing janitor registers conditional_join on DataFrame

out = df1.conditional_join(
    # temporary columns on df2 that bound the window:
    # tx_date may fall up to 14 days before rx_date
    # (use rx_date + 14 days for `end` to also accept tx_date after rx_date)
    df2.assign(start=df2.rx_date - pd.Timedelta(days=14),
               end=df2.rx_date),
    # each condition: (column from the left, column from the right, operator)
    ('uid', 'uid', '=='),
    ('tx_date', 'start', '>='),
    ('tx_date', 'end', '<='),
    how='outer',
)
# rows found only in df2 have no left-hand values;
# fill the columns shared by both frames from the right-hand side
out['left'] = out['left'].fillna(out.right)
res = {'uid': out.left.uid,
       'tx_date': out.left.tx_date,
       'rx_date': out.right.rx_date,
       'first_name': out.left.first_name,
       'last_name': out.left.last_name,
       'meas_1': out.left.meas_1,
       'meas_2': out.right.meas_2}

pd.DataFrame(res).sort_values(['uid', 'tx_date'])

     uid              tx_date    rx_date first_name last_name  meas_1 meas_2
3   60.0  2004-01-11 00:00:00 2004-01-14      Smith      John     1.3      A
4   60.0  2016-12-24 00:00:00 2017-01-05      Smith      John     2.4     AB
7   60.0                  NaN 2017-03-31      Smith      John     NaN    NaN
0   61.0  1994-05-05 00:00:00        NaT      Jones     Betty     1.2    NaN
5   63.0  2006-07-19 00:00:00 2006-07-21       Wood     James     NaN      A
1   63.0  2008-01-03 00:00:00        NaT       Wood     James     2.9    NaN
8   64.0                  NaN 2002-04-18    Jackson      Bill     NaN      B
6   65.0  1998-10-08 00:00:00 1998-10-08      Plant       Tom     4.2     AA
9   65.0                  NaN 2005-12-01      Plant       Tom     NaN      B
2   66.0  2000-02-01 00:00:00        NaT       Kerr     Helen     1.1    NaN
10  66.0                  NaN 2013-12-14       Kerr     Helen     NaN      C
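If adding a dependency is not an option, a similar result can be sketched in plain pandas. This is a different technique from both merge_asof and conditional_join: it pairs every row with every same-uid row, keeps the pairs within the tolerance, and then appends the rows from each side that found no partner. The helper columns `l_id` and `r_id` are hypothetical names introduced here, the name columns are omitted for brevity, and the intermediate merge can grow large when a uid has many rows on both sides.

```python
import pandas as pd

df1 = pd.DataFrame({
    "uid": [60, 60, 61, 63, 63, 65, 66],
    "tx_date": pd.to_datetime(["2004-01-11", "2016-12-24", "1994-05-05", "2006-07-19",
                               "2008-01-03", "1998-10-08", "2000-02-01"]),
    "meas_1": [1.3, 2.4, 1.2, None, 2.9, 4.2, 1.1],
})
df2 = pd.DataFrame({
    "uid": [60, 60, 60, 63, 64, 65, 65, 66],
    "rx_date": pd.to_datetime(["2004-01-14", "2017-01-05", "2017-03-31", "2006-07-21",
                               "2002-04-18", "1998-10-08", "2005-12-01", "2013-12-14"]),
    "meas_2": ["A", "AB", None, "A", "B", "AA", "B", "C"],
})
tol = pd.Timedelta(days=14)

# Tag each row so that unmatched rows can be identified afterwards.
left = df1.assign(l_id=range(len(df1)))
right = df2.assign(r_id=range(len(df2)))

# All same-uid pairs, then only those whose absolute delta is within the tolerance.
# A comparison involving NaT is False, so rows with a missing date never pair up.
pairs = left.merge(right, on="uid")
pairs = pairs[(pairs["tx_date"] - pairs["rx_date"]).abs() <= tol]

# Rows from either side that found no partner.
left_only = left[~left["l_id"].isin(pairs["l_id"])]
right_only = right[~right["r_id"].isin(pairs["r_id"])]

out = (
    pd.concat([pairs, left_only, right_only], ignore_index=True)
    .drop(columns=["l_id", "r_id"])
    .sort_values(["uid", "tx_date", "rx_date"], na_position="last")
    .reset_index(drop=True)
)
print(out)
```

Unlike merge_asof, this keeps every pair that satisfies the condition, so a row with several partners inside the window appears once per partner.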
