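I have multiple dataframes that I need to merge into a single dataset, based on a unique identifier (uid) and on the time delta between dates in each dataframe.
Here is a simplified example of the dataframes: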
df1
uid tx_date last_name first_name meas_1
0 60 2004-01-11 John Smith 1.3
1 60 2016-12-24 John Smith 2.4
2 61 1994-05-05 Betty Jones 1.2
3 63 2006-07-19 James Wood NaN
4 63 2008-01-03 James Wood 2.9
5 65 1998-10-08 Tom Plant 4.2
6 66 2000-02-01 Helen Kerr 1.1
df2
uid rx_date last_name first_name meas_2
0 60 2004-01-14 John Smith A
1 60 2017-01-05 John Smith AB
2 60 2017-03-31 John Smith NaN
3 63 2006-07-21 James Wood A
4 64 2002-04-18 Bill Jackson B
5 65 1998-10-08 Tom Plant AA
6 65 2005-12-01 Tom Plant B
7 66 2013-12-14 Helen Kerr C
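For anyone who wants to reproduce the problem, the two tables above can be built roughly as follows. This is only a sketch: the values are copied from the printed tables, the column labels follow the printed headers exactly as shown, and the date columns are assumed to be parsed to datetime64.

```python
import numpy as np
import pandas as pd

# Values copied from the tables above; labels follow the printed headers as-is.
df1 = pd.DataFrame({
    "uid": [60, 60, 61, 63, 63, 65, 66],
    "tx_date": pd.to_datetime([
        "2004-01-11", "2016-12-24", "1994-05-05", "2006-07-19",
        "2008-01-03", "1998-10-08", "2000-02-01",
    ]),
    "last_name": ["John", "John", "Betty", "James", "James", "Tom", "Helen"],
    "first_name": ["Smith", "Smith", "Jones", "Wood", "Wood", "Plant", "Kerr"],
    "meas_1": [1.3, 2.4, 1.2, np.nan, 2.9, 4.2, 1.1],
})

df2 = pd.DataFrame({
    "uid": [60, 60, 60, 63, 64, 65, 65, 66],
    "rx_date": pd.to_datetime([
        "2004-01-14", "2017-01-05", "2017-03-31", "2006-07-21",
        "2002-04-18", "1998-10-08", "2005-12-01", "2013-12-14",
    ]),
    "last_name": ["John", "John", "John", "James", "Bill", "Tom", "Tom", "Helen"],
    "first_name": ["Smith", "Smith", "Smith", "Wood", "Jackson", "Plant", "Plant", "Kerr"],
    "meas_2": ["A", "AB", np.nan, "A", "B", "AA", "B", "C"],
})

print(df1)
print(df2)
```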
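Essentially, I am trying to merge records for the same person from two different sources. The link between records for a unique individual is the 'uid', and the link between rows for each individual (where one exists) is a fuzzy relationship between 'tx_date' and 'rx_date' that can usually be accommodated by a specific time delta. There will not always be an exact or fuzzy match between dates, data may be missing from any column except 'uid', and each dataframe will contain a different but intersecting subset of 'uid' values.
I need to be able to join rows where the 'uid' columns match and where the absolute time delta between 'tx_date' and 'rx_date' is within a given range (e.g. a maximum delta of 14 days). If the time delta is outside that range, or either 'tx_date' or 'rx_date' is missing, or the 'uid' exists in only one of the dataframes, I still need to retain the data in that row. The end result should look something like this: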
uid tx_date rx_date first_name last_name meas_1 meas_2
0 60 2004-01-11 2004-01-14 John Smith 1.3 A
1 60 2016-12-24 2017-01-05 John Smith 2.4 AB
2 60 NaT 2017-03-31 John Smith NaN NaN
3 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood 2.9 NaN
6 64 NaT 2002-04-18 Bill Jackson NaN B
7 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
8 65 NaT 2005-12-01 Tom Plant NaN B
9 66 2000-02-01 NaT Helen Kerr 1.1 NaN
10 66 NaT 2013-12-14 Helen Kerr NaN C
It seems that pandas.merge_asof should be useful here, but I have not been able to get it to do quite what I need.
Trying merge_asof on the two real dataframes gave the error
ValueError: left keys must be sorted
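According to this question, the problem was actually caused by NaT values in the 'date' column of some rows. I dropped the rows with NaT values and sorted the 'date' column in each dataframe, but the result still is not quite what I need.
The code below shows the steps taken.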
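import pandas as pd
df1['date'] = df1['tx_date']
df1['date'] = pd.to_datetime(df1['date'])
# drop the rows whose key is NaT; assigning df1['date'].dropna() back to the column would leave them in place
df1 = df1.dropna(subset=['date'])
df1 = df1.sort_values('date')
df2['date'] = df2['rx_date']
df2['date'] = pd.to_datetime(df2['date'])
df2 = df2.dropna(subset=['date'])
df2 = df2.sort_values('date')
# direction='nearest' is needed to match an rx_date that falls after the tx_date;
# the default, 'backward', only considers rx_date values on or before tx_date
df_merged = (pd.merge_asof(df1, df2, on='date', by='uid', direction='nearest', tolerance=pd.Timedelta('14 days'))).sort_values('uid')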
Result:
uid tx_date rx_date last_name_x first_name_x meas_1 meas_2
3 60 2004-01-11 2004-01-14 John Smith 1.3 A
6 60 2016-12-24 2017-01-05 John Smith 2.4 AB
0 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood 2.9 NaN
1 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
2 66 2000-02-01 NaT Helen Kerr 1.1 NaN
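It looks like a left join rather than a full outer join, so any row in df2 that has no match on 'uid' and 'date' in df1 is lost (it is not clear from this simplified example, but I also need to add back the rows where the date is NaT).
Is there some way to achieve a lossless merge, either by somehow doing an outer join with merge_asof or by using some other approach?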
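The left-join behaviour can be seen on a cut-down pair of frames. This is a minimal sketch: the frame names `left` and `right` are hypothetical, the values are taken from the tables above, and `direction='nearest'` is assumed so that a right-hand date on either side of the left-hand date can match.

```python
import pandas as pd

# Two rows from each source; uid 64 exists only on the right-hand side.
left = pd.DataFrame({
    "uid": [60, 61],
    "date": pd.to_datetime(["2004-01-11", "1994-05-05"]),
})
right = pd.DataFrame({
    "uid": [60, 64],
    "date": pd.to_datetime(["2004-01-14", "2002-04-18"]),
    "meas_2": ["A", "B"],
})

merged = pd.merge_asof(
    left.sort_values("date"),
    right.sort_values("date"),
    on="date",
    by="uid",
    direction="nearest",
    tolerance=pd.Timedelta("14 days"),
)

# One output row per left-hand row; the unmatched right-hand uid is gone.
print(merged)
print(len(merged) == len(left))
print(64 in merged["uid"].values)
```

merge_asof always returns exactly one row per row of the left frame, which is why unmatched rows from the right frame have to be added back in a separate step.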
conditional_join handles inequality joins and should help with your use case:
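# pip install pyjanitor
import pandas as pd
import janitor  # importing janitor registers DataFrame.conditional_join

# assumes tx_date and rx_date are already datetime columns
out = (df1.conditional_join(
           # create temporary columns to handle the 14 day window
           # (as written, the window only covers a tx_date up to 14 days before rx_date;
           #  use end=df2.rx_date + pd.Timedelta(days=14) for a symmetric window)
           df2.assign(start=df2.rx_date - pd.Timedelta(days=14),
                      end=df2.rx_date),
           # column from the left, column from the right, operator
           ('uid', 'uid', '=='),
           ('tx_date', 'start', '>='),
           ('tx_date', 'end', '<='),
           how='outer')
      )
# rows that exist only in df2 have no left-hand values; fill the shared columns from the right
out['left'] = out['left'].fillna(out.right)
res = {'uid': out.left.uid,
       'tx_date': out.left.tx_date,
       'rx_date': out.right.rx_date,
       'first_name': out.left.first_name,
       'last_name': out.left.last_name,
       'meas_1': out.left.meas_1,
       'meas_2': out.right.meas_2}
pd.DataFrame(res).sort_values(['uid', 'tx_date'])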
uid tx_date rx_date first_name last_name meas_1 meas_2
3 60.0 2004-01-11 00:00:00 2004-01-14 Smith John 1.3 A
4 60.0 2016-12-24 00:00:00 2017-01-05 Smith John 2.4 AB
7 60.0 NaN 2017-03-31 Smith John NaN NaN
0 61.0 1994-05-05 00:00:00 NaT Jones Betty 1.2 NaN
5 63.0 2006-07-19 00:00:00 2006-07-21 Wood James NaN A
1 63.0 2008-01-03 00:00:00 NaT Wood James 2.9 NaN
8 64.0 NaN 2002-04-18 Jackson Bill NaN B
6 65.0 1998-10-08 00:00:00 1998-10-08 Plant Tom 4.2 AA
9 65.0 NaN 2005-12-01 Plant Tom NaN B
2 66.0 2000-02-01 00:00:00 NaT Kerr Helen 1.1 NaN
10 66.0 NaN 2013-12-14 Kerr Helen NaN C
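If adding a dependency is not an option, a similar result can probably be obtained with pandas alone. The sketch below is not the conditional_join approach above. It swaps in a plain merge on 'uid' followed by a filter on the absolute date difference, and then appends the rows that found no partner. The function name `outer_merge_within`, the helper columns `_l` and `_r`, and the `_dup` suffix are all hypothetical, and the data is a subset of the question's tables with only one name column kept for brevity.

```python
import pandas as pd

def outer_merge_within(df1, df2, tol=pd.Timedelta(days=14)):
    """Full outer merge on uid, pairing rows whose dates differ by at most tol."""
    # Hypothetical helper columns that remember each row's position
    left = df1.assign(_l=range(len(df1)))
    right = df2.assign(_r=range(len(df2)))
    # Non-key columns present in both frames (e.g. names)
    shared = [c for c in df1.columns if c in df2.columns and c != "uid"]

    # Every same-uid combination, then keep only those inside the window.
    # A NaT date makes the comparison False, so such rows are never paired.
    pairs = left.merge(right, on="uid", suffixes=("", "_dup"))
    close = (pairs["tx_date"] - pairs["rx_date"]).abs() <= tol
    pairs = pairs[close].copy()
    for c in shared:
        pairs[c] = pairs[c].fillna(pairs[c + "_dup"])

    # Rows from either side that found no partner are appended unchanged
    left_only = left[~left["_l"].isin(pairs["_l"])]
    right_only = right[~right["_r"].isin(pairs["_r"])]
    out = pd.concat([pairs, left_only, right_only], ignore_index=True)
    out = out.drop(columns=["_l", "_r"] + [c + "_dup" for c in shared])
    return out.sort_values(["uid", "tx_date", "rx_date"], ignore_index=True)

# Subset of the question's data
df1 = pd.DataFrame({
    "uid": [60, 60, 61],
    "tx_date": pd.to_datetime(["2004-01-11", "2016-12-24", "1994-05-05"]),
    "last_name": ["John", "John", "Betty"],
    "meas_1": [1.3, 2.4, 1.2],
})
df2 = pd.DataFrame({
    "uid": [60, 60, 60, 64],
    "rx_date": pd.to_datetime(["2004-01-14", "2017-01-05", "2017-03-31", "2002-04-18"]),
    "last_name": ["John", "John", "John", "Bill"],
    "meas_2": ["A", "AB", None, "B"],
})

out = outer_merge_within(df1, df2)
print(out)
```

Two caveats: the merge on 'uid' builds every same-uid combination before filtering, so it can get large when a uid has many rows, and a tx_date with several rx_date values inside the window will appear once per match. Unlike the output above, 'uid' should keep its integer dtype here because no row ever lacks it.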