Create a new column based on timestamps from another dataset

Question

Suppose we have two files: prices and transactions. Prices has two columns, price and publishedTime, indicating when each price was published, like this:

price   publishedTime
5.05    2020-01-01 11:00:06.122356 
9.87    2020-01-01 11:00:05.289655
6.37    2020-01-01 11:00:05.111234
8.22    2020-01-01 11:00:04.242103
... (millions of rows)

Transactions has two columns, transactionID and transactionTime:

transactionID    transactionTime
1001             2020-01-01 11:00:07.005477
2001             2020-01-01 11:00:06.110982
3005             2020-01-01 11:00:05.175564
4002             2020-01-01 11:00:05.152234
... (millions of rows)

For each transactionID, we want to find the most recent price whose publishedTime is earlier than or equal to its transactionTime. For example, for the transactions above, the output should look like this:

transactionID transactionTime               price
1001          2020-01-01 11:00:07.005477    5.05
2001          2020-01-01 11:00:06.110982    9.87
3005          2020-01-01 11:00:05.175564    6.37
4002          2020-01-01 11:00:05.152234    6.37
... (millions of rows)
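This is a classic "as-of" join. For comparison with the SQL and Spark approaches discussed here, pandas implements it directly via merge_asof; a minimal sketch on the sample rows above, assuming the data fits in memory:

```python
import pandas as pd

prices = pd.DataFrame({
    "price": [5.05, 9.87, 6.37, 8.22],
    "publishedTime": pd.to_datetime([
        "2020-01-01 11:00:06.122356",
        "2020-01-01 11:00:05.289655",
        "2020-01-01 11:00:05.111234",
        "2020-01-01 11:00:04.242103",
    ]),
})
transactions = pd.DataFrame({
    "transactionID": [1001, 2001, 3005, 4002],
    "transactionTime": pd.to_datetime([
        "2020-01-01 11:00:07.005477",
        "2020-01-01 11:00:06.110982",
        "2020-01-01 11:00:05.175564",
        "2020-01-01 11:00:05.152234",
    ]),
})

# merge_asof requires both frames sorted ascending on their join keys.
# direction="backward" picks the last price published at or before the
# transaction time, i.e. publishedTime <= transactionTime.
result = pd.merge_asof(
    transactions.sort_values("transactionTime"),
    prices.sort_values("publishedTime"),
    left_on="transactionTime",
    right_on="publishedTime",
    direction="backward",
)
print(result[["transactionID", "transactionTime", "price"]])
```

This avoids materializing the full union, at the cost of requiring both inputs sorted and in memory; for millions of rows it is still typically fast, since the merge itself is a single linear pass.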

I solved this by unioning prices and transactions, sorting by the timestamp column in descending order, and then walking the whole array with a tail-recursive function to pick up the first price adjacent to each transaction. My question is open-ended: what are alternative or "better" solutions to this problem? SQL, Spark, etc.?

sql sorting apache-spark mapreduce
1 Answer

This is quite tricky in Spark. I think a union all approach with window functions will work:

with t as (
      -- interleave both tables into one stream of timestamped rows
      select t.transactionTime, t.transactionId, null as price
      from transactions t
      union all
      select p.publishedTime, null as transactionId, p.price
      from prices p
     )
select t.transactionTime, t.transactionId, t.price
from (select transactionTime, transactionId,
             -- each group contains exactly one price row; spread it to the group
             max(price) over (partition by grp) as price
      from (select t.*,
                   -- running count of non-null prices defines the group
                   count(price) over (order by transactionTime) as grp
            from t
           ) t
     ) t
where transactionId is not null;

This interleaves the rows from the two tables, assigns each price to all the appropriate transactions, and then filters back down to just the transactions.
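To see why the counting trick works, the same logic can be sketched in plain Python: after sorting the interleaved rows by timestamp, every transaction falls into the "group" opened by the most recent price row, which is exactly what `count(price) over (order by transactionTime)` computes. A minimal sketch (ISO-format timestamp strings sort correctly lexicographically):

```python
prices = [
    (5.05, "2020-01-01 11:00:06.122356"),
    (9.87, "2020-01-01 11:00:05.289655"),
    (6.37, "2020-01-01 11:00:05.111234"),
    (8.22, "2020-01-01 11:00:04.242103"),
]
transactions = [
    (1001, "2020-01-01 11:00:07.005477"),
    (2001, "2020-01-01 11:00:06.110982"),
    (3005, "2020-01-01 11:00:05.175564"),
    (4002, "2020-01-01 11:00:05.152234"),
]

# union all: each row is (timestamp, transactionID or None, price or None)
rows = [(ts, None, p) for p, ts in prices] + \
       [(ts, tid, None) for tid, ts in transactions]

# sort ascending; on equal timestamps put price rows first, so a price
# published exactly at transactionTime is still eligible (<=, not <)
rows.sort(key=lambda r: (r[0], r[2] is None))

# each price row starts a new group; every transaction in that group
# inherits its price (the max(price) over (partition by grp) step)
result = {}
last_price = None
for ts, tid, price in rows:
    if price is not None:
        last_price = price
    else:
        result[tid] = last_price

print(result)
```

A transaction that precedes every price ends up with None, matching the SQL version, where its group contains no price row and the window max is null.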
