Create a new column based on timestamps from another dataset

Question

Suppose we have two files: prices and transactions. Prices has two columns, price and publishedTime, indicating when each price was published, like this:

price   publishedTime
5.05    2020-01-01 11:00:06.122356 
9.87    2020-01-01 11:00:05.289655
6.37    2020-01-01 11:00:05.111234
8.22    2020-01-01 11:00:04.242103
... (millions of rows)

Transactions has two columns, transactionID and transactionTime:

transactionID    transactionTime
1001             2020-01-01 11:00:07.005477
2001             2020-01-01 11:00:06.110982
3005             2020-01-01 11:00:05.175564
4002             2020-01-01 11:00:05.152234
... (millions of rows)

For each transactionID, we want to find the most recent price whose publishedTime is earlier than or equal to its transactionTime. For example, for the transactions above, the output should look like this:

transactionID transactionTime               price
1001          2020-01-01 11:00:07.005477    5.05
2001          2020-01-01 11:00:06.110982    9.87
3005          2020-01-01 11:00:05.175564    6.37
4002          2020-01-01 11:00:05.152234    6.37
... (millions of rows)
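This is a classic "as-of" join. For comparison with the SQL and Spark approaches discussed here, pandas implements it directly via merge_asof; a minimal sketch on the sample rows above, assuming the data fits in memory:

```python
import pandas as pd

prices = pd.DataFrame({
    "price": [5.05, 9.87, 6.37, 8.22],
    "publishedTime": pd.to_datetime([
        "2020-01-01 11:00:06.122356",
        "2020-01-01 11:00:05.289655",
        "2020-01-01 11:00:05.111234",
        "2020-01-01 11:00:04.242103",
    ]),
})
transactions = pd.DataFrame({
    "transactionID": [1001, 2001, 3005, 4002],
    "transactionTime": pd.to_datetime([
        "2020-01-01 11:00:07.005477",
        "2020-01-01 11:00:06.110982",
        "2020-01-01 11:00:05.175564",
        "2020-01-01 11:00:05.152234",
    ]),
})

# merge_asof requires both frames sorted ascending on their join keys.
# direction="backward" picks the last price published at or before the
# transaction time, i.e. publishedTime <= transactionTime.
result = pd.merge_asof(
    transactions.sort_values("transactionTime"),
    prices.sort_values("publishedTime"),
    left_on="transactionTime",
    right_on="publishedTime",
    direction="backward",
)
print(result[["transactionID", "transactionTime", "price"]])
```

This avoids materializing the full union, at the cost of requiring both inputs sorted and in memory; for millions of rows it is still typically fast, since the merge itself is a single linear pass.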

I solved this by unioning prices and transactions, sorting by the timestamp column in descending order, and then walking the whole array with a tail-recursive function to pick up the first price adjacent to each transaction. My question is open-ended: what are alternative or "better" solutions to this problem? SQL, Spark, etc.?

sql sorting apache-spark mapreduce
1 Answer

This is quite tricky in Spark. I think a union all approach with window functions will work:

with t as (
      -- interleave both tables into one stream of timestamped rows
      select t.transactionTime, t.transactionId, null as price
      from transactions t
      union all
      select p.publishedTime, null as transactionId, p.price
      from prices p
     )
select t.transactionTime, t.transactionId, t.price
from (select transactionTime, transactionId,
             -- each group contains exactly one price row; spread it to the group
             max(price) over (partition by grp) as price
      from (select t.*,
                   -- running count of non-null prices defines the group
                   count(price) over (order by transactionTime) as grp
            from t
           ) t
     ) t
where transactionId is not null;

This interleaves the rows from the two tables, assigns each price to all the appropriate transactions, and then filters back down to just the transactions.
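To see why the counting trick works, the same logic can be sketched in plain Python: after sorting the interleaved rows by timestamp, every transaction falls into the "group" opened by the most recent price row, which is exactly what `count(price) over (order by transactionTime)` computes. A minimal sketch (ISO-format timestamp strings sort correctly lexicographically):

```python
prices = [
    (5.05, "2020-01-01 11:00:06.122356"),
    (9.87, "2020-01-01 11:00:05.289655"),
    (6.37, "2020-01-01 11:00:05.111234"),
    (8.22, "2020-01-01 11:00:04.242103"),
]
transactions = [
    (1001, "2020-01-01 11:00:07.005477"),
    (2001, "2020-01-01 11:00:06.110982"),
    (3005, "2020-01-01 11:00:05.175564"),
    (4002, "2020-01-01 11:00:05.152234"),
]

# union all: each row is (timestamp, transactionID or None, price or None)
rows = [(ts, None, p) for p, ts in prices] + \
       [(ts, tid, None) for tid, ts in transactions]

# sort ascending; on equal timestamps put price rows first, so a price
# published exactly at transactionTime is still eligible (<=, not <)
rows.sort(key=lambda r: (r[0], r[2] is None))

# each price row starts a new group; every transaction in that group
# inherits its price (the max(price) over (partition by grp) step)
result = {}
last_price = None
for ts, tid, price in rows:
    if price is not None:
        last_price = price
    else:
        result[tid] = last_price

print(result)
```

A transaction that precedes every price ends up with None, matching the SQL version, where its group contains no price row and the window max is null.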
