How to add an if or case condition to the Spark SQL lag function

Problem description

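I need to add a condition to the Spark SQL lag function.

My data has an id and a date, and for each row I want the most recent non-null date that precedes it.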

id,date
er1,2018-01-19
er1,null
er1,2018-02-10
er2,2018-11-11
er2,null
er2,null
er2,null

select Id, date,
       lag(date) over (PARTITION BY id order by date) as last_date
from mytable

id,date,last_date
er1,2018-01-19,null
er1,null,2018-01-19
er1,2018-02-10,null
er2,2018-11-11,null
er2,null,2018-11-11
er2,null,null
er2,null,null

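But the date column contains nulls, and what I actually want as last_date is the last non-null date, so the second argument of lag (the offset) is not fixed. I tried adding a column that counts the nulls in the preceding rows, and I also tried dropping the null rows and joining the result back, but is there a better solution?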

This is the result I want:

id,date,last_date
er1,2018-01-19,null
er1,null,2018-01-19
er1,2018-02-10,2018-01-19
er2,2018-11-11,null
er2,null,2018-11-11
er2,null,2018-11-11
er2,null,2018-11-11
sql apache-spark
1 Answer

The standard lag() function has an ignore nulls option:

select Id, date,
       lag(date ignore nulls) over (PARTITION BY id order by date) as last_date
from mytable;

However, not all databases support this. You can emulate it with a subquery:

select Id, date,
       min(date) over (partition by id, grp order by date) as last_date
from (select t.*,
             count(date) over (partition by id order by date) as grp
      from mytable t
     ) t
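
The grouping idea above can be tried out end to end with a small sketch. Two things here are assumptions and not part of the original question: the data gets a hypothetical `seq` column that records row order within each id (ordering by `date` alone does not say where the null rows belong, and Spark sorts nulls first in ascending order by default), and SQLite, through Python's `sqlite3` module, stands in for Spark so the query can run without a cluster. The sketch also adds a final `lag` step, because taking `min(date)` within a group gives a non-null row its own date instead of the previous one.

```python
import sqlite3

# Sample data from the question, plus a hypothetical `seq` column
# (an assumption) that fixes the row order within each id.
ROWS = [
    ("er1", 1, "2018-01-19"),
    ("er1", 2, None),
    ("er1", 3, "2018-02-10"),
    ("er2", 1, "2018-11-11"),
    ("er2", 2, None),
    ("er2", 3, None),
    ("er2", 4, None),
]

QUERY = """
select id, date,
       -- step 3: shift the filled value down one row, so each row sees
       -- the last non-null date strictly before it
       lag(filled) over (partition by id order by seq) as last_date
from (select g.*,
             -- step 2: each group holds at most one non-null date;
             -- spread it over every row of the group
             max(date) over (partition by id, grp) as filled
      from (select t.*,
                   -- step 1: running count of non-null dates, which
                   -- increases by one at every non-null row
                   count(date) over (partition by id order by seq) as grp
            from mytable t
           ) g
     ) f
order by id, seq
"""


def last_non_null_dates(rows):
    """Return (id, date, last_date) tuples for the given (id, seq, date) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table mytable (id text, seq integer, date text)")
    conn.executemany("insert into mytable values (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in last_non_null_dates(ROWS):
        print(row)
```

The three window functions used here (`count`, `max`, `lag`) are also available in Spark SQL, so the query text should carry over once the table has a real ordering column, though that has not been checked against a Spark cluster here.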