Need to add a condition to the lag function in Spark SQL
My data has an id and a date, and for each row I want the closest previous non-null date.
id,date
er1,2018-01-19
er1,null
er1,2018-02-10
er2,2018-11-11
er2,null
er2,null
er2,null
select Id, date,
       lag(date) over (partition by id order by date) as last_date
from mytable
id,date,last_date
er1,2018-01-19,null
er1,null,2018-01-19
er1,2018-02-10,null
er2,2018-11-11,null
er2,null,2018-11-11
er2,null,null
er2,null,null
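But the date column contains nulls, and what I actually want as last_date is the last non-null date, so the second argument of lag (the offset) is not fixed. I tried adding a column that counts the nulls in the preceding rows, and also removing the null rows and joining the result back, but is there a better solution?
I want to get this: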
id,date,last_date
er1,2018-01-19,null
er1,null,2018-01-19
er1,2018-02-10,2018-01-19
er2,2018-11-11,null
er2,null,2018-11-11
er2,null,2018-11-11
er2,null,2018-11-11
The standard lag() function has an ignore nulls option:
select Id, date,
       lag(date) ignore nulls over (partition by id order by date) as last_date
from mytable;
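To make the intended semantics concrete, here is a minimal Python sketch of what a lag that ignores nulls is expected to return within one partition: for each row, the most recent non-null value among the rows before it. The function name `lag_ignore_nulls` is hypothetical, and the input list stands in for the dates of a single id, already in the intended row order.

```python
def lag_ignore_nulls(values):
    """For each position, return the latest non-None value strictly before it.

    Hypothetical helper: `values` models one partition (one id), in row order.
    """
    result = []
    last_seen = None  # most recent non-null value seen so far
    for value in values:
        result.append(last_seen)   # only earlier rows count
        if value is not None:
            last_seen = value      # nulls never overwrite the remembered value
    return result


er1 = lag_ignore_nulls(["2018-01-19", None, "2018-02-10"])
er2 = lag_ignore_nulls(["2018-11-11", None, None, None])
```

With the sample data from the question, `er1` and `er2` correspond to the last_date column the asker wants.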
However, not all databases support ignore nulls. You can emulate it with a subquery:
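select Id, date,
       -- every row in a group shares the group's single non-null date
       min(date) over (partition by id, grp order by date) as last_date
from (select t.*,
             -- running count of non-null dates: each dated row starts a new group
             count(date) over (partition by id order by date) as grp
      from mytable t
     ) t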
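The grouping idea can be tried end to end with the sketch below, which runs a variant of the query against an in-memory SQLite database from Python. Several things here are assumptions rather than part of the original post: SQLite stands in for Spark (window functions need SQLite 3.25 or later), and the table gets a hypothetical `seq` column that records row order, because ordering by `date` alone cannot place the null rows between the dated ones. The sketch also uses `max` rather than `min` (each group holds a single non-null date, so either works) and wraps the result in a plain `lag`, so that a dated row receives the previous date rather than its own.

```python
import sqlite3

# Sample rows from the question; `seq` is a hypothetical ordering column.
ROWS = [
    ("er1", 1, "2018-01-19"),
    ("er1", 2, None),
    ("er1", 3, "2018-02-10"),
    ("er2", 1, "2018-11-11"),
    ("er2", 2, None),
    ("er2", 3, None),
    ("er2", 4, None),
]

QUERY = """
SELECT id, date,
       -- step 3: shift by one row so a row never sees its own date
       lag(filled) OVER (PARTITION BY id ORDER BY seq) AS last_date
FROM (SELECT id, seq, date,
             -- step 2: spread each group's single non-null date over the group
             max(date) OVER (PARTITION BY id, grp) AS filled
      FROM (SELECT id, seq, date,
                   -- step 1: running count of non-null dates defines the groups
                   count(date) OVER (PARTITION BY id ORDER BY seq) AS grp
            FROM mytable) AS counted) AS filled_rows
ORDER BY id, seq
"""


def last_non_null_dates(rows):
    """Load rows into an in-memory table and return (id, date, last_date) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id TEXT, seq INTEGER, date TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = last_non_null_dates(ROWS)
for row in result:
    print(row)
```

The same three-step shape (count, fill, lag) should carry over to Spark SQL as long as the table has a real column to order by in place of `seq`.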