How can I use Hive to make this computation more efficient?

Question · Votes: 1 · Answers: 1

The following query runs on Hive:

select day,count(mdn)*5 as number from
(select distinct a.mdn,a.day from 
flow a
left outer join
flow b
on a.day=date_add(b.day,-1) and a.mdn=b.mdn
left outer join
flow c
on a.day=date_add(c.day,-2) and a.mdn=c.mdn
left outer join
flow d
on a.day=date_add(d.day,-3) and a.mdn=d.mdn
where b.mdn is null  and c.mdn is null  and d.mdn is null)t 
group by day

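The query selects each mdn that appears on a given day but not on any of the following three days, then counts those mdns per day. It is very inefficient, though, because it joins the same large table, flow, to itself three times. How can it be simplified to run efficiently?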

sql database hive pyspark

1 Answer

1 vote

You can use lead() to look at the next day on which each mdn appears and compare the dates:

select f.*
from (select f.*,
             lead(f.day) over (partition by f.mdn order by f.day) as next_day
      from flow f
     ) f
where next_day > date_add(day, 3) or next_day is null;
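
The query above returns the matching rows but stops short of the per-day count asked for in the question. A minimal sketch of the full aggregation might look like the following. It assumes that `day` is a DATE or a 'yyyy-MM-dd' string, and that flow may hold several rows per mdn and day (hence the `select distinct`, mirroring the question). The factor 5 and the output column name come from the question's query; the alias `t` and the overall shape are illustrative rather than part of the original answer.

```sql
-- Sketch: per-day count of mdns that do not reappear within the next three days.
-- Assumption: flow.day is a DATE or a 'yyyy-MM-dd' string, so date_add() and the
-- comparison below behave as date comparisons.
select t.day,
       count(t.mdn) * 5 as number            -- factor 5 taken from the question's query
from (select f.mdn,
             f.day,
             -- next day on which the same mdn appears, or NULL if there is none
             lead(f.day) over (partition by f.mdn order by f.day) as next_day
      from (select distinct mdn, day from flow) f   -- one row per mdn and day
     ) t
where t.next_day is null                     -- mdn never appears again
   or t.next_day > date_add(t.day, 3)        -- next appearance is more than 3 days later
group by t.day;
```

This reads flow once instead of joining it to itself three times. The window function still needs the data partitioned by mdn, so it is worth comparing the execution plans on real data before adopting it.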

