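I have a type 2 dimension in an Impala table. The table has ~500M rows and 102 columns: (C1, C2, ..., C8, ..., C100, EFF_DT, EXP_DT). I need to select only the rows with distinct combinations of the values (C1, C2, ..., C8). For each selected record, EFF_DT and EXP_DT must be the min(eff_dt) and max(exp_dt), respectively, of the group the record belongs to (a group here is defined by a distinct combination of (C1, C2, ..., C8)).
A simple GROUP BY does not solve this, because it ignores the time gaps between separate runs of the same group.
For simplicity, assume that only 2 columns define a group (instead of 8). Here is an example of the input, the desired output, and the output produced by a plain GROUP BY: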
--INPUT
C1  C2  EFF_DT      EXP_DT
4   8   2013-11-30  2014-01-22
2   8   2014-01-23  2014-01-23
4   8   2014-01-24  2015-12-31
4   8   2016-01-01  2016-12-31
4   8   2017-01-01  2018-03-15
4   8   2018-03-16  2018-07-24
4   8   2018-07-25  2999-12-31

--DESIRED OUTPUT
C1  C2  EFF_DT      EXP_DT
4   8   2013-11-30  2014-01-22
2   8   2014-01-23  2014-01-23
4   8   2014-01-24  2999-12-31

--OUTPUT of SIMPLE GROUP BY
C1  C2  EFF_DT      EXP_DT
4   8   2013-11-30  2999-12-31
2   8   2014-01-23  2014-01-23
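The collapse shown in the last table can be reproduced with a short script. This is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for Impala, and the table name t is a hypothetical placeholder, not the real table.

```python
import sqlite3

# Sample rows from the example above: (c1, c2, eff_dt, exp_dt).
ROWS = [
    (4, 8, "2013-11-30", "2014-01-22"),
    (2, 8, "2014-01-23", "2014-01-23"),
    (4, 8, "2014-01-24", "2015-12-31"),
    (4, 8, "2016-01-01", "2016-12-31"),
    (4, 8, "2017-01-01", "2018-03-15"),
    (4, 8, "2018-03-16", "2018-07-24"),
    (4, 8, "2018-07-25", "2999-12-31"),
]


def simple_group_by(rows):
    """Run a plain GROUP BY on (c1, c2) over an in-memory table t (hypothetical name)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table t (c1 integer, c2 integer, eff_dt text, exp_dt text)")
        conn.executemany("insert into t values (?, ?, ?, ?)", rows)
        return conn.execute(
            "select c1, c2, min(eff_dt), max(exp_dt) "
            "from t group by c1, c2 order by min(eff_dt)"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in simple_group_by(ROWS):
        print(row)
```

Both runs of (4, 8) end up in a single row spanning 2013-11-30 to 2999-12-31, even though (2, 8) was in effect in between.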
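I tried using a subquery in the SELECT list to pick max(exp_dt) based on the current row, but that did not work because Impala does not support it.
Here is the query I tried. The query itself works, but it fails in Impala, because subqueries are not supported in the SELECT list: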
SELECT T0.C1,
       T0.C2,
       MIN(T0.EFF_DT) AS MIN_EFF_DT,
       T0.EXP_DT
FROM (
    SELECT T1.C1,
           T1.C2,
           T1.EFF_DT,
           (
               SELECT MAX(T2.EXP_DT)
               FROM (select * from TABLE_NAME) T2
               WHERE T2.C1 = T1.C1
                 AND T2.C2 = T1.C2
                 AND NOT EXISTS (
                     SELECT 1
                     FROM (select * from TABLE_NAME) T3
                     WHERE T3.EXP_DT < T2.EXP_DT
                       AND T3.EXP_DT > T1.EXP_DT
                       AND (T3.C1 <> T2.C1 OR T3.C2 <> T2.C2)
                 )
           ) EXP_DT
    FROM (select * from TABLE_NAME) T1
) T0
GROUP BY T0.C1, T0.C2, T0.EXP_DT
ORDER BY MIN_EFF_DT ASC
Most likely, the previous solution will work once it is modified to account for the id column:
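select id, c1, c2,
       min(eff_dt) as min_eff_dt,
       max(exp_dt) as max_exp_dt
from (select t.*,
             -- position of the row within its id, ordered by date
             row_number() over (partition by id order by eff_dt) as seqnum,
             -- position of the row within its (id, c1, c2) combination
             row_number() over (partition by id, c1, c2 order by eff_dt) as seqnum_1
      from t
     ) t
-- seqnum - seqnum_1 is constant within each consecutive run of identical (c1, c2)
group by id, c1, c2, (seqnum - seqnum_1);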
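To check the difference-of-row-numbers pattern against the sample rows from the question, here is a minimal sketch. It makes several assumptions: Python's built-in sqlite3 module stands in for Impala (SQLite 3.25 or later is needed for window functions), the id column is dropped because the sample data has none, and the table name t is a hypothetical placeholder.

```python
import sqlite3

# Sample rows from the question: (c1, c2, eff_dt, exp_dt).
ROWS = [
    (4, 8, "2013-11-30", "2014-01-22"),
    (2, 8, "2014-01-23", "2014-01-23"),
    (4, 8, "2014-01-24", "2015-12-31"),
    (4, 8, "2016-01-01", "2016-12-31"),
    (4, 8, "2017-01-01", "2018-03-15"),
    (4, 8, "2018-03-16", "2018-07-24"),
    (4, 8, "2018-07-25", "2999-12-31"),
]

# Same pattern as the query above, without the id column (assumption).
# seqnum numbers all rows by date; seqnum_1 numbers rows per (c1, c2).
# Their difference stays constant inside one consecutive run of a group.
QUERY = """
select c1, c2, min(eff_dt) as min_eff_dt, max(exp_dt) as max_exp_dt
from (select t.*,
             row_number() over (order by eff_dt) as seqnum,
             row_number() over (partition by c1, c2 order by eff_dt) as seqnum_1
      from t
     ) t
group by c1, c2, (seqnum - seqnum_1)
order by min_eff_dt
"""


def collapse_runs(rows):
    """Load rows into an in-memory table t and collapse consecutive runs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table t (c1 integer, c2 integer, eff_dt text, exp_dt text)")
        conn.executemany("insert into t values (?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in collapse_runs(ROWS):
        print(row)
```

On the sample data this yields three rows, matching the desired output: the first (4, 8) run, the single (2, 8) row, and the second (4, 8) run ending at 2999-12-31. For the real table, the partition and group by lists would presumably name all eight columns C1 through C8 instead of two.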