我在Hive中有以下数据:
id sequence app time1 time2 first_d_seq last_d_seq
2456 1 a 10/11/2018 10:25:43 10/11/2018 10:25:47 5 6
2456 2 b 10/11/2018 10:25:48 10/11/2018 10:25:55 5 6
2456 3 b 10/11/2018 10:25:58 10/11/2018 10:26:02 5 6
2456 4 c 10/11/2018 10:26:02 10/11/2018 10:26:08 5 6
2456 5 d 10/11/2018 10:26:08 10/11/2018 10:26:13 5 6
2456 6 d 10/11/2018 10:26:15 10/11/2018 10:26:20 5 6
2456 7 f 10/11/2018 10:26:20 10/11/2018 10:26:28 5 6
2456 8 f 10/11/2018 10:26:32 10/11/2018 10:26:39 5 6
9702 1 a 10/11/2018 11:05:14 10/11/2018 11:05:16 3 3
9702 2 b 10/11/2018 11:05:16 10/11/2018 11:05:20 3 3
9702 3 d 10/11/2018 11:05:20 10/11/2018 11:05:25 3 3
9702 4 h 10/11/2018 11:05:25 10/11/2018 11:05:27 3 3
9702 5 f 10/11/2018 11:05:27 10/11/2018 11:05:36 3 3
我知道app d
在每个id
组的序列中开始和结束的位置(即,第一组d
从序列= 5开始,到序列= 6结束)。
对于每个id
组,我想计算的是1)从一开始(sequence=1
)到d
(sequence = first_d_seq - 1
)首次出现所花费的时间,以及2)从d
(sequence = last_d_seq + 1
)之后直到结束所花费的时间该id的序列(即,id = 2456的8
; id = 9702的5
)。
基本上,输出应该是这样的:
id before_d after_d
2456 25 19
9702 6 11
在同事的帮助下,我们找到了以下解决方案:
with x as ( --raw data table
select stack(13,
2456,1,'a','10/11/2018 10:25:43','10/11/2018 10:25:47',5,6,
2456,2,'b','10/11/2018 10:25:48','10/11/2018 10:25:55',5,6,
2456,3,'b','10/11/2018 10:25:58','10/11/2018 10:26:02',5,6,
2456,4,'c','10/11/2018 10:26:02','10/11/2018 10:26:08',5,6,
2456,5,'d','10/11/2018 10:26:08','10/11/2018 10:26:13',5,6,
2456,6,'d','10/11/2018 10:26:15','10/11/2018 10:26:20',5,6,
2456,7,'f','10/11/2018 10:26:20','10/11/2018 10:26:28',5,6,
2456,8,'f','10/11/2018 10:26:32','10/11/2018 10:26:39',5,6,
9702,1,'a','10/11/2018 11:05:14','10/11/2018 11:05:16',3,3,
9702,2,'b','10/11/2018 11:05:16','10/11/2018 11:05:20',3,3,
9702,3,'d','10/11/2018 11:05:20','10/11/2018 11:05:25',3,3,
9702,4,'h','10/11/2018 11:05:25','10/11/2018 11:05:27',3,3,
9702,5,'f','10/11/2018 11:05:27','10/11/2018 11:05:36',3,3
) as (id,sequence,app,time1,time2,first_d_seq,last_d_seq)
) -- select * from x
,
y as (
select id,
min(sequence)-1 as min_seq,
max(sequence)+1 as max_seq
from x
where app='d'
group by id
) -- select * from y
,
z as (
select id,
min(time1) as mintime_all,
max(time2) as maxtime_all
--min(unix_timestamp(startdtm, 'MM/dd/yyyy HH:mm:ss')) as mintime_all,
--max(unix_timestamp(enddtm, 'MM/dd/yyyy HH:mm:ss')) as maxtime_all
from x
group by id
) --select * from z
select x1.id,
( unix_timestamp(x1.time2, 'MM/dd/yyyy HH:mm:ss') - unix_timestamp(z.mintime_all, 'MM/dd/yyyy HH:mm:ss') ) as before_d,
( unix_timestamp(z.maxtime_all, 'MM/dd/yyyy HH:mm:ss') - unix_timestamp(x2.time1, 'MM/dd/yyyy HH:mm:ss') ) as after_d
from y as y
inner join x as x1 on x1.id=y.id and x1.sequence=y.min_seq
inner join x as x2 on x2.id=y.id and x2.sequence=y.max_seq
inner join z as z on y.id=z.id;
这将产生预期的答案:
id before_d after_d
2456 25 19
9702 6 11