Hive window functions: assigning group IDs to overlapping intervals

I have data in a Hive table like the following, where id, start_time, and end_time are all strings:

id   start_time   end_time
101  10:00        12:00
101  10:15        12:30
101  12:15        12:45
101  13:00        14:00
102  10:15        10:30

I want to create a new field, group_id, that identifies the records within each id whose start_time/end_time intervals overlap. The desired output is:

id   start_time   end_time   group_id
101  10:00        12:00      1
101  10:15        12:30      1
101  12:15        12:45      1
101  13:00        14:00      2
102  10:15        10:30      3

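For example, among the records for id 101, the first three overlap: the second overlaps the first because 10:15 (the second start time) falls between 10:00 and 12:00 (the first record's start and end times). The third overlaps the second because 12:15 (the third start time) falls between 10:15 and 12:30 (the second record's start and end times). The fourth record overlaps nothing, so it is assigned the next group ID (2). The last record has a different id and is alone in its group, so it is given the next ID (3).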

I tried comparing each record with the previous one, using the lag function to check whether they overlap:

select id, start_time,end_time,
    case when rownum_per_id = 1 THEN 'TRUE'
         when start_time between lag(start_time,1) over w and lag(end_time,1) over w THEN 'TRUE'
         ELSE 'FALSE' END as overlap_ind
from 
    (select id,start_time,end_time,
        row_number() over(partition by id order by start_time) as rownum_per_id
     from (select id,
             from_unixtime(unix_timestamp(start_time,"HH:mm")) as start_time,
             from_unixtime(unix_timestamp(end_time,"HH:mm")) as end_time
           from test_table
         ) a
    ) b
window w as (partition by id order by start_time)

The output is:

id  start_time          end_time            overlap_ind
101 1970-01-01 10:00:00 1970-01-01 12:00:00 TRUE
101 1970-01-01 10:15:00 1970-01-01 12:30:00 TRUE
101 1970-01-01 12:15:00 1970-01-01 12:45:00 TRUE
101 1970-01-01 13:00:00 1970-01-01 14:00:00 FALSE
102 1970-01-01 10:15:00 1970-01-01 10:30:00 TRUE

But I can't figure out the next step: assigning an incrementing group_id.

sql hiveql
1 Answer

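You can do this with plain string comparisons, since zero-padded HH:mm strings sort in chronological order. One method is to compute, for each row, the cumulative maximum end_time over the rows before it. Use that to determine where a group starts (a row whose start_time is later than every earlier end_time), then take a cumulative sum of those group starts: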

select t.*,
       sum(case when prev_end_time >= start_time then 0
                else 1 end) over (partition by id order by start_time) as group_id
from (select t.*,
             max(end_time) over (partition by id order by start_time rows between unbounded preceding and 1 preceding) as prev_end_time
      from finance_acturl_data_smith.test_table t
     ) t;
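
To make the running-max logic easier to follow, here is a minimal Python sketch that replays what the query above appears to do, row by row. It is not Hive code; the function name `group_ids_per_id` and the extra id 201 rows are made up for illustration, and it assumes the times are zero-padded HH:mm strings so that string comparison matches time order.

```python
from itertools import groupby

# Sample rows from the question: (id, start_time, end_time), all strings.
rows = [
    ("101", "10:00", "12:00"),
    ("101", "10:15", "12:30"),
    ("101", "12:15", "12:45"),
    ("101", "13:00", "14:00"),
    ("102", "10:15", "10:30"),
]

def group_ids_per_id(rows):
    """Replay the query: running max of earlier end_time, then a running sum of group starts."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))  # partition by id, order by start_time
    for _, part in groupby(ordered, key=lambda r: r[0]):
        prev_end = None  # max(end_time) over earlier rows of this id; None stands in for NULL
        group_id = 0     # the running sum restarts in every partition
        for rid, start, end in part:
            # In SQL, NULL >= start_time is not true, so the first row takes the ELSE 1 branch.
            if prev_end is None or prev_end < start:
                group_id += 1
            out.append((rid, start, end, group_id))
            prev_end = end if prev_end is None else max(prev_end, end)
    return out

for row in group_ids_per_id(rows):
    print(row)
```

On the sample data this gives group IDs 1, 1, 1, 2 for id 101 and 1 for id 102. The running maximum matters when one interval contains later ones: for hypothetical rows ("201", "10:00", "14:00"), ("201", "10:30", "11:00"), ("201", "12:00", "13:00"), comparing only with the immediately preceding row would split off the third row, while the running maximum (14:00) keeps all three in one group.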

Edit:

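The above produces a separate group_id sequence for each id, so the numbering restarts at 1 within every id. To number the groups across all ids, remove the partition by from the outer sum and order by id, start_time instead:

select t.*,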
       sum(case when prev_end_time >= start_time then 0
                else 1 end) over (order by id, start_time) as group_id
from (select t.*,
             max(end_time) over (partition by id order by start_time rows between unbounded preceding and 1 preceding) as prev_end_time
      from finance_acturl_data_smith.test_table t
     ) t;

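This works because the inner partition by is still used to define prev_end_time, so the first row of each id has a NULL prev_end_time and falls through to the else branch, which starts a new group.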
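
As a rough check of that reasoning, the following Python sketch replays the adjusted logic: the running maximum resets for each id, while the group counter keeps counting across ids. It is an illustration rather than Hive code; the function name `global_group_ids` is made up, and it assumes zero-padded HH:mm strings.

```python
# Sample rows from the question: (id, start_time, end_time), all strings.
rows = [
    ("101", "10:00", "12:00"),
    ("101", "10:15", "12:30"),
    ("101", "12:15", "12:45"),
    ("101", "13:00", "14:00"),
    ("102", "10:15", "10:30"),
]

def global_group_ids(rows):
    """Running max of end_time is per id; the running sum of group starts is global."""
    out = []
    group_id = 0       # never reset: mirrors sum(...) over (order by id, start_time)
    current_id = None
    prev_end = None    # None stands in for the NULL prev_end_time of each id's first row
    for rid, start, end in sorted(rows, key=lambda r: (r[0], r[1])):
        if rid != current_id:
            # New id: the inner window is partitioned by id, so prev_end_time starts as NULL.
            current_id, prev_end = rid, None
        if prev_end is None or prev_end < start:
            group_id += 1  # ELSE 1 branch: this row starts a new group
        out.append((rid, start, end, group_id))
        prev_end = end if prev_end is None else max(prev_end, end)
    return out

for row in global_group_ids(rows):
    print(row)
```

On the sample data this yields group IDs 1, 1, 1, 2, 3, matching the desired output in the question.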
