Hive window functions: assigning a group ID to overlapping groups

Question

My data in a Hive table looks like the following, where id, start_time and end_time are all strings:

id   start_time   end_time
101  10:00        12:00
101  10:15        12:30
101  12:15        12:45
101  13:00        14:00
102  10:15        10:30

I want to create a new field, group_id, that identifies records within each id whose start_time and end_time intervals overlap. The desired output is:

id   start_time   end_time group_id
101  10:00        12:00     1
101  10:15        12:30     1
101  12:15        12:45     1
101  13:00        14:00     2
102  10:15        10:30     3

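For example, among the records for id 101, the first three overlap: the second overlaps the first because 10:15 (the second record's start time) falls between 10:00 and 12:00 (the first record's start and end times). The third overlaps the second because 12:15 (the third record's start time) falls between 10:15 and 12:30 (the second record's start and end times). The fourth record overlaps nothing, so it is assigned the next group ID (2). The last record has a different id and is alone in its group, so it gets the next ID (3).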

I tried comparing each record with the previous one, using the lag function to check whether they overlap:

select id, start_time,end_time,
    case when rownum_per_id = 1 THEN 'TRUE'
         when start_time between lag(start_time,1) over w and lag(end_time,1) over w THEN 'TRUE'
         ELSE 'FALSE' END as overlap_ind
from 
    (select id,start_time,end_time,
        row_number() over(partition by id order by start_time) as rownum_per_id
     from (select id,
             from_unixtime(unix_timestamp(start_time,"HH:mm")) as start_time,
             from_unixtime(unix_timestamp(end_time,"HH:mm")) as end_time
           from test_table
         ) a
    ) b
window w as (partition by id order by start_time)

The output is:

id  start_time          end_time            overlap_ind
101 1970-01-01 10:00:00 1970-01-01 12:00:00 TRUE
101 1970-01-01 10:15:00 1970-01-01 12:30:00 TRUE
101 1970-01-01 12:15:00 1970-01-01 12:45:00 TRUE
101 1970-01-01 13:00:00 1970-01-01 14:00:00 FALSE
102 1970-01-01 10:15:00 1970-01-01 10:30:00 TRUE

But I can't figure out the next step: how to assign an incrementing group_id.

Answer 1

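You can do this with plain string comparisons, since zero-padded HH:mm strings sort in time order. One approach is to compute, for each row, the cumulative maximum end_time over all preceding rows of the same id. Use that value to determine where a new group starts, then take a cumulative sum of those group starts: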

select t.*,
       sum(case when prev_end_time >= start_time then 0
                else 1 end) over (partition by id order by start_time) as group_id
from (select t.*,
             max(end_time) over (partition by id order by start_time rows between unbounded preceding and 1 preceding) as prev_end_time
      from finance_acturl_data_smith.test_table t
     ) t;
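
To sanity-check the query outside Hive, here is a minimal sketch that runs the same window logic in SQLite through Python's sqlite3 module. It assumes SQLite 3.25 or later (needed for window functions), uses an in-memory table named test_table instead of the schema-qualified name above, and per_id_group_ids is a hypothetical helper name chosen for this illustration.

```python
import sqlite3

# Same logic as the Hive query above; only the table name is shortened
# for the in-memory demo (an assumption of this sketch).
QUERY = """
select t.id, t.start_time, t.end_time,
       sum(case when prev_end_time >= start_time then 0
                else 1 end) over (partition by id order by start_time) as group_id
from (select t.*,
             max(end_time) over (partition by id order by start_time
                                 rows between unbounded preceding and 1 preceding) as prev_end_time
      from test_table t
     ) t
order by id, start_time
"""

# Sample rows from the question; all three columns are strings.
SAMPLE_ROWS = [
    ("101", "10:00", "12:00"),
    ("101", "10:15", "12:30"),
    ("101", "12:15", "12:45"),
    ("101", "13:00", "14:00"),
    ("102", "10:15", "10:30"),
]


def per_id_group_ids(rows=SAMPLE_ROWS):
    """Load rows into an in-memory table and return (id, start, end, group_id) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table test_table (id text, start_time text, end_time text)")
        conn.executemany("insert into test_table values (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


for row in per_id_group_ids():
    print(row)
```

On the sample data the numbering restarts within each id, so the four rows for 101 get groups 1, 1, 1, 2 and the single row for 102 gets group 1.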

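Edit:

The query above generates a separate group_id sequence for each id. We can adjust that by removing the partition by from the outer sum and ordering by id, start_time instead: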

select t.*,
       sum(case when prev_end_time >= start_time then 0
                else 1 end) over (order by id, start_time) as group_id
from (select t.*,
             max(end_time) over (partition by id order by start_time rows between unbounded preceding and 1 preceding) as prev_end_time
      from finance_acturl_data_smith.test_table t
     ) t;

This works because the inner partition by is still used to define prev_end_time, so the first value for each id is NULL and falls through to the else branch.
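
The same reasoning can be traced step by step in plain Python. This is only an illustrative sketch of what the two window functions compute, not how Hive executes them; assign_group_ids is a hypothetical name, and None stands in for SQL NULL.

```python
def assign_group_ids(rows):
    """rows: iterable of (id, start_time, end_time) with zero-padded HH:mm strings.

    Returns (id, start_time, end_time, group_id) tuples, numbered globally.
    """
    result = []
    group_id = 0
    current_id = None
    prev_end_time = None  # running max of end_time over earlier rows of the same id
    # Tuple sort gives "order by id, start_time" (ties broken by end_time).
    for rec_id, start, end in sorted(rows):
        if rec_id != current_id:
            # New partition: there are no preceding rows, so the running max is NULL.
            current_id = rec_id
            prev_end_time = None
        # In SQL, NULL >= start_time is not true, so the first row of each id
        # takes the else branch and starts a new group.
        if prev_end_time is None or prev_end_time < start:
            group_id += 1
        result.append((rec_id, start, end, group_id))
        prev_end_time = end if prev_end_time is None else max(prev_end_time, end)
    return result


sample = [
    ("101", "10:00", "12:00"),
    ("101", "10:15", "12:30"),
    ("101", "12:15", "12:45"),
    ("101", "13:00", "14:00"),
    ("102", "10:15", "10:30"),
]
for row in assign_group_ids(sample):
    print(row)
```

Because prev_end_time is a running maximum rather than just the previous row's end_time, a long interval such as 10:00 to 14:00 keeps later, shorter intervals that fall inside it (say 10:15 to 10:30 followed by 11:00 to 11:30) in the same group, which a comparison against lag(end_time) alone would miss.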
