I have data in a Hive table as shown below, where id, start_time, and end_time are all strings:
id start_time end_time
101 10:00 12:00
101 10:15 12:30
101 12:15 12:45
101 13:00 14:00
102 10:15 10:30
I want to create a new field, group_id, that identifies the records within each id whose start_time/end_time intervals overlap. The desired output is:
id start_time end_time group_id
101 10:00 12:00 1
101 10:15 12:30 1
101 12:15 12:45 1
101 13:00 14:00 2
102 10:15 10:30 3
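For example, among the records for id 101, the first three overlap: the second overlaps the first because 10:15 (the second start time) falls between 10:00 and 12:00 (the first start and end times). The third overlaps the second because 12:15 (the third start time) falls between 10:15 and 12:30 (the second start and end times). The fourth record overlaps nothing, so it is assigned the next group ID (2). The last record has a different id and is alone in its group, so it gets the next ID (3).
I tried comparing each record with the previous one, using the lag function to check whether they overlap: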
select id, start_time,end_time,
case when rownum_per_id = 1 THEN 'TRUE'
when start_time between lag(start_time,1) over w and lag(end_time,1) over w THEN 'TRUE'
ELSE 'FALSE' END as overlap_ind
from
(select id,start_time,end_time,
row_number() over(partition by id order by start_time) as rownum_per_id
from (select id,
from_unixtime(unix_timestamp(start_time,"HH:mm")) as start_time,
from_unixtime(unix_timestamp(end_time,"HH:mm")) as end_time
from test_table
) a
) b
window w as (partition by id order by start_time)
The output is:
id start_time end_time overlap_ind
101 1970-01-01 10:00:00 1970-01-01 12:00:00 TRUE
101 1970-01-01 10:15:00 1970-01-01 12:30:00 TRUE
101 1970-01-01 12:15:00 1970-01-01 12:45:00 TRUE
101 1970-01-01 13:00:00 1970-01-01 14:00:00 FALSE
102 1970-01-01 10:15:00 1970-01-01 10:30:00 TRUE
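But I can't figure out the next step: assigning an incrementing group_id.
You can just use string operations, since zero-padded HH:mm strings sort in chronological order. One approach is to compute, for each row, the cumulative maximum of end_time over all earlier rows with the same id. Use that value to determine where a new group starts, then take a cumulative sum of those group-start flags: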
select t.*,
sum(case when prev_end_time >= start_time then 0
else 1 end) over (partition by id order by start_time) as group_id
from (select t.*,
max(end_time) over (partition by id order by start_time rows between unbounded preceding and 1 preceding) as prev_end_time
from finance_acturl_data_smith.test_table t
) t;
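To see why the query tracks a running maximum of end_time rather than only the previous row's end_time (as the lag attempt in the question does), here is a minimal Python sketch of the same logic. The function name group_ids_per_id and the sample rows are made up for illustration; they are not part of the original table, and the sketch assumes zero-padded HH:mm strings.

```python
from itertools import groupby


def group_ids_per_id(rows):
    """Mimic the query above: group_id restarts at 1 for every id.

    rows: iterable of (id, start_time, end_time) tuples whose times are
    zero-padded 'HH:mm' strings, so plain string comparison is chronological.
    """
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, chunk in groupby(ordered, key=lambda r: r[0]):
        prev_end = None  # running max of end_time over earlier rows (NULL on the first row)
        group_id = 0
        for rid, start, end in chunk:
            # A new group starts when no earlier interval reaches this start_time.
            if prev_end is None or prev_end < start:
                group_id += 1
            out.append((rid, start, end, group_id))
            prev_end = end if prev_end is None else max(prev_end, end)
    return out


# Hypothetical rows: the third interval overlaps the first but not the second.
sample = [
    ("101", "10:00", "12:00"),
    ("101", "10:15", "10:30"),
    ("101", "11:00", "11:30"),
    ("101", "13:00", "14:00"),
]
```

With these sample rows, the first three records land in group 1 and the last in group 2. Comparing 11:00 only against the previous row's end_time (10:30) would wrongly open a new group at the third record, which is the case the cumulative maximum handles.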
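Edit:
The above produces a group_id that restarts for each id. To number the groups across all ids instead, remove the partition by from the outer sum and order by id, start_time: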
select t.*,
sum(case when prev_end_time >= start_time then 0
else 1 end) over (order by id, start_time) as group_id
from (select t.*,
max(end_time) over (partition by id order by start_time rows between unbounded preceding and 1 preceding) as prev_end_time
from finance_acturl_data_smith.test_table t
) t;
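This works because the partition by in the subquery is still used to define prev_end_time, so the first row for each id has a NULL prev_end_time and falls through to the else branch.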
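As a rough cross-check of the edited query, the following Python sketch applies the same rule to the sample data from the question: the running maximum resets for each id, while the group counter runs across all ids. The function name global_group_ids is hypothetical, and the sketch again assumes zero-padded HH:mm strings.

```python
def global_group_ids(rows):
    """Mimic the edited query: group_id keeps increasing across ids.

    rows: iterable of (id, start_time, end_time) tuples with zero-padded
    'HH:mm' strings.
    """
    out = []
    group_id = 0
    prev_id = None
    prev_end = None
    for rid, start, end in sorted(rows, key=lambda r: (r[0], r[1])):
        if rid != prev_id:
            # Inner PARTITION BY id: the running max starts over (NULL) for a new id.
            prev_end = None
        # NULL or a gap before start_time goes to the "else 1" branch.
        if prev_end is None or prev_end < start:
            group_id += 1
        out.append((rid, start, end, group_id))
        prev_end = end if prev_end is None else max(prev_end, end)
        prev_id = rid
    return out


# The rows from the question.
question_rows = [
    ("101", "10:00", "12:00"),
    ("101", "10:15", "12:30"),
    ("101", "12:15", "12:45"),
    ("101", "13:00", "14:00"),
    ("102", "10:15", "10:30"),
]
```

Running global_group_ids(question_rows) assigns group IDs 1, 1, 1, 2, 3, matching the desired output in the question.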