I have a table like this:
CREATE TABLE time_records (
id uuid NOT NULL,
employee_id uuid NOT NULL,
starttime timestamp NOT NULL,
endtime timestamp NOT NULL
);
Records for the same employee_id can overlap in time:
id | employee_id | starttime | endtime |
---|---|---|---|
1 | 1 | '2023-09-01 07:00:00' | '2023-09-01 09:15:00' |
2 | 1 | '2023-09-01 07:00:00' | '2023-09-01 15:00:00' |
3 | 1 | '2023-09-01 07:00:00' | '2023-09-01 15:00:00' |
4 | 1 | '2023-09-01 14:00:00' | '2023-09-01 15:00:00' |
5 | 1 | '2023-09-01 23:45:00' | '2023-09-01 23:59:00' |
6 | 1 | '2023-09-01 23:45:00' | '2023-09-01 23:59:00' |
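What I want is the merged time range covering each set of overlapping records, along with the ids involved:
employee_id | starttime | endtime | ids |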
---|---|---|---|
1 | '2023-09-01 07:00:00' | '2023-09-01 15:00:00' | [1,2,3,4] |
1 | '2023-09-01 23:45:00' | '2023-09-01 23:59:00' | [5,6] |
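If there is only one set of overlapping times in a day, I can get it to work using min/max for the start and end times, but I can't seem to get it to work when there are multiple sets of overlapping times in a day: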
select timea.employee_id,
min(timea.starttime) starttime,
max(timea.endtime) endtime,
array_agg(timea.id) ids
from time_records timea
inner join time_records timea2 on timea.employee_id = timea2.employee_id and
tsrange(timea2.starttime, timea2.endtime, '[]') &&
tsrange(timea.starttime, timea.endtime, '[]')
and timea.id != timea2.id
group by timea.employee_id;
Result:
employee_id | starttime | endtime | ids |
---|---|---|---|
1 | '2023-09-01 07:00:00' | '2023-09-01 23:59:00' | [1,2,3,4,5,6] |
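Using a cte that computes the maximum endtime for each starttime, you can find the largest overlapping intervals (those not contained in any other) and then join the original time_records table back to them with aggregation. This handles ranges nested inside a larger one, as in the sample data; a chain of partially overlapping ranges where none contains the others is not merged. The query: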
with cte as (
select t.employee_id, t.starttime, max(t.endtime) m from time_records t
group by t.employee_id, t.starttime
)
select c.employee_id, c.starttime, c.m, array_agg(t.id)
from cte c join time_records t on c.employee_id = t.employee_id and c.starttime <= t.starttime and t.endtime <= c.m
where not exists (select 1 from cte t1 where t1.employee_id = c.employee_id and t1.starttime < c.starttime and c.m <= t1.m)
group by c.employee_id, c.starttime, c.m;
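"... get it to work when there are multiple sets of overlapping times in a day"
Plain aggregation with min() and max() cannot solve this. Which rows end up forming a group only becomes apparent after merging the ranges.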
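To illustrate why the groups only emerge after merging, here is a minimal sketch of the common "gaps and islands" formulation with window functions. It is not taken from the answer itself; it is an alternative that should also run on versions before Postgres 14. The names prev_max_end, is_new and grp are made up for this example. A row starts a new group when its starttime lies after the running maximum endtime of all earlier rows for the same employee, and touching ranges are merged, like with '[]' bounds.

```sql
SELECT employee_id
     , min(starttime) AS starttime
     , max(endtime)   AS endtime
     , array_agg(id ORDER BY id) AS ids
FROM  (
   -- running count of "new group" flags yields a group number per row
   SELECT *
        , count(*) FILTER (WHERE is_new)
             OVER (PARTITION BY employee_id
                   ORDER BY starttime, endtime, id) AS grp
   FROM  (
      -- flag rows that start after every earlier range has ended;
      -- NULL for the first row of each employee, which the FILTER ignores
      SELECT *
           , starttime > max(endtime)
                OVER (PARTITION BY employee_id
                      ORDER BY starttime, endtime, id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS is_new
      FROM   time_records
      ) flagged
   ) grouped
GROUP  BY employee_id, grp
ORDER  BY employee_id, starttime;
```

For the sample rows this should produce one group for ids 1 to 4 and one for ids 5 and 6.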
range_agg() makes the task much simpler. It was added with Postgres 14. Computing the merged ranges is now very simple:
SELECT unnest(range_agg(tsrange(starttime, endtime, '[]'))) AS merged_range
FROM time_records;
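To try range_agg() without touching a real table, the same expression can be run against inline values. This is only a sketch: the VALUES list repeats the distinct sample ranges from the question, and the aliases v, s and e are made up for the example.

```sql
-- requires Postgres 14 or later (range_agg and unnest for multiranges)
SELECT unnest(range_agg(tsrange(s, e, '[]'))) AS merged_range
FROM  (
   VALUES
     (timestamp '2023-09-01 07:00:00', timestamp '2023-09-01 09:15:00')
   , (timestamp '2023-09-01 07:00:00', timestamp '2023-09-01 15:00:00')
   , (timestamp '2023-09-01 14:00:00', timestamp '2023-09-01 15:00:00')
   , (timestamp '2023-09-01 23:45:00', timestamp '2023-09-01 23:59:00')
   ) v(s, e);
```

With these rows the query should yield two merged ranges, 07:00 to 15:00 and 23:45 to 23:59. Note that without GROUP BY employee_id, ranges of different employees would be merged together.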
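To also get the array of involved IDs, we need to do more. One way is to join back to the underlying table and aggregate once more (this time grouping by the merged range):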
SELECT employee_id, lower(merged) AS starttime, upper(merged) AS endtime
, array_agg(t.id) AS ids
FROM (
SELECT employee_id, unnest(range_agg(tsrange(starttime, endtime, '[]'))) AS merged
FROM time_records
GROUP BY employee_id
) r
JOIN time_records t USING (employee_id)
WHERE r.merged @> t.starttime
GROUP BY r.employee_id, r.merged
ORDER BY r.employee_id, r.merged;
Another way, using a LATERAL subquery:
SELECT r.employee_id, lower(r.merged) AS starttime, upper(r.merged) AS endtime, i.ids
FROM (
SELECT employee_id, unnest(range_agg(tsrange(starttime, endtime, '[]'))) AS merged
FROM time_records
GROUP BY employee_id
) r
CROSS JOIN LATERAL (
SELECT ARRAY (
SELECT t.id
FROM time_records t
WHERE t.employee_id = r.employee_id
AND t.starttime <@ r.merged
ORDER BY t.id
)
) i (ids)
ORDER BY r.employee_id, r.merged;
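Not sure whether either query is faster than my custom function below, since that iterates over the whole table only once.
While stuck on an outdated version, create a custom set-returning function (once):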
CREATE OR REPLACE FUNCTION public.f_merge_ranges()
  RETURNS TABLE (
    employee_id uuid       -- uuid to match the table definition in the question; use int / int[] for integer keys
  , starttime   timestamp
  , endtime     timestamp
  , ids         uuid[]
  )
  LANGUAGE plpgsql AS
$func$
DECLARE
   r record;  -- current row
BEGIN
   FOR r IN
      SELECT t.id, t.employee_id, t.starttime, t.endtime
      FROM   time_records t
      ORDER  BY t.employee_id, t.starttime, t.endtime DESC, t.id  -- better take longer range first
   LOOP
      IF r.employee_id = employee_id THEN  -- works for null in first iteration
         IF r.starttime > endtime THEN     -- gap: emit the finished group, start a new one
            RETURN NEXT;
            starttime := r.starttime;
            endtime   := r.endtime;
            ids       := ARRAY[r.id];
         ELSE                              -- overlap: extend the current group
            ids := ids || r.id;
            IF r.endtime > endtime THEN
               endtime := r.endtime;
            END IF;
         END IF;
      ELSE                                 -- first row or next employee
         IF employee_id IS NOT NULL THEN   -- catch first iteration
            RETURN NEXT;
         END IF;
         employee_id := r.employee_id;
         starttime   := r.starttime;
         endtime     := r.endtime;
         ids         := ARRAY[r.id];
      END IF;
   END LOOP;

   -- return last row (if any)
   IF FOUND THEN
      RETURN NEXT;
   END IF;
END
$func$;
Call:
SELECT * FROM public.f_merge_ranges();
Unlike the queries above, the array in ids is unsorted. You need to do more if you need that.
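One possible way to get sorted arrays, shown here only as a sketch, is to re-sort each array in the calling query rather than inside the function. The alias x is made up for this example.

```sql
SELECT employee_id, starttime, endtime
     , ARRAY(SELECT x FROM unnest(ids) AS x ORDER BY x) AS ids  -- rebuild the array in sorted order
FROM   public.f_merge_ranges();
```

This leaves the function untouched, at the cost of one extra unnest and sort per returned row.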