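This is a simplified version of my problem.
We have a table named positions that stores the movements of an item across multiple containers. Each record contains container, date_from, and date_to, where the latter two hold the timestamps at which the item entered and left the container.
There may be "time gaps" between two consecutive records, i.e. the item was in container A until 10 AM and then shows up in container B at 4 PM, with nothing in between.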
Here is a sample dataset:
id | container | date_from | date_to |
---|---|---|---|
1 | A | 2023-10-01T00:00:00 | 2023-10-01T10:00:00 |
2 | A | 2023-10-03T09:00:00 | 2023-10-03T11:00:00 |
3 | B | 2023-10-04T02:00:00 | 2023-10-04T03:00:00 |
4 | C | 2023-10-04T06:00:00 | 2023-10-04T08:00:00 |
5 | C | 2023-10-05T00:00:00 | 2023-10-06T10:00:00 |
6 | A | 2023-10-06T11:00:00 | 2023-10-06T20:00:00 |
7 | C | 2023-10-06T21:00:00 | 2023-10-07T10:00:00 |
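I need to squash all consecutive positions in the same container whose date_from is within a certain time threshold of the previous position's date_to. For each subsequence I squash, I need to take the first value of date_from and the last value of date_to and put them in the same result row.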
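
The squashing rule described above can be sketched outside the database as a single pass over the rows ordered by date_from. This is only an illustration in Python, not the query itself: the function name squash and the 5-day threshold are assumptions chosen for the example, and the rows are the sample dataset from the question.

```python
from datetime import datetime, timedelta

# Sample rows from the question: (id, container, date_from, date_to).
ROWS = [
    (1, "A", "2023-10-01T00:00:00", "2023-10-01T10:00:00"),
    (2, "A", "2023-10-03T09:00:00", "2023-10-03T11:00:00"),
    (3, "B", "2023-10-04T02:00:00", "2023-10-04T03:00:00"),
    (4, "C", "2023-10-04T06:00:00", "2023-10-04T08:00:00"),
    (5, "C", "2023-10-05T00:00:00", "2023-10-06T10:00:00"),
    (6, "A", "2023-10-06T11:00:00", "2023-10-06T20:00:00"),
    (7, "C", "2023-10-06T21:00:00", "2023-10-07T10:00:00"),
]


def squash(rows, threshold):
    """Merge consecutive same-container rows whose gap is within threshold.

    Hypothetical helper for illustration; returns (container, date_from, date_to).
    """
    parsed = sorted(
        ((c, datetime.fromisoformat(f), datetime.fromisoformat(t)) for _, c, f, t in rows),
        key=lambda r: r[1],  # order by date_from
    )
    merged = []
    for container, date_from, date_to in parsed:
        if merged:
            last_container, last_from, last_to = merged[-1]
            if container == last_container and date_from - last_to <= threshold:
                # Extend the current run: keep its first date_from, take the new date_to.
                merged[-1] = (last_container, last_from, date_to)
                continue
        # Different container or gap too large: start a new run.
        merged.append((container, date_from, date_to))
    return merged


if __name__ == "__main__":
    for container, date_from, date_to in squash(ROWS, timedelta(days=5)):
        print(container, date_from.isoformat(), date_to.isoformat())
```

With the assumed 5-day threshold, rows 1 and 2 collapse into one A row and rows 4 and 5 into one C row, so the seven input rows become five.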
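For example, if there are 5 consecutive records in container A that, according to the rule, are close enough to be squashed, then the final row for those squashed positions will have:
- container = A
- date_from taken from the first of the 5 positions I squashed
- date_to taken from the last of the 5 positions

WITH with_next_position AS (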
    SELECT
        id,
        container,
        date_from,
        date_to,
        (
            SELECT subquery.id
            FROM positions subquery
            WHERE subquery.date_from > base.date_from
            ORDER BY subquery.date_from ASC
            LIMIT 1
        ) AS next_position_id
    FROM positions base
),
with_time_lapse AS (
    SELECT
        with_next_position.date_from AS date_from,
        with_next_position.date_to AS date_to,
        with_next_position.container AS container,
        CASE
            WHEN join_table.date_from IS NOT NULL
                THEN EXTRACT(EPOCH FROM (join_table.date_from - with_next_position.date_to))
            ELSE NULL
        END AS time_lapse,
        join_table.container AS next_container
    FROM with_next_position
    FULL OUTER JOIN with_next_position join_table
        ON join_table.id = with_next_position.next_position_id
    WHERE with_next_position.container IS NOT NULL
),
with_marked_to_squash AS (
    SELECT
        date_from,
        date_to,
        container,
        CASE
            WHEN next_container = container AND time_lapse <= 10000000 -- This is where I put the threshold
                THEN TRUE
            ELSE FALSE
        END AS to_squash
    FROM with_time_lapse
),
with_marked_first_to_squash AS (
    SELECT
        date_from,
        date_to,
        container,
        to_squash,
        CASE
            WHEN to_squash
                THEN (
                    SELECT CASE WHEN to_squash THEN FALSE ELSE TRUE END
                    FROM with_marked_to_squash subquery
                    WHERE subquery.date_from < with_marked_to_squash.date_from
                    ORDER BY subquery.date_from DESC
                    LIMIT 1
                )
            ELSE FALSE
        END AS first_to_squash
    FROM with_marked_to_squash
),
with_first_to_squash AS (
    SELECT
        date_from,
        date_to,
        container,
        (
            SELECT subquery.date_from
            FROM with_marked_first_to_squash subquery
            WHERE subquery.date_from < with_marked_first_to_squash.date_from AND first_to_squash IS TRUE
            ORDER BY subquery.date_from DESC
            LIMIT 1
        ) AS first_date_in_position
    FROM with_marked_first_to_squash
    WHERE to_squash IS FALSE
)
SELECT
    COALESCE(first_date_in_position, date_from) AS date_from,
    date_to,
    container,
    EXTRACT(EPOCH FROM (date_to - COALESCE(first_date_in_position, date_from))) AS time_spent
FROM with_first_to_squash
ORDER BY date_from
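The query above is correct and does what I expect. However, a performance problem appears when the subquery with_first_to_squash is evaluated. If I cut the query off right before with_first_to_squash, performance improves dramatically.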
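I think the cause of the performance problem is that, by running with_marked_first_to_squash and with_first_to_squash one after the other, I make the database engine go through two nested loops:
- one for the subquery inside the definition of with_marked_first_to_squash
- one for the subquery that extracts date_from
The moment I remove the second subquery, things get very fast.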
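
The nested-loop intuition can be illustrated with a small Python sketch. It assumes the correlated subquery is re-evaluated once per outer row, which is an assumption about the execution plan rather than something measured here. The function names prev_by_scan and prev_by_single_pass are hypothetical. The first rescans every row to find the nearest earlier one, while the second carries the previous row along in one ordered pass.

```python
def prev_by_scan(starts):
    """For each start, find the latest earlier start by rescanning the whole list.

    Mimics a per-row correlated lookup (WHERE x < outer.x ORDER BY x DESC LIMIT 1).
    """
    comparisons = 0
    result = []
    for s in starts:
        best = None
        for other in starts:
            comparisons += 1
            if other < s and (best is None or other > best):
                best = other
        result.append(best)
    return result, comparisons


def prev_by_single_pass(starts):
    """Same result for sorted, distinct input, remembering only the previous value."""
    steps = 0
    result = []
    prev = None
    for s in starts:
        steps += 1
        result.append(prev)
        prev = s
    return result, steps


if __name__ == "__main__":
    data = list(range(200))  # stand-in for sorted date_from values
    scanned, comparisons = prev_by_scan(data)
    carried, steps = prev_by_single_pass(data)
    print(scanned == carried, comparisons, steps)
```

For n rows the rescanning version performs n * n comparisons and the single pass performs n steps, which is the kind of difference a window function over one sort is meant to exploit.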
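I am sure there is a way to extract date_from from the first position of each subsequence, probably involving partitions, but I am not familiar with partitioning or its syntax. Can anyone give me a hint?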
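I suspect the subqueries in your select lists are what is hurting your performance.
Please try the following window-function solution to your gaps-and-islands problem, since it only needs to sort once: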
with squashes as (
  select *,
         case
           when container = lag(container) over w
            and date_from - lag(date_to) over w <= interval '5 days' then false
           else true
         end as keep_me
    from positions
  window w as (order by date_from)
), islands as (
  select *, sum(keep_me::int) over (order by date_from) as group_num
    from squashes
)
select container, min(date_from) as date_from, max(date_to) as date_to
  from islands
 group by group_num, container
 order by group_num;
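
To see what each step of this query computes, here is a rough Python trace of the same three stages on the sample data from the question: the lag() comparison that produces keep_me, the running sum that produces group_num, and the final min/max per group. The function name islands is hypothetical, and this only mirrors the logic of the SQL; it does not reproduce how the database executes it.

```python
from datetime import datetime, timedelta
from itertools import accumulate

# Sample rows from the question: (container, date_from, date_to).
ROWS = [
    ("A", "2023-10-01T00:00:00", "2023-10-01T10:00:00"),
    ("A", "2023-10-03T09:00:00", "2023-10-03T11:00:00"),
    ("B", "2023-10-04T02:00:00", "2023-10-04T03:00:00"),
    ("C", "2023-10-04T06:00:00", "2023-10-04T08:00:00"),
    ("C", "2023-10-05T00:00:00", "2023-10-06T10:00:00"),
    ("A", "2023-10-06T11:00:00", "2023-10-06T20:00:00"),
    ("C", "2023-10-06T21:00:00", "2023-10-07T10:00:00"),
]


def islands(rows, threshold=timedelta(days=5)):
    """Return (keep_me flags, group numbers, squashed rows) for the given rows."""
    rows = sorted(
        ((c, datetime.fromisoformat(f), datetime.fromisoformat(t)) for c, f, t in rows),
        key=lambda r: r[1],  # window w as (order by date_from)
    )
    # keep_me is True when a row starts a new island: no previous row,
    # a different container, or a gap larger than the threshold.
    keep_me = [
        i == 0
        or row[0] != rows[i - 1][0]
        or row[1] - rows[i - 1][2] > threshold
        for i, row in enumerate(rows)
    ]
    # group_num is the running count of island starts (sum(keep_me::int) over ...).
    group_num = list(accumulate(int(k) for k in keep_me))
    # Aggregate each group: min(date_from), max(date_to).
    groups = {}
    for g, (c, f, t) in zip(group_num, rows):
        if g in groups:
            _, f0, t0 = groups[g]
            groups[g] = (c, min(f0, f), max(t0, t))
        else:
            groups[g] = (c, f, t)
    return keep_me, group_num, [groups[g] for g in sorted(groups)]


if __name__ == "__main__":
    flags, numbers, squashed = islands(ROWS)
    print(flags)
    print(numbers)
    for container, date_from, date_to in squashed:
        print(container, date_from.isoformat(), date_to.isoformat())
```

On the sample data the group numbers come out as 1, 1, 2, 3, 3, 4, 5, so rows 1 and 2 and rows 4 and 5 are each merged into a single result row.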