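I have a lot of measurements in a Postgres table, and I need to split them into groups whenever a value is too far from the "start" point of the current group (beyond some threshold). The sort order is determined by the id column.
Example: splitting with threshold = 1: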
id measurements
---------------
1 1.5
2 1.4
3 1.8
4 2.6
5 3.7
6 3.5
7 3.0
8 2.6
9 2.5
10 2.8
should be split into groups as follows:
id measurements group
---------------------
1 1.5 0 --- start new group
2 1.4 0
3 1.8 0
4 2.6 1 --- start new group because it is too far from 1.5
5 3.7 2 --- start new group because it is too far from 2.6
6 3.5 2
7 3.0 2
8 2.6 3 --- start new group because it is too far from 3.7
9 2.5 3
10 2.8 3
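I can do this by writing a function with a LOOP, but I am looking for a more efficient way. Performance matters a lot, since the actual table contains millions of rows.
Is it possible to achieve this using PARTITION OVER, a CTE, or any other kind of SELECT?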
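For anyone who wants to try the answers below against the sample data, here is a minimal setup sketch. The table name tbl and the column name measurement are assumptions borrowed from the PL/pgSQL answer further down (the question itself labels the column "measurements"), and numeric is an assumed column type.

```sql
-- Hypothetical sample table; names and types are assumptions, not from the question.
CREATE TABLE tbl (
   id          int PRIMARY KEY,
   measurement numeric NOT NULL
);

-- The ten sample rows from the question.
INSERT INTO tbl (id, measurement) VALUES
   (1, 1.5), (2, 1.4), (3, 1.8), (4, 2.6), (5, 3.7),
   (6, 3.5), (7, 3.0), (8, 2.6), (9, 2.5), (10, 2.8);
```

Using numeric rather than a floating-point type keeps comparisons such as 2.6 - 1.5 > 1 exact, which matters when values sit close to the threshold.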
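You appear to be starting a new group when the difference between consecutive rows exceeds 0.5. Assuming you have an ordering column, you can use lag() and a cumulative count to get the groups: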
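select t.*,
       -- running count of "jumps" so far; the over (...) clause makes it cumulative
       count(*) filter (where prev_value < value - 0.5)
           over (order by <ordering col>) as grouping
from (select t.*,
             lag(value) over (order by <ordering col>) as prev_value
      from t
     ) t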
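One caveat is worth checking against the sample data: lag() compares each row with the previous row, whereas the question compares each row with the first row of the current group. The sketch below applies the same idea to an inline copy of the question's data. The threshold of 1, the use of abs(), and the names tbl, measurement and grp are assumptions for illustration, not part of the answer above.

```sql
-- Inline copy of the sample data, so the query runs without any table.
WITH tbl (id, measurement) AS (
   VALUES (1, 1.5), (2, 1.4), (3, 1.8), (4, 2.6), (5, 3.7),
          (6, 3.5), (7, 3.0), (8, 2.6), (9, 2.5), (10, 2.8)
)
SELECT id, measurement,
       -- running count of rows that differ from their predecessor by more than 1
       count(*) FILTER (WHERE abs(measurement - prev_measurement) > 1)
          OVER (ORDER BY id) AS grp
FROM  (
   SELECT id, measurement,
          lag(measurement) OVER (ORDER BY id) AS prev_measurement
   FROM   tbl
   ) sub
ORDER  BY id;
```

Between neighbouring rows, only the step from id 4 to id 5 (2.6 to 3.7) exceeds the threshold, so this produces two groups instead of the four the question expects. The window-function approach fits only if "distance from the previous row" is what is actually wanted.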
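One way to solve this is with a recursive CTE. This example is written in SQL Server syntax (since I don't use Postgres), but the translation should be straightforward.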
-- Table #Test:
-- sequenceno measurements
-- ----------- ------------
-- 1 1.5
-- 2 1.4
-- 3 1.8
-- 4 2.6
-- 5 3.7
-- 6 3.5
-- 7 3.0
-- 8 2.6
-- 9 2.5
-- 10 2.8
WITH datapoints
AS
(
SELECT sequenceno,
measurements,
startmeasurement = measurements,
groupno = 0
FROM #Test
WHERE sequenceno = 1
UNION ALL
SELECT sequenceno = A.sequenceno + 1,
measurements = B.measurements,
startmeasurement =
CASE
WHEN abs(B.measurements - A.startmeasurement) >= 1 THEN B.measurements
ELSE A.startmeasurement
END,
groupno =
A.groupno +
CASE
WHEN abs(B.measurements - A.startmeasurement) >= 1 THEN 1
ELSE 0
END
FROM datapoints as A
INNER JOIN #Test as B
ON A.sequenceno + 1 = B.sequenceno
)
SELECT sequenceno,
measurements,
groupno
FROM datapoints
ORDER BY
sequenceno
-- Output:
-- sequenceno measurements groupno
-- ----------- --------------- -------
-- 1 1.5 0
-- 2 1.4 0
-- 3 1.8 0
-- 4 2.6 1
-- 5 3.7 2
-- 6 3.5 2
-- 7 3.0 2
-- 8 2.6 3
-- 9 2.5 3
-- 10 2.8 3
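Note that I added a "sequenceno" column to the starting table, because relational tables are considered unordered sets. Also, if the number of input values is too large (more than roughly 100 rows, since SQL Server's default recursion limit is 100), you may have to adjust the MAXRECURSION value (at least in SQL Server).
[Additional note: I just noticed that the original question mentions millions of records in the input data set. The CTE approach is only viable if that data can be broken down into manageable chunks.]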
"Is it possible to achieve this using PARTITION OVER, a CTE, or any other kind of SELECT?"
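This is an inherently procedural problem. Depending on where you start, all later rows can end up in a different group and/or with a different group value. Window functions (using the PARTITION clause) are no good for this.
A recursive CTE works, though: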
WITH RECURSIVE rcte AS (
(
SELECT id
, measurement
, measurement - 1 AS grp_min
, measurement + 1 AS grp_max
, 1 AS grp
FROM tbl
ORDER BY id
LIMIT 1
)
UNION ALL
(
SELECT t.id
, t.measurement
, CASE WHEN t.same_grp THEN r.grp_min ELSE t.measurement - 1 END -- AS grp_min
, CASE WHEN t.same_grp THEN r.grp_max ELSE t.measurement + 1 END -- AS grp_max
, CASE WHEN t.same_grp THEN r.grp ELSE r.grp + 1 END -- AS grp
FROM rcte r
CROSS JOIN LATERAL (
SELECT *, t.measurement BETWEEN r.grp_min AND r.grp_max AS same_grp
FROM tbl t
WHERE t.id > r.id
ORDER BY t.id
LIMIT 1
) t
)
)
SELECT id, measurement, grp
FROM rcte;
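Elegant, and decently fast. But it is only about as fast as, or even slower than, a procedural language function with a single loop over the set, when that function is implemented efficiently:
CREATE OR REPLACE FUNCTION f_measurement_groups(_threshold numeric = 1)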
RETURNS TABLE (id int, grp int, measurement numeric) AS
$func$
DECLARE
_grp_min numeric;
_grp_max numeric;
BEGIN
grp := 0; -- init
FOR id, measurement IN
SELECT t.id, t.measurement FROM tbl t ORDER BY t.id -- explicit columns: independent of the table's column order
LOOP
IF measurement BETWEEN _grp_min AND _grp_max THEN
RETURN NEXT;
ELSE
SELECT INTO grp , _grp_min , _grp_max
grp + 1, measurement - _threshold, measurement + _threshold;
RETURN NEXT;
END IF;
END LOOP;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_measurement_groups(); -- optionally supply different threshold
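Since the function returns a set, its result can be queried like a table. As a usage sketch, assuming the function above has been created and tbl holds the ten sample rows from the question, the following summarizes each group. The output column names first_id, last_id and row_count are made up for illustration.

```sql
-- One summary row per group, using the default threshold of 1.
SELECT grp,
       min(id)  AS first_id,
       max(id)  AS last_id,
       count(*) AS row_count
FROM   f_measurement_groups()
GROUP  BY grp
ORDER  BY grp;

-- Same rows with a tighter, explicitly supplied threshold.
SELECT * FROM f_measurement_groups(0.5);
```

With the default threshold, the sample data should come back as four groups covering ids 1-3, 4, 5-7 and 8-10. They are numbered from 1 rather than from 0 as in the question, because grp is incremented before the first row is returned. Note also that BETWEEN is inclusive, so a value exactly _threshold away from the group's start stays in the current group.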
My money is on the procedural function. Set-based solutions are typically faster, but not when solving an inherently procedural problem.