我有一个不同类型点的数据集。对于数据集中的每个点,我想找到每个类别中最接近的点。我可以实现这一目标,但计算时间非常长,而且我正在努力让查询以有效的方式使用 KNN 的空间索引和类型信息。
样本数据生成
CREATE TYPE point_type AS ENUM ('1','2','3','4','5');
CREATE TABLE points AS
SELECT ST_MakePoint(
1000*random(),
1000*random()
)::geometry(Point) AS geom,
((random()*3)::int+1)::text::point_type point_type,
pk
FROM generate_series(1,6000) pk;
update points
set point_type='5' where pk=999;
索引创建
create index points_geom_idx
on points using gist (geom);
CREATE INDEX points_dual ON points (point_type, geom);
查询有效,但速度很慢,但有效
因为距离 KNN 首先被拉动,然后通过类型约束进行过滤?
explain analyse
with types as (
select column1::point_type point_type from (
values
('1'), ('2'), ('3'), ('4'),('5')
)
)
SELECT c1.point_type,
c1.pk AS main_id,
b.pk AS secondary_id,
c1.secondary_point_type,
b.secondary_point_type,
b.distance
FROM (SELECT c.point_type,
c.pk,
c.geom,
types.point_type secondary_point_type
FROM points c
join types on true
) c1
LEFT JOIN LATERAL ( SELECT c2.point_type,
c2.geom,
c2.pk,
c2.point_type secondary_point_type,
c1.geom <->c2.geom AS distance
FROM points c2
where c1.pk <> c2.pk and c1.secondary_point_type=c2.point_type
ORDER BY distance
LIMIT 1) b on true;
查询速度非常快,但无法提供正确的结果 我相信这是因为它只是获取最近的点,如果该点的类型不正确,则连接最终会失败,因此不会连接任何数据,为大多数结果留下空值
explain analyse
with types as (
select column1::point_type point_type from (
values
('1'), ('2'), ('3'), ('4'),('5')
)
)
SELECT c1.point_type,
c1.pk AS main_id,
b.pk AS secondary_id,
c1.secondary_point_type,
b.secondary_point_type,
b.distance
FROM (SELECT c.point_type,
c.pk,
c.geom,
types.point_type secondary_point_type
FROM points c
join types on true
) c1
LEFT JOIN LATERAL ( SELECT c2.point_type,
c2.geom,
c2.pk,
c2.point_type secondary_point_type,
c1.geom <->c2.geom AS distance
FROM points c2
where c1.pk <> c2.pk
ORDER BY distance
LIMIT 1) b on c1.secondary_point_type=b.secondary_point_type ;
我正在尝试快速实现此查询,对所有类型的所有 knn 度量使用空间索引。谢谢!
用于分析的输出 第一个查询:
Sort (cost=29155.39..29230.39 rows=30000 width=28) (actual time=24533.167..24543.539 rows=30000 loops=1)
" Output: c.point_type, c.pk, c2.pk, ((""*VALUES*"".column1)::point_type), c2.point_type, ((c.geom <-> c2.geom))"
Sort Key: c2.point_type DESC
Sort Method: quicksort Memory: 2409kB
Buffers: shared hit=180999
-> Nested Loop Left Join (cost=0.15..26924.49 rows=30000 width=28) (actual time=5.024..24430.122 rows=30000 loops=1)
" Output: c.point_type, c.pk, c2.pk, (""*VALUES*"".column1)::point_type, c2.point_type, ((c.geom <-> c2.geom))"
Buffers: shared hit=180999
-> Nested Loop (cost=0.00..499.07 rows=30000 width=72) (actual time=0.546..105.076 rows=30000 loops=1)
" Output: c.point_type, c.pk, c.geom, ""*VALUES*"".column1"
Buffers: shared hit=64
-> Seq Scan on public.points c (cost=0.00..124.00 rows=6000 width=40) (actual time=0.341..12.850 rows=6000 loops=1)
Output: c.geom, c.point_type, c.pk
Buffers: shared hit=64
-> Materialize (cost=0.00..0.09 rows=5 width=32) (actual time=0.001..0.006 rows=5 loops=6000)
" Output: ""*VALUES*"".column1"
" -> Values Scan on ""*VALUES*"" (cost=0.00..0.06 rows=5 width=32) (actual time=0.034..0.141 rows=5 loops=1)"
" Output: ""*VALUES*"".column1"
-> Limit (cost=0.15..0.86 rows=1 width=52) (actual time=0.802..0.803 rows=1 loops=30000)
Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, ((c.geom <-> c2.geom))
Buffers: shared hit=180935
-> Result (cost=0.15..4249.52 rows=5999 width=52) (actual time=0.800..0.800 rows=1 loops=30000)
Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, (c.geom <-> c2.geom)
" One-Time Filter: ((""*VALUES*"".column1)::point_type = (""*VALUES*"".column1)::point_type)"
Buffers: shared hit=180935
-> Index Scan using points_geom_idx on public.points c2 (cost=0.15..500.15 rows=5999 width=40) (actual time=0.787..0.787 rows=1 loops=30000)
Output: c2.geom, c2.point_type, c2.pk
Order By: (c2.geom <-> c.geom)
Filter: (c.pk <> c2.pk)
Rows Removed by Filter: 1
Buffers: shared hit=180935
Settings: search_path = 'public, topology, tiger'
Planning Time: 4.964 ms
Execution Time: 24553.107 ms
第二个查询:
QUERY PLAN
Nested Loop (cost=0.88..1197.38 rows=30000 width=28) (actual time=3.535..4538.832 rows=30000 loops=1)
" Output: c.point_type, c.pk, b.pk, (""*VALUES*"".column1)::point_type, b.secondary_point_type, b.distance"
Buffers: shared hit=36251
-> Seq Scan on public.points c (cost=0.00..124.00 rows=6000 width=40) (actual time=0.095..4.897 rows=6000 loops=1)
Output: c.geom, c.point_type, c.pk
Buffers: shared hit=64
-> Hash Left Join (cost=0.88..0.98 rows=5 width=48) (actual time=0.726..0.743 rows=5 loops=6000)
" Output: ""*VALUES*"".column1, b.pk, b.secondary_point_type, b.distance"
" Hash Cond: ((""*VALUES*"".column1)::point_type = b.secondary_point_type)"
Buffers: shared hit=36187
" -> Values Scan on ""*VALUES*"" (cost=0.00..0.06 rows=5 width=32) (actual time=0.001..0.008 rows=5 loops=6000)"
" Output: ""*VALUES*"".column1"
-> Hash (cost=0.87..0.87 rows=1 width=16) (actual time=0.707..0.707 rows=1 loops=6000)
Output: b.pk, b.secondary_point_type, b.distance
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=36187
-> Subquery Scan on b (cost=0.15..0.87 rows=1 width=16) (actual time=0.701..0.703 rows=1 loops=6000)
Output: b.pk, b.secondary_point_type, b.distance
Buffers: shared hit=36187
-> Limit (cost=0.15..0.86 rows=1 width=52) (actual time=0.700..0.700 rows=1 loops=6000)
Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, ((c.geom <-> c2.geom))
Buffers: shared hit=36187
-> Index Scan using points_geom_idx on public.points c2 (cost=0.15..4249.52 rows=5999 width=52) (actual time=0.695..0.695 rows=1 loops=6000)
Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, (c.geom <-> c2.geom)
Order By: (c2.geom <-> c.geom)
Filter: (c.pk <> c2.pk)
Rows Removed by Filter: 1
Buffers: shared hit=36187
Settings: search_path = 'public, topology, tiger'
Planning Time: 3.206 ms
Execution Time: 4549.481 ms
您的第二个索引需要是 GiST 类型,就像您的第一个索引一样。为此,您需要扩展 btree_gist 的帮助。
CREATE EXTENSION btree_gist;
CREATE INDEX points_dual ON points using gist (point_type, geom);