KNN 使用索引在 postgis 中跨类别

问题描述 投票:0回答:1

我有一个不同类型点的数据集。对于数据集中的每个点,我想找到每个类别中最接近的点。我可以实现这一目标,但计算时间非常长,而且我正在努力让查询以有效的方式使用 KNN 的空间索引和类型信息。

样本数据生成

CREATE TYPE point_type AS ENUM ('1','2','3','4','5');

CREATE TABLE points AS
  SELECT ST_MakePoint(
    1000*random(),
    1000*random()
    )::geometry(Point) AS geom,
     ((random()*3)::int+1)::text::point_type  point_type,
         pk
  FROM generate_series(1,6000) pk;
update points
set point_type='5' where pk=999;

索引创建

create index points_geom_idx
    on points using gist (geom);

CREATE INDEX points_dual ON points (point_type, geom);

查询有效,但速度很慢,但有效

因为距离 KNN 首先被拉动,然后通过类型约束进行过滤?

explain analyse
with types as (
select column1::point_type point_type from (
values
('1'), ('2'), ('3'), ('4'),('5')
       )
)
SELECT c1.point_type,
       c1.pk AS main_id,
       b.pk  AS secondary_id,
       c1.secondary_point_type,
       b.secondary_point_type,
       b.distance
FROM (SELECT c.point_type,
             c.pk,
             c.geom,
             types.point_type secondary_point_type
      FROM  points c
          join types on true
          ) c1

         LEFT JOIN LATERAL ( SELECT c2.point_type,
                                    c2.geom,
                                    c2.pk,
                                    c2.point_type secondary_point_type,
                                    c1.geom <->c2.geom AS distance
                             FROM points c2

         where c1.pk <> c2.pk          and c1.secondary_point_type=c2.point_type

                             ORDER BY distance
                             LIMIT 1)  b on true;

查询速度非常快,但无法提供正确的结果 我相信这是因为它只是获取最近的点,如果该点的类型不正确,则连接最终会失败,因此不会连接任何数据,为大多数结果留下空值

explain analyse
with types as (
select column1::point_type point_type from (
values
('1'), ('2'), ('3'), ('4'),('5')
       )
)
SELECT c1.point_type,
       c1.pk AS main_id,
       b.pk  AS secondary_id,
       c1.secondary_point_type,
       b.secondary_point_type,
       b.distance
FROM (SELECT c.point_type,
             c.pk,
             c.geom,
             types.point_type secondary_point_type
      FROM  points c
          join types on true
          ) c1

         LEFT JOIN LATERAL ( SELECT c2.point_type,
                                    c2.geom,
                                    c2.pk,
                                    c2.point_type secondary_point_type,
                                    c1.geom <->c2.geom AS distance
                             FROM points c2

         where c1.pk <> c2.pk
                             ORDER BY distance
                             LIMIT 1)  b on c1.secondary_point_type=b.secondary_point_type ;

我正在尝试快速实现此查询,对所有类型的所有 knn 度量使用空间索引。谢谢!

用于分析的输出 第一个查询:

Sort  (cost=29155.39..29230.39 rows=30000 width=28) (actual time=24533.167..24543.539 rows=30000 loops=1)
"  Output: c.point_type, c.pk, c2.pk, ((""*VALUES*"".column1)::point_type), c2.point_type, ((c.geom <-> c2.geom))"
  Sort Key: c2.point_type DESC
  Sort Method: quicksort  Memory: 2409kB
  Buffers: shared hit=180999
  ->  Nested Loop Left Join  (cost=0.15..26924.49 rows=30000 width=28) (actual time=5.024..24430.122 rows=30000 loops=1)
"        Output: c.point_type, c.pk, c2.pk, (""*VALUES*"".column1)::point_type, c2.point_type, ((c.geom <-> c2.geom))"
        Buffers: shared hit=180999
        ->  Nested Loop  (cost=0.00..499.07 rows=30000 width=72) (actual time=0.546..105.076 rows=30000 loops=1)
"              Output: c.point_type, c.pk, c.geom, ""*VALUES*"".column1"
              Buffers: shared hit=64
              ->  Seq Scan on public.points c  (cost=0.00..124.00 rows=6000 width=40) (actual time=0.341..12.850 rows=6000 loops=1)
                    Output: c.geom, c.point_type, c.pk
                    Buffers: shared hit=64
              ->  Materialize  (cost=0.00..0.09 rows=5 width=32) (actual time=0.001..0.006 rows=5 loops=6000)
"                    Output: ""*VALUES*"".column1"
"                    ->  Values Scan on ""*VALUES*""  (cost=0.00..0.06 rows=5 width=32) (actual time=0.034..0.141 rows=5 loops=1)"
"                          Output: ""*VALUES*"".column1"
        ->  Limit  (cost=0.15..0.86 rows=1 width=52) (actual time=0.802..0.803 rows=1 loops=30000)
              Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, ((c.geom <-> c2.geom))
              Buffers: shared hit=180935
              ->  Result  (cost=0.15..4249.52 rows=5999 width=52) (actual time=0.800..0.800 rows=1 loops=30000)
                    Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, (c.geom <-> c2.geom)
"                    One-Time Filter: ((""*VALUES*"".column1)::point_type = (""*VALUES*"".column1)::point_type)"
                    Buffers: shared hit=180935
                    ->  Index Scan using points_geom_idx on public.points c2  (cost=0.15..500.15 rows=5999 width=40) (actual time=0.787..0.787 rows=1 loops=30000)
                          Output: c2.geom, c2.point_type, c2.pk
                          Order By: (c2.geom <-> c.geom)
                          Filter: (c.pk <> c2.pk)
                          Rows Removed by Filter: 1
                          Buffers: shared hit=180935
Settings: search_path = 'public, topology, tiger'
Planning Time: 4.964 ms
Execution Time: 24553.107 ms

第二个查询:

QUERY PLAN
Nested Loop  (cost=0.88..1197.38 rows=30000 width=28) (actual time=3.535..4538.832 rows=30000 loops=1)
"  Output: c.point_type, c.pk, b.pk, (""*VALUES*"".column1)::point_type, b.secondary_point_type, b.distance"
  Buffers: shared hit=36251
  ->  Seq Scan on public.points c  (cost=0.00..124.00 rows=6000 width=40) (actual time=0.095..4.897 rows=6000 loops=1)
        Output: c.geom, c.point_type, c.pk
        Buffers: shared hit=64
  ->  Hash Left Join  (cost=0.88..0.98 rows=5 width=48) (actual time=0.726..0.743 rows=5 loops=6000)
"        Output: ""*VALUES*"".column1, b.pk, b.secondary_point_type, b.distance"
"        Hash Cond: ((""*VALUES*"".column1)::point_type = b.secondary_point_type)"
        Buffers: shared hit=36187
"        ->  Values Scan on ""*VALUES*""  (cost=0.00..0.06 rows=5 width=32) (actual time=0.001..0.008 rows=5 loops=6000)"
"              Output: ""*VALUES*"".column1"
        ->  Hash  (cost=0.87..0.87 rows=1 width=16) (actual time=0.707..0.707 rows=1 loops=6000)
              Output: b.pk, b.secondary_point_type, b.distance
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              Buffers: shared hit=36187
              ->  Subquery Scan on b  (cost=0.15..0.87 rows=1 width=16) (actual time=0.701..0.703 rows=1 loops=6000)
                    Output: b.pk, b.secondary_point_type, b.distance
                    Buffers: shared hit=36187
                    ->  Limit  (cost=0.15..0.86 rows=1 width=52) (actual time=0.700..0.700 rows=1 loops=6000)
                          Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, ((c.geom <-> c2.geom))
                          Buffers: shared hit=36187
                          ->  Index Scan using points_geom_idx on public.points c2  (cost=0.15..4249.52 rows=5999 width=52) (actual time=0.695..0.695 rows=1 loops=6000)
                                Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, (c.geom <-> c2.geom)
                                Order By: (c2.geom <-> c.geom)
                                Filter: (c.pk <> c2.pk)
                                Rows Removed by Filter: 1
                                Buffers: shared hit=36187
Settings: search_path = 'public, topology, tiger'
Planning Time: 3.206 ms
Execution Time: 4549.481 ms
postgresql postgis knn spatial-index lateral-join
1个回答
0
投票

您的第二个索引需要是 GiST 类型,就像您的第一个索引一样。为此,您需要扩展 btree_gist 的帮助。

CREATE EXTENSION btree_gist;
CREATE INDEX points_dual ON points using gist (point_type, geom);
© www.soinside.com 2019 - 2024. All rights reserved.