Postgres ignores my GIN index for similarity queries


In my table (about 12 million rows), I have a text field named full_name and an id.

I also have the following indexes:

create index entities_full_name_gin_trgm_ops_index
  on temp.entities
  using gin(full_name gin_trgm_ops);

create index entities_id_index ON temp.entities (id);

I am trying to run the following query:

select lhs.id, lhs.full_name, rhs.id, rhs.full_name
from temp.entities as lhs
left join lateral (
  select id, full_name
  from temp.entities
  where full_name % lhs.full_name and id < lhs.id
  limit 1
) as rhs on true
order by lhs.id desc
limit 10;

Basically, for each entity this query searches for another entity that has a similar name and an id older than the current one (I am using UUIDv7).
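For context, the % operator used in the query comes from the pg_trgm extension and returns true when the trigram similarity of its two arguments is greater than pg_trgm.similarity_threshold (0.3 by default). A minimal sketch for checking how two names compare; both name literals are made-up examples, not values from the table:

```sql
-- pg_trgm provides %, similarity() and the gin_trgm_ops operator class.
create extension if not exists pg_trgm;

-- Threshold that % compares against (0.3 unless it has been changed).
show pg_trgm.similarity_threshold;

-- Hypothetical names, only for illustration.
select similarity('John Smith', 'Jon Smith') as sim,
       'John Smith' % 'Jon Smith'            as is_similar;
```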

This query produces the following plan.

Update 1: added the query plan obtained with explain (analyze, verbose, buffers, settings).

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=1.12..66.24 rows=10 width=68) (actual time=22616.601..289628.893 rows=10 loops=1)
   Output: lhs.id, lhs.full_name, entities.id, entities.full_name
   Buffers: shared hit=64213445 read=2646221 written=1696
   ->  Nested Loop Left Join  (cost=1.12..86936338.39 rows=13350926 width=68) (actual time=22616.600..289628.886 rows=10 loops=1)
         Output: lhs.id, lhs.full_name, entities.id, entities.full_name
         Buffers: shared hit=64213445 read=2646221 written=1696
         ->  Index Scan Backward using entities_id_index on temp.entities lhs  (cost=0.56..719298.01 rows=13350926 width=34) (actual time=0.035..0.055 rows=10 loops=1)
               Output: lhs.id, lhs.first_name, lhs.middle_name, lhs.last_name, lhs.name_suffix, lhs.full_name, lhs.address_house_number, lhs.address_street_direction, lhs.address_street_name, lhs.address_street_suffix, lhs.address_street_post_direction, lhs.address_unit_prefix, lhs.address_unit_value, lhs.address_city, lhs.address_state, lhs.address_zip, lhs.address_zip_4, lhs.address_legacy, lhs.address_normalized, lhs.address, lhs.status, lhs.total_records, lhs.total_buy_records, lhs.total_sell_records, lhs."fix_and_flip?", lhs."buy_and_hold?", lhs."wholesaler?", lhs.last_year, lhs.last_year_buy_records, lhs.last_year_buy_transfer_amount, lhs.last_year_sell_records, lhs.last_year_sell_transfer_amount, lhs.current_year, lhs.current_year_buy_records, lhs.current_year_buy_transfer_amount, lhs.current_year_sell_records, lhs.current_year_sell_transfer_amount, lhs.score, lhs.score_version, lhs.inserted_at, lhs.updated_at
               Buffers: shared hit=2 read=8
         ->  Limit  (cost=0.56..6.45 rows=1 width=34) (actual time=28962.880..28962.880 rows=1 loops=10)
               Output: entities.id, entities.full_name
               Buffers: shared hit=64213443 read=2646213 written=1696
               ->  Index Scan using entities_id_index on temp.entities  (cost=0.56..262023.42 rows=44503 width=34) (actual time=28962.878..28962.878 rows=1 loops=10)
                     Output: entities.id, entities.full_name
                     Index Cond: (entities.id < lhs.id)
                     Filter: ((entities.full_name)::text % (lhs.full_name)::text)
                     Rows Removed by Filter: 8771697
                     Buffers: shared hit=64213443 read=2646213 written=1696
 Settings: temp_buffers = '64MB', work_mem = '128MB', max_parallel_workers_per_gather = '8', enable_seqscan = 'off'
 Planning:
   Buffers: shared hit=16 read=2
 Planning Time: 0.364 ms
 Execution Time: 289628.935 ms
(23 rows)

As you can see, Postgres uses my btree index instead of my GIN index, which makes this query extremely slow.

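Update 2: Following Laurenz Albe's suggestion, I added an order by to force the plan to use my index. This works, but the query is still very slow: it takes about 6 seconds to complete with limit 10, and I want to run the real query with a limit of 1000 or 10000, which would be much slower.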

Here is the new query:

select lhs.id, lhs.full_name, rhs.id, rhs.full_name
from temp.entities as lhs
left join lateral (
  select id, full_name
  from temp.entities
  where full_name % lhs.full_name and id < lhs.id
  order by full_name
  limit 1
) as rhs on true
order by lhs.id desc
limit 10;

Here is the resulting plan:

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=224541.60..2469952.55 rows=10 width=68) (actual time=687.770..6063.586 rows=10 loops=1)
   Output: lhs.id, lhs.full_name, entities.id, entities.full_name
   Buffers: shared hit=479653 read=64097 written=295
   ->  Nested Loop Left Join  (cost=224541.60..2997831757508.70 rows=13350926 width=68) (actual time=687.769..6063.580 rows=10 loops=1)
         Output: lhs.id, lhs.full_name, entities.id, entities.full_name
         Buffers: shared hit=479653 read=64097 written=295
         ->  Index Scan Backward using entities_id_index on temp.entities lhs  (cost=0.56..719298.01 rows=13350926 width=34) (actual time=0.026..0.068 rows=10 loops=1)
               Output: lhs.id, lhs.first_name, lhs.middle_name, lhs.last_name, lhs.name_suffix, lhs.full_name, lhs.address_house_number, lhs.address_street_direction, lhs.address_street_name, lhs.address_street_suffix, lhs.address_street_post_direction, lhs.address_unit_prefix, lhs.address_unit_value, lhs.address_city, lhs.address_state, lhs.address_zip, lhs.address_zip_4, lhs.address_legacy, lhs.address_normalized, lhs.address, lhs.status, lhs.total_records, lhs.total_buy_records, lhs.total_sell_records, lhs."fix_and_flip?", lhs."buy_and_hold?", lhs."wholesaler?", lhs.last_year, lhs.last_year_buy_records, lhs.last_year_buy_transfer_amount, lhs.last_year_sell_records, lhs.last_year_sell_transfer_amount, lhs.current_year, lhs.current_year_buy_records, lhs.current_year_buy_transfer_amount, lhs.current_year_sell_records, lhs.current_year_sell_transfer_amount, lhs.score, lhs.score_version, lhs.inserted_at, lhs.updated_at
               Buffers: shared hit=4 read=6
         ->  Limit  (cost=224541.04..224541.05 rows=1 width=34) (actual time=606.348..606.348 rows=1 loops=10)
               Output: entities.id, entities.full_name
               Buffers: shared hit=479649 read=64091 written=295
               ->  Sort  (cost=224541.04..224652.30 rows=44503 width=34) (actual time=606.346..606.346 rows=1 loops=10)
                     Output: entities.id, entities.full_name
                     Sort Key: entities.full_name
                     Sort Method: quicksort  Memory: 25kB
                     Buffers: shared hit=479649 read=64091 written=295
                     ->  Bitmap Heap Scan on temp.entities  (cost=102831.00..224318.53 rows=44503 width=34) (actual time=606.318..606.336 rows=2 loops=10)
                           Output: entities.id, entities.full_name
                           Recheck Cond: (((entities.full_name)::text % (lhs.full_name)::text) AND (entities.id < lhs.id))
                           Rows Removed by Index Recheck: 1
                           Heap Blocks: exact=30
                           Buffers: shared hit=479649 read=64091 written=295
                           ->  BitmapAnd  (cost=102831.00..102831.00 rows=44503 width=0) (actual time=606.297..606.297 rows=0 loops=10)
                                 Buffers: shared hit=479644 read=64066 written=295
                                 ->  Bitmap Index Scan on entities_full_name_gin_trgm_ops_index  (cost=0.00..882.62 rows=133509 width=0) (actual time=83.707..83.707 rows=5 loops=10)
                                       Index Cond: ((entities.full_name)::text % (lhs.full_name)::text)
                                       Buffers: shared hit=20153 read=11997 written=40
                                 ->  Bitmap Index Scan on entities_id_index  (cost=0.00..101925.88 rows=4450309 width=0) (actual time=522.579..522.579 rows=13350920 loops=10)
                                       Index Cond: (entities.id < lhs.id)
                                       Buffers: shared hit=459491 read=52069 written=255
 Settings: temp_buffers = '64MB', work_mem = '128MB', max_parallel_workers_per_gather = '8', enable_seqscan = 'off'
 Planning:
   Buffers: shared read=1
 Planning Time: 0.140 ms
 Execution Time: 6063.627 ms
(36 rows)

Are there any other suggestions for speeding this up?

postgresql database-indexes pg-trgm
1 Answer

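The planner has to plan the query without knowing which string will be searched for in the lateral subquery, so it has to make generic assumptions about how many rows will be returned.

The generic assumption for % is 1/100, and for the id inequality it is 1/3. Usually, overestimating the row count is the conservative choice, but with a LIMIT the opposite is true: because the planner thinks there are many matching rows, it expects to satisfy the LIMIT easily and then stop.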
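These generic fractions can be checked against the row estimates in the question's plans. The sketch below is just arithmetic on figures copied from those plans (13350926 estimated table rows, and the 0.56..262023.42 cost of the inner index scan); the last column assumes the usual way a LIMIT node is costed, which is the startup cost plus the fraction of the run cost needed to fetch the requested rows:

```sql
-- All inputs are numbers taken from the EXPLAIN output in the question.
select round(13350926 * 0.01)       as est_similarity_rows, -- compare with rows=133509
       round(13350926 / 3.0)        as est_id_rows,         -- compare with rows=4450309
       round(13350926 * 0.01 / 3.0) as est_combined_rows,   -- compare with rows=44503
       -- startup cost + (total cost - startup cost) * (1 row wanted / 44503 rows expected)
       round(0.56 + (262023.42 - 0.56) / 44503, 2) as est_limit_cost; -- compare with cost=0.56..6.45
```

In the first plan the planner therefore expected to find a match after reading a tiny fraction of the index, whereas the actual run discarded 8771697 rows per loop before finding one.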

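When you pull the subquery out and run it with literal values, the planner can use those literals for planning. (In this case that does not really help for %; the literal merely prompts it to use a different generic estimate that is more selective.) If you want to see what the generic plan looks like, you can do this: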

explain (generic_plan) select id, full_name
  from temp.entities
  where full_name % $1 and id < $2
  limit 1;
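Note that the generic_plan option of explain is only available from PostgreSQL 16. On older versions (12 and later), a roughly equivalent check is to prepare the statement and force a generic plan with plan_cache_mode. This is a sketch under assumptions: the statement name similar_entity is made up, id is assumed to be of type uuid, and the two arguments are placeholder values that do not come from the question.

```sql
-- Assumes id is of type uuid; adjust the parameter types to your schema.
prepare similar_entity(text, uuid) as
  select id, full_name
  from temp.entities
  where full_name % $1 and id < $2
  limit 1;

-- Make the planner use the generic plan instead of a custom plan.
set plan_cache_mode = force_generic_plan;

-- Placeholder arguments; with a generic plan their values do not influence planning.
explain execute similar_entity('John Smith', '01900000-0000-7000-8000-000000000000');

-- Clean up the session state.
reset plan_cache_mode;
deallocate similar_entity;
```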

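In the new query with the gratuitous ORDER BY, most of the time is spent building the huge bitmap for the id condition. The planner knows this will be slow, so I don't know why it does it. (In my hands it does not do this; it applies that condition as a filter rather than as a bitmap.) You can probably force it to stop using the id index by writing that part of the query like this: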

and id::text < lhs.id::text
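Put into the full query from Update 2, the suggested rewrite might look like the sketch below. It assumes that comparing the text form of the ids orders them the same way as the native comparison does; whether that holds depends on the column type and the collation in use, so it is worth verifying on a sample before relying on the results.

```sql
select lhs.id, lhs.full_name, rhs.id, rhs.full_name
from temp.entities as lhs
left join lateral (
  select id, full_name
  from temp.entities
  -- Casting both sides to text means entities_id_index no longer matches
  -- this condition, so it can only be applied as a filter.
  where full_name % lhs.full_name and id::text < lhs.id::text
  order by full_name
  limit 1
) as rhs on true
order by lhs.id desc
limit 10;
```

Running this under explain (analyze, buffers) should show whether the BitmapAnd with entities_id_index has disappeared from the plan.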

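Since there is no way to force the planner to use a bitmap when it does not want to, and since in my hands it does not want to, I have no way to explore what might be making it choose the BitmapAnd for you. If you can show the plan you get with my suggested rewrite, that might give us some clues.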
