PostgreSQL does not use an index for a NOT IN query


I created two tables:

  • table1(id, cola, colb)
  • table2(id, cola, colb)

On both tables I defined a multicolumn unique index on (cola, colb). I inserted about 30 million rows into table1 and copied the same rows into table2. Then I deleted about 60K rows from table2.

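I wrote a query to get all rows of table1 that are NOT IN table2, and I would like to know why the query does not use the index (it takes 37 seconds). Also, when I write the same query to get all rows of table1 that are IN table2, it runs much faster (only 16 seconds).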

I don't understand very well how indexes work, so I am hoping someone can provide insight into what is happening here.

Table 1

CREATE TABLE table1 (
   id BIGSERIAL NOT NULL PRIMARY KEY,
   cola INTEGER NOT NULL,
   colb VARCHAR,
   
   CONSTRAINT unique_table1_cola_colb UNIQUE (cola, colb)
);

Table 2

CREATE TABLE table2 (
   id BIGSERIAL NOT NULL PRIMARY KEY,
   cola INTEGER NOT NULL,
   colb VARCHAR,
   
   CONSTRAINT unique_table2_cola_colb UNIQUE (cola, colb)
);
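The post does not show how the rows were generated. For anyone who wants to reproduce the setup, a minimal sketch could look like the following. The value pattern for cola and colb and the id-based delete rule are assumptions made for illustration, not the original data.

```sql
-- Hypothetical test data: 30 million distinct (cola, colb) pairs.
-- The modulo and the 'val_' prefix are arbitrary choices.
INSERT INTO table1 (cola, colb)
SELECT g % 1000, 'val_' || g
FROM   generate_series(1, 30000000) AS g;

-- Copy the same rows into table2.
INSERT INTO table2 (cola, colb)
SELECT cola, colb
FROM   table1;

-- Delete roughly 60K rows from table2 (arbitrary rule: every 500th id).
DELETE FROM table2
WHERE  id % 500 = 0;

-- Refresh planner statistics and the visibility map,
-- which index-only scans rely on.
VACUUM ANALYZE table1;
VACUUM ANALYZE table2;
```

Running VACUUM ANALYZE after bulk loading and deleting matters for any timing comparison, since stale statistics or an unset visibility map can change which plan the planner picks.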

Row counts

=> select count(*) from table1;
  count   
----------
 30000000
(1 row)

Time: 1010.109 ms (00:01.010)
=> 
=> select count(*) from table2;
  count   
----------
 29699970
(1 row)

Time: 1022.435 ms (00:01.022)
=> 

Get the rows that are in table1 but not in table2: takes 37 seconds

=> EXPLAIN ANALYSE
-> SELECT cola, colb
-> FROM table1
-> WHERE (cola, colb)
-> NOT IN (SELECT cola, colb
(>                FROM table2);
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=534459.46..1440927.29 rows=1753542 width=15) (actual time=7505.507..37367.405 rows=300030 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Hash Anti Join  (cost=533459.46..1264573.09 rows=730642 width=15) (actual time=7506.407..36919.080 rows=100010 loops=3)
         Hash Cond: (table1.cola = table2.cola)
         Join Filter: (((table1.colb)::text = (table2.colb)::text) OR (table2.colb IS NULL) OR ((table1.colb)::text IS NULL))
         Rows Removed by Join Filter: 143549855
         ->  Parallel Seq Scan on table1  (cost=0.00..316248.68 rows=12501468 width=15) (actual time=0.018..1351.725 rows=10000000 loops=3)
         ->  Parallel Hash  (cost=316211.98..316211.98 rows=12497798 width=15) (actual time=3735.958..3735.959 rows=9899990 loops=3)
               Buckets: 262144  Batches: 256  Memory Usage: 7840kB
               ->  Parallel Seq Scan on table2  (cost=0.00..316211.98 rows=12497798 width=15) (actual time=0.013..1345.660 rows=9899990 loops=3)
 Planning Time: 0.861 ms
 Execution Time: 37382.576 ms
(13 rows)

Time: 37501.184 ms (00:37.501)

Get the rows that are in table1 and also in table2: takes 16 seconds

=> EXPLAIN ANALYSE
-> SELECT cola, colb
-> FROM table1
-> WHERE (cola, colb)
-> IN (SELECT cola, colb
(>     FROM table2);
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=565703.95..1130699.24 rows=29 width=15) (actual time=8322.403..14819.665 rows=29699970 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Hash Join  (cost=564703.95..1129696.34 rows=12 width=15) (actual time=8312.194..12729.173 rows=9899990 loops=3)
         Hash Cond: ((table1.cola = table2.cola) AND ((table1.colb)::text = (table2.colb)::text))
         ->  Parallel Seq Scan on table1  (cost=0.00..316248.68 rows=12501468 width=15) (actual time=0.016..1366.878 rows=10000000 loops=3)
         ->  Parallel Hash  (cost=316211.98..316211.98 rows=12497798 width=15) (actual time=4142.989..4142.990 rows=9899990 loops=3)
               Buckets: 262144  Batches: 256  Memory Usage: 7552kB
               ->  Parallel Seq Scan on table2  (cost=0.00..316211.98 rows=12497798 width=15) (actual time=0.013..1376.971 rows=9899990 loops=3)
 Planning Time: 0.399 ms
 Execution Time: 16136.308 ms
(11 rows)

Time: 16253.777 ms (00:16.254)

I expected the index to be used, and NOT IN and IN to perform about the same.

sql database postgresql performance indexing
1 Answer

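NOT IN cannot be optimized well, and that is on top of the logic trap it hides with null values! Seeing that your colb can actually be null, NOT IN is a bad idea.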
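The null trap can be shown with a tiny self-contained sketch. The inline VALUES lists and the aliases t, s, x and y are made up for illustration and stand in for the real tables:

```sql
-- NOT IN: a single NULL in the subquery result means no row can qualify.
-- "2 NOT IN (1, NULL)" evaluates to NULL rather than true.
SELECT x
FROM   (VALUES (1), (2)) AS t(x)
WHERE  x NOT IN (SELECT y FROM (VALUES (1), (NULL::int)) AS s(y));
-- returns 0 rows

-- NOT EXISTS: the NULL simply never matches, so 2 is returned.
SELECT x
FROM   (VALUES (1), (2)) AS t(x)
WHERE  NOT EXISTS (
   SELECT FROM (VALUES (1), (NULL::int)) AS s(y)
   WHERE  s.y = t.x
   );
-- returns 1 row: 2
```

The same semantics appear to be what forces the extra Join Filter with the IS NULL checks in the NOT IN plan from the question, where only cola is used as the hash condition.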

For this task, it is almost always better to use one of the other basic SQL techniques.

NOT EXISTS tends to be the best choice:

SELECT cola, colb
FROM   table1 t1
WHERE  NOT EXISTS (
   SELECT FROM table2 t2
   WHERE (t2.cola, t2.colb) = (t1.cola, t1.colb)
   );

You will see a faster plan using index(-only) scans, such as:

'Merge Anti Join  (cost=11.30..1452575.86 rows=6764677 width=11)'
'  Merge Cond: ((t1.cola = t2.cola) AND ((t1.colb)::text = (t2.colb)::text))'
'  ->  Index Only Scan using unique_table1_cola_colb on table1 t1  (cost=0.56..577013.64 rows=30000032 width=11)'
'  ->  Index Only Scan using unique_table2_cola_colb on table2 t2  (cost=0.56..575877.16 rows=29939960 width=11)'
'JIT:'
'  Functions: 5'
'  Options: Inlining true, Optimization true, Expressions true, Deforming true'

This is the result of a basic EXPLAIN in Postgres 16 with 30M rows, of which 60K are missing, in an unoptimized setup.
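For comparison, two of the other basic techniques could be sketched as follows, using the table and column names from the question. Whether either one beats NOT EXISTS on a given setup is something to verify with EXPLAIN ANALYZE rather than assume.

```sql
-- LEFT JOIN / IS NULL: keep table1 rows that found no partner in table2.
-- t2.id is NOT NULL (primary key), so it is a safe column to test.
SELECT t1.cola, t1.colb
FROM   table1 t1
LEFT   JOIN table2 t2 ON  t2.cola = t1.cola
                      AND t2.colb = t1.colb
WHERE  t2.id IS NULL;

-- EXCEPT: set difference on the selected columns.
SELECT cola, colb FROM table1
EXCEPT
SELECT cola, colb FROM table2;
```

The two are not fully interchangeable: EXCEPT treats null values as equal and removes duplicate rows from the result, while the LEFT JOIN variant (like NOT EXISTS) never matches rows where colb is null. With a nullable colb, the results can therefore differ.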
