PostgreSQL not using index for NOT IN query

Question

I created two tables, `table1(id, cola, colb)` and `table2(id, cola, colb)`.

On both tables I defined a multicolumn unique index on `cola` and `colb`. I created about 30 million rows in table1 and copied the same rows into table2. Then I deleted about 60K rows from table2.

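I wrote a query to get all rows of table1 that are `NOT IN` table2, and I would like to know why the query does not use the index (it takes 37 seconds). Also, when I write the same query to get all rows of table1 that are `IN` table2, it runs much faster (only 16 seconds).

I don't understand very well how indexes work, but I am wondering whether someone can offer insight into what is happening.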

Table 1

CREATE TABLE table1 (
   id BIGSERIAL NOT NULL PRIMARY KEY,
   cola INTEGER NOT NULL,
   colb VARCHAR,
   
   CONSTRAINT unique_table1_cola_colb UNIQUE (cola, colb)
);

Table 2

CREATE TABLE table2 (
   id BIGSERIAL NOT NULL PRIMARY KEY,
   cola INTEGER NOT NULL,
   colb VARCHAR,
   
   CONSTRAINT unique_table2_cola_colb UNIQUE (cola, colb)
);

Getting row counts

=> select count(*) from table1;
  count   
----------
 30000000
(1 row)

Time: 1010.109 ms (00:01.010)
=> 
=> select count(*) from table2;
  count   
----------
 29699970
(1 row)

Time: 1022.435 ms (00:01.022)
=>

Getting rows that are in table1 but NOT in table2: takes 37 seconds

=> EXPLAIN ANALYSE
-> SELECT cola, colb
-> FROM table1
-> WHERE (cola, colb)
-> NOT IN (SELECT cola, colb
(>                FROM table2);
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=534459.46..1440927.29 rows=1753542 width=15) (actual time=7505.507..37367.405 rows=300030 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Hash Anti Join  (cost=533459.46..1264573.09 rows=730642 width=15) (actual time=7506.407..36919.080 rows=100010 loops=3)
         Hash Cond: (table1.cola = table2.cola)
         Join Filter: (((table1.colb)::text = (table2.colb)::text) OR (table2.colb IS NULL) OR ((table1.colb)::text IS NULL))
         Rows Removed by Join Filter: 143549855
         ->  Parallel Seq Scan on table1  (cost=0.00..316248.68 rows=12501468 width=15) (actual time=0.018..1351.725 rows=10000000 loops=3)
         ->  Parallel Hash  (cost=316211.98..316211.98 rows=12497798 width=15) (actual time=3735.958..3735.959 rows=9899990 loops=3)
               Buckets: 262144  Batches: 256  Memory Usage: 7840kB
               ->  Parallel Seq Scan on table2  (cost=0.00..316211.98 rows=12497798 width=15) (actual time=0.013..1345.660 rows=9899990 loops=3)
 Planning Time: 0.861 ms
 Execution Time: 37382.576 ms
(13 rows)

Time: 37501.184 ms (00:37.501)

Getting rows that are in table1 and also in table2: takes 16 seconds

=> EXPLAIN ANALYSE
-> SELECT cola, colb
-> FROM table1
-> WHERE (cola, colb)
-> IN (SELECT cola, colb
(>     FROM table2);
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=565703.95..1130699.24 rows=29 width=15) (actual time=8322.403..14819.665 rows=29699970 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Hash Join  (cost=564703.95..1129696.34 rows=12 width=15) (actual time=8312.194..12729.173 rows=9899990 loops=3)
         Hash Cond: ((table1.cola = table2.cola) AND ((table1.colb)::text = (table2.colb)::text))
         ->  Parallel Seq Scan on table1  (cost=0.00..316248.68 rows=12501468 width=15) (actual time=0.016..1366.878 rows=10000000 loops=3)
         ->  Parallel Hash  (cost=316211.98..316211.98 rows=12497798 width=15) (actual time=4142.989..4142.990 rows=9899990 loops=3)
               Buckets: 262144  Batches: 256  Memory Usage: 7552kB
               ->  Parallel Seq Scan on table2  (cost=0.00..316211.98 rows=12497798 width=15) (actual time=0.013..1376.971 rows=9899990 loops=3)
 Planning Time: 0.399 ms
 Execution Time: 16136.308 ms
(11 rows)

Time: 16253.777 ms (00:16.254)

I expected the index to be used, and `NOT IN` and `IN` to perform about the same.

Answer 1

`NOT IN` cannot be optimized well, and it also hides a logic trap with null values! Seeing that your `colb` can actually be NULL, `NOT IN` is a bad idea.
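The NULL trap can be seen without any tables. The following is a minimal sketch with made-up literal values (not the asker's data): as soon as the list contains a NULL, `NOT IN` can no longer evaluate to true, because the comparison with NULL yields "unknown". This is also why the `NOT IN` plan in the question carries the extra `IS NULL` conditions in its Join Filter.

```sql
-- 3 is not in the list, yet no row comes back:
-- 3 <> NULL is unknown, so the whole NOT IN predicate is unknown.
SELECT 'returned' AS result WHERE 3 NOT IN (1, 2, NULL);   -- zero rows

-- Without the NULL the predicate is true.
SELECT 'returned' AS result WHERE 3 NOT IN (1, 2);         -- one row

-- NOT EXISTS only asks whether a matching row exists,
-- so the NULL does not get in the way.
SELECT 'returned' AS result
WHERE  NOT EXISTS (
   SELECT FROM (VALUES (1), (2), (NULL)) v(x)
   WHERE  v.x = 3
   );                                                      -- one row
```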

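For this task, it is almost always better to use one of the other basic SQL techniques. See:

Select rows which are not present in other table

`NOT EXISTS` tends to be the best choice: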

SELECT cola, colb
FROM   table1 t1
WHERE  NOT EXISTS (
   SELECT FROM table2 t2
   WHERE (t2.cola, t2.colb) = (t1.cola, t1.colb)
   );

You will see a faster plan using index(-only) scans, like:

'Merge Anti Join  (cost=11.30..1452575.86 rows=6764677 width=11)'
'  Merge Cond: ((t1.cola = t2.cola) AND ((t1.colb)::text = (t2.colb)::text))'
'  ->  Index Only Scan using unique_table1_cola_colb on table1 t1  (cost=0.56..577013.64 rows=30000032 width=11)'
'  ->  Index Only Scan using unique_table2_cola_colb on table2 t2  (cost=0.56..575877.16 rows=29939960 width=11)'
'JIT:'
'  Functions: 5'
'  Options: Inlining true, Optimization true, Expressions true, Deforming true'

That is the output of a basic `EXPLAIN` in Postgres 16 with 30M rows, 60K of them missing, on an untuned setup.
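Another of the basic techniques is `LEFT JOIN` / `IS NULL`. What follows is only a sketch against the tables from the question: its plan and timing were not measured here, so whether it matches `NOT EXISTS` on your data is an assumption to check with `EXPLAIN`.

```sql
-- Anti-join written as LEFT JOIN / IS NULL.
-- table2.cola is declared NOT NULL, so after the outer join it can
-- only be NULL for table1 rows that found no partner in table2.
SELECT t1.cola, t1.colb
FROM   table1 t1
LEFT   JOIN table2 t2 ON  t2.cola = t1.cola
                      AND t2.colb = t1.colb
WHERE  t2.cola IS NULL;
```

Like the `NOT EXISTS` query, this returns table1 rows whose `colb` is NULL, since a NULL never compares equal to anything in table2.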
