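I created two tables: table1(id, cola, colb) and table2(id, cola, colb). On both tables I defined a multicolumn unique index on cola and colb. I created about 30 million rows in table1 and copied the same rows into table2. Then I deleted about 60K rows from table2.
I wrote a query to get all rows of table1 that are NOT IN table2, and I am wondering why the query does not use the index (it takes 37 seconds). Also, when I write the same query to get all rows of table1 that are IN table2, it runs much faster (only 16 seconds).
I do not understand very well how indexes work, so I am hoping someone can offer some insight into what is going on.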
table1
CREATE TABLE table1 (
id BIGSERIAL NOT NULL PRIMARY KEY,
cola INTEGER NOT NULL,
colb VARCHAR,
CONSTRAINT unique_table1_cola_colb UNIQUE (cola, colb)
);
table2
CREATE TABLE table2 (
id BIGSERIAL NOT NULL PRIMARY KEY,
cola INTEGER NOT NULL,
colb VARCHAR,
CONSTRAINT unique_table2_cola_colb UNIQUE (cola, colb)
);
Row counts
=> select count(*) from table1;
count
----------
30000000
(1 row)
Time: 1010.109 ms (00:01.010)
=>
=> select count(*) from table2;
count
----------
29699970
(1 row)
Time: 1022.435 ms (00:01.022)
=>
Get rows that are in table1 but not in table2: takes 37 seconds
=> EXPLAIN ANALYSE
-> SELECT cola, colb
-> FROM table1
-> WHERE (cola, colb)
-> NOT IN (SELECT cola, colb
(> FROM table2);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=534459.46..1440927.29 rows=1753542 width=15) (actual time=7505.507..37367.405 rows=300030 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Hash Anti Join (cost=533459.46..1264573.09 rows=730642 width=15) (actual time=7506.407..36919.080 rows=100010 loops=3)
Hash Cond: (table1.cola = table2.cola)
Join Filter: (((table1.colb)::text = (table2.colb)::text) OR (table2.colb IS NULL) OR ((table1.colb)::text IS NULL))
Rows Removed by Join Filter: 143549855
-> Parallel Seq Scan on table1 (cost=0.00..316248.68 rows=12501468 width=15) (actual time=0.018..1351.725 rows=10000000 loops=3)
-> Parallel Hash (cost=316211.98..316211.98 rows=12497798 width=15) (actual time=3735.958..3735.959 rows=9899990 loops=3)
Buckets: 262144 Batches: 256 Memory Usage: 7840kB
-> Parallel Seq Scan on table2 (cost=0.00..316211.98 rows=12497798 width=15) (actual time=0.013..1345.660 rows=9899990 loops=3)
Planning Time: 0.861 ms
Execution Time: 37382.576 ms
(13 rows)
Time: 37501.184 ms (00:37.501)
Get rows of table1 that are also in table2: takes 16 seconds
=> EXPLAIN ANALYSE
-> SELECT cola, colb
-> FROM table1
-> WHERE (cola, colb)
-> IN (SELECT cola, colb
(> FROM table2);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=565703.95..1130699.24 rows=29 width=15) (actual time=8322.403..14819.665 rows=29699970 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Hash Join (cost=564703.95..1129696.34 rows=12 width=15) (actual time=8312.194..12729.173 rows=9899990 loops=3)
Hash Cond: ((table1.cola = table2.cola) AND ((table1.colb)::text = (table2.colb)::text))
-> Parallel Seq Scan on table1 (cost=0.00..316248.68 rows=12501468 width=15) (actual time=0.016..1366.878 rows=10000000 loops=3)
-> Parallel Hash (cost=316211.98..316211.98 rows=12497798 width=15) (actual time=4142.989..4142.990 rows=9899990 loops=3)
Buckets: 262144 Batches: 256 Memory Usage: 7552kB
-> Parallel Seq Scan on table2 (cost=0.00..316211.98 rows=12497798 width=15) (actual time=0.013..1376.971 rows=9899990 loops=3)
Planning Time: 0.399 ms
Execution Time: 16136.308 ms
(11 rows)
Time: 16253.777 ms (00:16.254)
I expected the index to be used, and NOT IN and IN to perform roughly the same.
NOT IN cannot be optimized well, and that comes on top of its logic trap with hidden NULL values! Since your colb can actually be NULL, NOT IN is a bad idea.
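To make the NULL trap concrete, here is a minimal sketch on two tiny scratch tables. The names demo1 and demo2 are made up for illustration and are not part of the question. A single NULL in colb on the subquery side is enough to make NOT IN drop a row you would expect to get back, which is also why the Join Filter in the plan above carries the extra IS NULL conditions.

```sql
-- Hypothetical scratch tables, shaped like table1/table2 (colb is nullable).
CREATE TEMP TABLE demo1 (cola integer NOT NULL, colb varchar);
CREATE TEMP TABLE demo2 (cola integer NOT NULL, colb varchar);

INSERT INTO demo1 VALUES (1, 'a'), (2, 'b');
INSERT INTO demo2 VALUES (1, 'a'), (2, NULL);

-- NOT IN: comparing (2, 'b') with (2, NULL) yields NULL, not false,
-- so the NOT IN predicate is NULL for that row and the row is filtered out.
SELECT cola, colb
FROM   demo1
WHERE  (cola, colb) NOT IN (SELECT cola, colb FROM demo2);
-- returns no rows

-- NOT EXISTS: a NULL comparison simply counts as "no match found".
SELECT d1.cola, d1.colb
FROM   demo1 d1
WHERE  NOT EXISTS (
   SELECT FROM demo2 d2
   WHERE  (d2.cola, d2.colb) = (d1.cola, d1.colb)
   );
-- returns (2, 'b')
```

Which of the two results you actually want depends on how you intend to treat NULL, but the NOT IN behavior is rarely what people expect.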
For this task, one of the other basic SQL techniques is almost always better. NOT EXISTS tends to be the best choice:
SELECT cola, colb
FROM table1 t1
WHERE NOT EXISTS (
SELECT FROM table2 t2
WHERE (t2.cola, t2.colb) = (t1.cola, t1.colb)
);
You should see a faster plan using index(-only) scans, like:
'Merge Anti Join (cost=11.30..1452575.86 rows=6764677 width=11)'
' Merge Cond: ((t1.cola = t2.cola) AND ((t1.colb)::text = (t2.colb)::text))'
' -> Index Only Scan using unique_table1_cola_colb on table1 t1 (cost=0.56..577013.64 rows=30000032 width=11)'
' -> Index Only Scan using unique_table2_cola_colb on table2 t2 (cost=0.56..575877.16 rows=29939960 width=11)'
'JIT:'
' Functions: 5'
' Options: Inlining true, Optimization true, Expressions true, Deforming true'
This is the output of a plain EXPLAIN in Postgres 16 for 30M rows with 60K rows missing, on an untuned setup.
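Another of the basic techniques is LEFT JOIN / IS NULL. A minimal sketch against the tables from the question, untested at this scale, so I cannot promise it gets the same plan as NOT EXISTS:

```sql
-- Anti-join written as an outer join: keep table1 rows with no partner in table2.
-- t2.id is the primary key (NOT NULL), so it can only be NULL when no match was found.
SELECT t1.cola, t1.colb
FROM   table1 t1
LEFT   JOIN table2 t2 ON  t2.cola = t1.cola
                      AND t2.colb = t1.colb
WHERE  t2.id IS NULL;
```

Like NOT EXISTS, this treats a row with NULL in colb as unmatched, so it does not suffer from the NOT IN trap.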