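Some background: I am running the following simple select statement on a Postgres 15.3 server with 128 GB of memory and, contrary to what I expected, it takes about 6 minutes. The statement involves two relations: big_table_70m has about 70M rows, and other_table_50m has slightly fewer than 50M rows.
In both tables I have B-tree indexes on esat_last_modified and last_modified.
This takes 6 minutes to return: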
select
    count(*)
FROM
    big_table_70m
where
    big_table_70m.esat_last_modified > (
        select
            max(esat_last_modified)
        from
            other_table_50m
    )
    OR big_table_70m.task_last_modified > (
        select
            max(last_modified)
        from
            other_table_50m
    );
However, when I remove the OR and run the query with either condition on its own, it returns very quickly (around 70 ms):
select
    count(*)
FROM
    big_table_70m
where
    big_table_70m.esat_last_modified > (
        select
            max(esat_last_modified)
        from
            other_table_50m
    );
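I tried adding a multi-column index on (esat_last_modified, last_modified) to make this more efficient, but it did not help.
Can you help me improve this query? The final count should be a few thousand rows.
EXPLAIN ANALYZE output: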
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate (cost=1765894.63..1765894.64 rows=1 width=8) (actual time=403477.117..403477.314 rows=1 loops=1)
   Buffers: shared hit=77762962 read=848381 dirtied=27535
   I/O Timings: shared/local read=380517.629
   InitPlan 2 (returns $1)
     -> Result (cost=0.70..0.71 rows=1 width=8) (actual time=0.028..0.029 rows=1 loops=1)
          Buffers: shared hit=6
          InitPlan 1 (returns $0)
            -> Limit (cost=0.56..0.70 rows=1 width=8) (actual time=0.026..0.027 rows=1 loops=1)
                 Buffers: shared hit=6
                 -> Index Only Scan Backward using f9b1917d68a5656ca14b2f5e12447c6a on other_table_50m (cost=0.56..4671638.13 rows=33810205 width=8) (actual time=0.025..0.026 rows=1 loops=1)
                      Index Cond: (esat_last_modified IS NOT NULL)
                      Heap Fetches: 1
                      Buffers: shared hit=6
   InitPlan 4 (returns $3)
     -> Result (cost=0.67..0.68 rows=1 width=8) (actual time=0.013..0.014 rows=1 loops=1)
          Buffers: shared hit=5
          InitPlan 3 (returns $2)
            -> Limit (cost=0.56..0.67 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=1)
                 Buffers: shared hit=5
                 -> Index Only Scan Backward using "429b05d0a626f21b4708244ab67da408" on other_table_50m other_table_50m_1 (cost=0.56..5092889.03 rows=47176591 width=8) (actual time=0.013..0.013 rows=1 loops=1)
                      Index Cond: (last_modified IS NOT NULL)
                      Heap Fetches: 1
                      Buffers: shared hit=5
   -> Gather (cost=1765893.02..1765893.23 rows=2 width=8) (actual time=403477.112..403477.306 rows=1 loops=1)
        Workers Planned: 2
        Params Evaluated: $1, $3
        Workers Launched: 0
        Buffers: shared hit=77762962 read=848381 dirtied=27535
        I/O Timings: shared/local read=380517.629
        -> Partial Aggregate (cost=1764893.02..1764893.03 rows=1 width=8) (actual time=403476.429..403476.429 rows=1 loops=1)
             Buffers: shared hit=77762951 read=848381 dirtied=27535
             I/O Timings: shared/local read=380517.629
             -> Parallel Index Only Scan using multi_column_index on big_table_70m (cost=0.57..1724191.80 rows=16280489 width=0) (actual time=403476.425..403476.425 rows=0 loops=1)
                  Filter: ((esat_last_modified > $1) OR (task_last_modified > $3))
                  Rows Removed by Filter: 70394276
                  Heap Fetches: 6894859
                  Buffers: shared hit=77762951 read=848381 dirtied=27535
                  I/O Timings: shared/local read=380517.629
 Planning:
   Buffers: shared hit=352
 Planning Time: 0.949 ms
 Execution Time: 403477.593 ms
(42 rows)
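Here is the plan on depesz; it shows that most of the time is spent on I/O (reading from disk), which matches the I/O Timings line: roughly 380 s of the 403 s total.
What I tried: adding a multi-column index on big_table_70m, without success. I also rewrote the select to use CTEs, but that apparently did not change the plan at all.
I expected the statement to return about as fast as it does without the OR condition.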
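OR is often hard for the optimizer to handle: it cannot simply use an index on one of the columns, because that index says nothing about the other column.
Sometimes it can perform an index union, but often it cannot work that out, so you need to give it some help.
Using UNION on the table's primary key should, in theory, allow it to use an efficient merge union and then aggregate again.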
select
    count(*)
from (
    select bg.PrimaryKeyHere
    from big_table_70m bg
    where
        bg.esat_last_modified > (
            select
                max(o.esat_last_modified)
            from
                other_table_50m o
        )
    union
    select bg.PrimaryKeyHere
    from big_table_70m bg
    where
        bg.task_last_modified > (
            select
                max(o.last_modified)
            from
                other_table_50m o
        )
) bg;
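To check whether the rewrite actually changes the plan, you can run it under EXPLAIN, the same way the original plan was captured. This is only a sketch: PrimaryKeyHere is still the placeholder from the query above and has to be replaced with the real primary key column before it will run.

```sql
-- Sketch only: PrimaryKeyHere is a placeholder for the real primary key column.
-- EXPLAIN (ANALYZE, BUFFERS) executes the query and reports timing and buffer usage.
explain (analyze, buffers)
select
    count(*)
from (
    select bg.PrimaryKeyHere
    from big_table_70m bg
    where bg.esat_last_modified > (select max(o.esat_last_modified) from other_table_50m o)
    union
    select bg.PrimaryKeyHere
    from big_table_70m bg
    where bg.task_last_modified > (select max(o.last_modified) from other_table_50m o)
) bg;
```

What you would hope to see is each branch doing its own index scan on big_table_70m with the comparison as an Index Cond, rather than one scan over the whole index with a Filter and "Rows Removed by Filter: 70394276" as in the original plan.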
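For this to work efficiently, you will need the following indexes:
big_table_70m (esat_last_modified)
big_table_70m (task_last_modified)
other_table_50m (esat_last_modified)
other_table_50m (last_modified)
You can add other columns to the key or to the INCLUDE of these indexes, but the columns listed above should come first, and they should not be merged into a single index.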
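As a rough sketch, the indexes listed above could be created as follows. The index names are made up for illustration, and some of these indexes may already exist under other names (the plan in the question shows hash-like index names on other_table_50m), so check the existing definitions first instead of creating duplicates.

```sql
-- Hypothetical index names. IF NOT EXISTS only checks the name, not the definition.
-- CONCURRENTLY avoids blocking writes while building, but cannot run inside a transaction block.
create index concurrently if not exists big_table_70m_esat_last_modified_idx
    on big_table_70m (esat_last_modified);

create index concurrently if not exists big_table_70m_task_last_modified_idx
    on big_table_70m (task_last_modified);

create index concurrently if not exists other_table_50m_esat_last_modified_idx
    on other_table_50m (esat_last_modified);

create index concurrently if not exists other_table_50m_last_modified_idx
    on other_table_50m (last_modified);

-- Optional variant (assumption: PrimaryKeyHere stands for the real primary key column):
-- carrying the key in INCLUDE may let each UNION branch use an index-only scan.
-- create index concurrently big_table_70m_esat_last_modified_pk_idx
--     on big_table_70m (esat_last_modified) include (PrimaryKeyHere);
```

The date column stays first in each index, in line with the advice above, and each column gets its own index instead of being combined into one multi-column index.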