Optimizing a simple but slow query with an OR condition


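Some context: I am running the following simple select statement on a Postgres 15.3 server with 128GB of memory and, contrary to what I expected, it takes about 6 minutes. The statement involves two relations: big_table_70m has roughly 70M rows, and other_table_50m has slightly fewer than 50M rows. In both tables I have btree indexes on esat_last_modified and last_modified.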

This takes 6 minutes to return:

select
    count(*)
FROM
    big_table_70m
where
    big_table_70m.esat_last_modified > (
        select
            max(esat_last_modified)
        from
            other_table_50m
    )
    OR big_table_70m.task_last_modified > (
        select
            max(last_modified)
        from
            other_table_50m
    );

However, when I remove the OR and run the query with either one of the two conditions on its own, it returns very quickly (>70ms):

select
    count(*)
FROM
    big_table_70m
where
    big_table_70m.esat_last_modified > (
        select
            max(esat_last_modified)
        from
            other_table_50m
    );

I tried adding a multi-column index on (esat_last_modified, last_modified) to make it more efficient, but without success.

Can you help me improve this query? The final count returned should be a few thousand rows.

EXPLAIN ANALYZE output:

                                                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1765894.63..1765894.64 rows=1 width=8) (actual time=403477.117..403477.314 rows=1 loops=1)
   Buffers: shared hit=77762962 read=848381 dirtied=27535
   I/O Timings: shared/local read=380517.629
   InitPlan 2 (returns $1)
     ->  Result  (cost=0.70..0.71 rows=1 width=8) (actual time=0.028..0.029 rows=1 loops=1)
           Buffers: shared hit=6
           InitPlan 1 (returns $0)
             ->  Limit  (cost=0.56..0.70 rows=1 width=8) (actual time=0.026..0.027 rows=1 loops=1)
                   Buffers: shared hit=6
                   ->  Index Only Scan Backward using f9b1917d68a5656ca14b2f5e12447c6a on other_table_50m  (cost=0.56..4671638.13 rows=33810205 width=8) (actual time=0.025..0.026 rows=1 loops=1)
                         Index Cond: (esat_last_modified IS NOT NULL)
                         Heap Fetches: 1
                         Buffers: shared hit=6
   InitPlan 4 (returns $3)
     ->  Result  (cost=0.67..0.68 rows=1 width=8) (actual time=0.013..0.014 rows=1 loops=1)
           Buffers: shared hit=5
           InitPlan 3 (returns $2)
             ->  Limit  (cost=0.56..0.67 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=1)
                   Buffers: shared hit=5
                   ->  Index Only Scan Backward using "429b05d0a626f21b4708244ab67da408" on other_table_50m other_table_50m_1  (cost=0.56..5092889.03 rows=47176591 width=8) (actual time=0.013..0.013 rows=1 loops=1)
                         Index Cond: (last_modified IS NOT NULL)
                         Heap Fetches: 1
                         Buffers: shared hit=5
   ->  Gather  (cost=1765893.02..1765893.23 rows=2 width=8) (actual time=403477.112..403477.306 rows=1 loops=1)
         Workers Planned: 2
         Params Evaluated: $1, $3
         Workers Launched: 0
         Buffers: shared hit=77762962 read=848381 dirtied=27535
         I/O Timings: shared/local read=380517.629
         ->  Partial Aggregate  (cost=1764893.02..1764893.03 rows=1 width=8) (actual time=403476.429..403476.429 rows=1 loops=1)
               Buffers: shared hit=77762951 read=848381 dirtied=27535
               I/O Timings: shared/local read=380517.629
               ->  Parallel Index Only Scan using multi_column_index on big_table_70m  (cost=0.57..1724191.80 rows=16280489 width=0) (actual time=403476.425..403476.425 rows=0 loops=1)
                     Filter: ((esat_last_modified > $1) OR (task_last_modified > $3))
                     Rows Removed by Filter: 70394276
                     Heap Fetches: 6894859
                     Buffers: shared hit=77762951 read=848381 dirtied=27535
                     I/O Timings: shared/local read=380517.629
 Planning:
   Buffers: shared hit=352
 Planning Time: 0.949 ms
 Execution Time: 403477.593 ms
(42 rows)

Here is the depesz link. It shows that a large share of the time is due to I/O (reads from disk).

I added a multi-column index on big_table_70m, without success. I also rewrote the select to use CTEs, but apparently that did not change the plan at all.

I expected the statement to return about as fast as it does without the OR condition.

sql postgresql performance query-optimization postgresql-15
1 answer

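OR is often hard for the optimizer to handle: it cannot simply use an index on one of the columns, because that index says nothing about the condition on the other column.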

Sometimes it can do an index union, but often it cannot work that out by itself, so you need to give it some help.

Using a UNION on the table's primary key should, in theory, allow it to use an efficient merge union, after which you aggregate again.

select
    count(*)
from (
    select bg.PrimaryKeyHere
    from big_table_70m bg
    where
        bg.esat_last_modified > (
            select
                max(o.esat_last_modified)
            from
                other_table_50m o
        )

    union

    select bg.PrimaryKeyHere
    from big_table_70m bg
    where
        bg.task_last_modified > (
            select
                max(o.last_modified)
            from
                other_table_50m o
        )
) bg;
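
To check whether the rewrite actually changes the plan, one option is to prefix it with EXPLAIN (ANALYZE, BUFFERS), as the question did for the original query. The sketch below assumes a primary key column named id, which is a hypothetical stand-in for PrimaryKeyHere above. Note that EXPLAIN ANALYZE really executes the statement. Ideally each branch shows its own index scan rather than one scan that discards roughly 70M rows in a filter, though the planner is free to choose differently.

```sql
-- Sketch only: "id" is an assumed primary key column; substitute the real one.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM (
    SELECT bg.id
    FROM big_table_70m bg
    WHERE bg.esat_last_modified > (
        SELECT max(o.esat_last_modified) FROM other_table_50m o
    )

    UNION  -- removes duplicates, so a row matching both conditions is counted once

    SELECT bg.id
    FROM big_table_70m bg
    WHERE bg.task_last_modified > (
        SELECT max(o.last_modified) FROM other_table_50m o
    )
) matched;
```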

For this to work efficiently, you will need the following indexes:

big_table_70m (esat_last_modified)
big_table_70m (task_last_modified)
other_table_50m (esat_last_modified)
other_table_50m (last_modified)
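
As a rough sketch, the four indexes listed above could be created as follows. The index names are made up for illustration, and the question mentions that some btree indexes already exist, so any statement that duplicates an existing index can be skipped.

```sql
-- Index names are illustrative, not taken from the question.
-- Skip any statement whose index already exists under another name.
CREATE INDEX big_table_70m_esat_last_modified_idx
    ON big_table_70m (esat_last_modified);

CREATE INDEX big_table_70m_task_last_modified_idx
    ON big_table_70m (task_last_modified);

CREATE INDEX other_table_50m_esat_last_modified_idx
    ON other_table_50m (esat_last_modified);

CREATE INDEX other_table_50m_last_modified_idx
    ON other_table_50m (last_modified);
```

On a live table, CREATE INDEX CONCURRENTLY avoids blocking writes while the index is built, but it cannot be run inside a transaction block.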

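You can add other columns to the key or to the INCLUDE of each index, but the columns listed above should come first, and they should not be combined into a single index.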
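
One possible reading of that advice, sketched under the assumption that the primary key column is called id (a hypothetical name): keep the timestamp column as the leading key of each index and carry the primary key along in INCLUDE, so that each branch of the UNION may be answerable from its index alone when the visibility map is reasonably up to date.

```sql
-- "id" is a hypothetical primary key column; index names are illustrative.
-- The filtered column stays first; the primary key only rides along as payload.
CREATE INDEX big_table_70m_esat_last_modified_incl_idx
    ON big_table_70m (esat_last_modified) INCLUDE (id);

CREATE INDEX big_table_70m_task_last_modified_incl_idx
    ON big_table_70m (task_last_modified) INCLUDE (id);
```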
