Optimizing this query on a very large table


I want to optimize a long-running SQL query (PostgreSQL) on a very large table.

The query is quite simple:

select pi2.*  
from preval_item pi2
join preval_shipment ps on ps.pvs_shipping_number = pi2.pvs_shipping_number
where ps.pvs_shipping_date < to_date('09/01/2022', 'DD/MM/YYYY')
  and ps.shs_id = '30';

preval_item contains 46 million rows (39 GB of data), and preval_shipment contains 340,000 rows (100 MB of data).

Both tables have the correct indexes.

The query takes more than 15 minutes.

I tried rewriting the query like this:

with pvs as materialized (
select pi2.*
from preval_item pi2
join preval_shipment ps on ps.pvs_shipping_number = pi2.pvs_shipping_number
where
    ps.pvs_shipping_date < to_date('09/01/2022','DD/MM/YYYY')
    and ps.shs_id = '30'
)
select pi.pvs_shipping_number
from preval_item pi
join pvs on pvs.pvs_shipping_number = pi.pvs_shipping_number;

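That did not help.

What is interesting about this last query, though, is that the intermediate select (the "with pvs as materialized ..." part) takes only a few seconds. It is the join between the two tables that is really expensive.

When I run EXPLAIN, I get this (but I find it hard to interpret):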

Merge Join  (cost=13845931.39..950104909.06 rows=62349421560 width=11)
  Merge Cond: ((pvs.pvs_shipping_number)::text = (pi.pvs_shipping_number)::text)
  CTE pvs
    ->  Gather  (cost=16337.02..7189241.05 rows=29442834 width=821)
          Workers Planned: 2
          ->  Parallel Hash Join  (cost=15337.02..4243957.65 rows=12267848 width=821)
                Hash Cond: ((pi2.pvs_shipping_number)::text = (ps.pvs_shipping_number)::text)
                ->  Parallel Seq Scan on preval_item pi2  (cost=0.00..4177367.72 rows=19524672 width=821)
                ->  Parallel Hash  (cost=14234.83..14234.83 rows=88175 width=11)
                      ->  Parallel Seq Scan on preval_shipment ps  (cost=0.00..14234.83 rows=88175 width=11)
                            Filter: (((shs_id)::text = '30'::text) AND (pvs_shipping_date < to_date('09/01/2022'::text, 'DD/MM/YYYY'::text)))
  ->  Sort  (cost=6656689.78..6730296.87 rows=29442834 width=38)
        Sort Key: pvs.pvs_shipping_number
        ->  CTE Scan on pvs  (cost=0.00..588856.68 rows=29442834 width=38)
  ->  Materialize  (cost=0.56..987588.70 rows=46859212 width=11)
        ->  Index Only Scan using idx_shipping_number on preval_item pi  (cost=0.56..870440.67 rows=46859212 width=11)

Should I conclude that the data volume is the problem and that there is no solution other than waiting?

Edit: constraint definitions. We can see the primary key on "pvs_shipping_number".

sql postgresql query-optimization
1 Answer

The right indexes are:

CREATE INDEX X001 ON preval_shipment (shs_id, pvs_shipping_date, pvs_shipping_number);
CREATE INDEX X002 ON preval_item (pvs_shipping_number);

Keep the columns in the key in exactly this order, or swap shs_id and pvs_shipping_date, depending on how the data is distributed.
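One way to decide between the two column orders is to measure how selective each predicate is on its own and then compare the real plans. The following is only a sketch: the index name X001_alt is hypothetical, and the literal values are copied from the query in the question. Note that EXPLAIN with ANALYZE actually executes the query, so on these tables it may run for a long time.

```sql
-- How many preval_shipment rows does each predicate keep on its own?
SELECT count(*) FILTER (WHERE shs_id = '30') AS rows_matching_shs_id,
       count(*) FILTER (WHERE pvs_shipping_date < to_date('09/01/2022', 'DD/MM/YYYY')) AS rows_matching_date,
       count(*) AS total_rows
FROM preval_shipment;

-- Hypothetical alternative to X001 with the two leading columns swapped.
CREATE INDEX X001_alt ON preval_shipment (pvs_shipping_date, shs_id, pvs_shipping_number);

-- Refresh planner statistics so the indexes are costed against current data.
ANALYZE preval_shipment;
ANALYZE preval_item;

-- Show the plan that is actually executed, with timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT pi2.*
FROM preval_item pi2
JOIN preval_shipment ps ON ps.pvs_shipping_number = pi2.pvs_shipping_number
WHERE ps.pvs_shipping_date < to_date('09/01/2022', 'DD/MM/YYYY')
  AND ps.shs_id = '30';
```

If the plan still shows a sequential scan on preval_item, that may simply reflect the estimate in the question's plan, where the join is expected to return about 29 million of the 46 million rows; when most of a table is needed, an index is unlikely to help.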

Why is ps.shs_id compared with '30' in the WHERE clause? Is it a string or a number?
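The declared type can be checked directly in the catalog. The cast (shs_id)::text in the plan from the question suggests a character type, but that is an inference, not something the question states. A minimal sketch using the standard information_schema view:

```sql
-- List the declared types of the columns used in the filter and the join.
SELECT table_name, column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name IN ('preval_shipment', 'preval_item')
  AND column_name IN ('shs_id', 'pvs_shipping_date', 'pvs_shipping_number')
ORDER BY table_name, column_name;
```

If shs_id turns out to be numeric, comparing it with the number 30 instead of the string '30' keeps the literal and the column the same type.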
