I am trying to optimize a long-running SQL query (PostgreSQL) on very large tables.
The query itself is very simple:
select pi2.*
from preval_item pi2
join preval_shipment ps on ps.pvs_shipping_number = pi2.pvs_shipping_number
where ps.pvs_shipping_date < to_date('09/01/2022', 'DD/MM/YYYY')
and ps.shs_id = '30';
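The preval_item table contains 46 million rows (and 39 GB of data); preval_shipment contains 340,000 rows (and 100 MB of data).
Both tables have the correct indexes.
The query takes more than 15 minutes.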
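Since the index definitions are not shown in the post, a reader cannot tell what "correct indexes" means here. A minimal sketch of how they could be listed, using the standard pg_indexes view and the table names from the query:

```sql
-- List every index defined on the two tables, with its full definition.
-- There is no schema filter: if the same table names exist in several
-- schemas, add a condition on schemaname.
SELECT schemaname, tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('preval_item', 'preval_shipment')
ORDER BY tablename, indexname;
```

Posting this output alongside the plan makes it much easier to judge whether an index matches the join and filter columns.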
I tried rewriting the query like this:
with pvs as materialized (
select pi2.*
from preval_item pi2
join preval_shipment ps on ps.pvs_shipping_number = pi2.pvs_shipping_number
where
ps.pvs_shipping_date < to_date('09/01/2022','DD/MM/YYYY')
and ps.shs_id = '30'
)
select pi.pvs_shipping_number
from preval_item pi
join pvs on pvs.pvs_shipping_number = pi.pvs_shipping_number;
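Without success.
What is interesting about this last query, though, is that the intermediate select (see with pvs as materialized ...) clearly takes only a few seconds. It really is the join between the two tables that is expensive.
When I run EXPLAIN, I get this (but I find it hard to interpret):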
Merge Join  (cost=13845931.39..950104909.06 rows=62349421560 width=11)
  Merge Cond: ((pvs.pvs_shipping_number)::text = (pi.pvs_shipping_number)::text)
  CTE pvs
    ->  Gather  (cost=16337.02..7189241.05 rows=29442834 width=821)
          Workers Planned: 2
          ->  Parallel Hash Join  (cost=15337.02..4243957.65 rows=12267848 width=821)
                Hash Cond: ((pi2.pvs_shipping_number)::text = (ps.pvs_shipping_number)::text)
                ->  Parallel Seq Scan on preval_item pi2  (cost=0.00..4177367.72 rows=19524672 width=821)
                ->  Parallel Hash  (cost=14234.83..14234.83 rows=88175 width=11)
                      ->  Parallel Seq Scan on preval_shipment ps  (cost=0.00..14234.83 rows=88175 width=11)
                            Filter: (((shs_id)::text = '30'::text) AND (pvs_shipping_date < to_date('09/01/2022'::text, 'DD/MM/YYYY'::text)))
  ->  Sort  (cost=6656689.78..6730296.87 rows=29442834 width=38)
        Sort Key: pvs.pvs_shipping_number
        ->  CTE Scan on pvs  (cost=0.00..588856.68 rows=29442834 width=38)
  ->  Materialize  (cost=0.56..987588.70 rows=46859212 width=11)
        ->  Index Only Scan using idx_shipping_number on preval_item pi  (cost=0.56..870440.67 rows=46859212 width=11)
Should I conclude that the data volume is the problem, and that there is no solution other than waiting?
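One way to approach that question: the plan shown is from plain EXPLAIN, so every number in it is an estimate. In that plan the planner expects 29,442,834 rows to come out of the CTE, against 46,859,212 rows in preval_item, which, if the estimates are accurate, would mean the filter keeps most of the table. A minimal sketch of how this could be checked against real numbers, using only the original query and standard EXPLAIN options:

```sql
-- How many shipments actually pass the filter? This touches only the
-- small table, so it should be cheap.
SELECT count(*) AS matching_shipments
FROM preval_shipment ps
WHERE ps.pvs_shipping_date < to_date('09/01/2022', 'DD/MM/YYYY')
  AND ps.shs_id = '30';

-- EXPLAIN alone prints estimates. ANALYZE actually runs the query and adds
-- real row counts and timings; BUFFERS adds page-level I/O counters.
-- Caution: this executes the full query, so it takes as long as the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT pi2.*
FROM preval_item pi2
JOIN preval_shipment ps ON ps.pvs_shipping_number = pi2.pvs_shipping_number
WHERE ps.pvs_shipping_date < to_date('09/01/2022', 'DD/MM/YYYY')
  AND ps.shs_id = '30';
```

If the actual row counts confirm that tens of millions of wide rows (width=821 in the plan) are returned, the time is likely dominated by reading and transferring that data, and selecting fewer columns or fewer rows may matter more than any index.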
Edit: constraint definitions. We can see the PK on "pvs_shipping_number".
The right indexes are:
CREATE INDEX X001 ON preval_shipment (shs_id, pvs_shipping_date, pvs_shipping_number);
CREATE INDEX X002 ON preval_item (pvs_shipping_number);
Use exactly this order of columns in the key, or swap shs_id and pvs_shipping_date, depending on the dispersion (selectivity) of the data.
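To decide between the two column orders, the dispersion of each column can be measured first. The following is only a sketch: the index name X001_ALT is hypothetical, and whether the swapped order helps depends on the actual data.

```sql
-- Measure how dispersed the two candidate leading columns are.
SELECT count(*)                          AS total_rows,
       count(DISTINCT shs_id)            AS distinct_shs_id,
       count(DISTINCT pvs_shipping_date) AS distinct_shipping_dates
FROM preval_shipment;

-- The swapped variant of X001 (hypothetical name X001_ALT).
CREATE INDEX X001_ALT ON preval_shipment (pvs_shipping_date, shs_id, pvs_shipping_number);

-- Refresh planner statistics so the new index is costed with current data.
ANALYZE preval_shipment;
ANALYZE preval_item;
```

As a general rule for B-tree indexes, a column tested with equality (here shs_id) usually works better in front of a column tested with a range (here pvs_shipping_date), so the original order of X001 is the more conventional starting point.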
Why do you compare ps.shs_id to '30' in the WHERE clause? Is it a string or a number?
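The filter line in the plan reads (shs_id)::text = '30'::text, which suggests a character type, but the declared types can be checked directly. A minimal sketch using the standard information_schema.columns view:

```sql
-- Show the declared type of the join and filter columns on both tables.
-- There is no schema filter: add a condition on table_schema if needed.
SELECT table_name, column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name IN ('preval_item', 'preval_shipment')
  AND column_name IN ('shs_id', 'pvs_shipping_date', 'pvs_shipping_number')
ORDER BY table_name, column_name;
```

If shs_id turns out to be numeric, the literal should be written as 30 rather than '30' so that no implicit conversion is involved.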