Composite index with a date function does not allow index-only scans?


I am trying to use a composite (multicolumn) index on a table to help produce daily report counts. I am using Postgres 13, and my table looks like this:

CREATE TABLE inquiries (
    id bigint NOT NULL,
    identity_id bigint NOT NULL,
    received_at timestamp(0) without time zone NOT NULL,
    purpose_id bigint NOT NULL,
    location_id bigint NOT NULL
);

CREATE INDEX "inquiries_DATE_index" ON inquiries USING btree
   (date(received_at), location_id, purpose_id, identity_id);

My query looks like this:

SELECT DATE(received_at), location_id, purpose_id, COUNT(DISTINCT identity_id)
FROM inquiries
WHERE (DATE(received_at) >= $1)
  AND (DATE(received_at) <= $2)
GROUP BY 1, 2, 3

The EXPLAIN output looks like this:

GroupAggregate  (cost=43703.28..45785.49 rows=10950 width=19)
  Group Key: (date(received_at)), location_id, purpose_id
  ->  Sort  (cost=43703.28..44092.34 rows=155627 width=16)
        Sort Key: (date(received_at)), location_id, purpose_id
        ->  Bitmap Heap Scan on inquiries  (cost=5243.60..27622.21 rows=155627 width=16)
              Recheck Cond: ((date(received_at) >= '2023-11-01'::date) AND (date(received_at) <= '2023-11-30'::date))
              ->  Bitmap Index Scan on "inquiries_DATE_index"  (cost=0.00..5204.70 rows=155627 width=0)
                    Index Cond: ((date(received_at) >= '2023-11-01'::date) AND (date(received_at) <= '2023-11-30'::date))

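The index does not seem to help much, and the query takes a long time to execute. If I add a date column to the table and use that instead of date(received_at), the query performs much better and the query plan changes to: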

GroupAggregate  (cost=0.43..85199.58 rows=10980 width=19)
  Group Key: pacific_date, location_id, purpose_id
  ->  Index Only Scan using inquiries_pacific_date_index on inquiries  (cost=0.43..77813.12 rows=727666 width=16)
        Index Cond: ((pacific_date >= '2023-11-01'::date) AND (pacific_date <= '2023-11-30'::date))

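I suppose I can do that if I can't find a better way, but it seems somewhat redundant. Is there a way to write the original query so that it makes better use of the index?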

sql postgresql indexing query-optimization postgresql-performance
2 Answers

1 vote

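The problem is that, because of a limitation in PostgreSQL, it will not use an index-only scan on this index, because the index contains an expression. You would have to add received_at to the index (in addition to date(received_at)) to make that work.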
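
As a rough sketch of what that could look like, the bare column can be appended as a trailing key column. The index name inquiries_date_received_at_idx is made up for illustration, and the column order simply follows the index from the question:

```sql
-- Hypothetical index name; column order taken from the question's index.
-- The trailing received_at makes the raw column available from the index,
-- which the planner requires before it considers an index-only scan.
CREATE INDEX inquiries_date_received_at_idx ON inquiries USING btree
   (date(received_at), location_id, purpose_id, identity_id, received_at);
```

Whether the planner then actually picks an index-only scan still depends on statistics and on how much of the table is marked all-visible, so check the plan with EXPLAIN.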


1 vote

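Direct fix

Like Laurenz explained, index-only scans are currently (as of Postgres 16) limited by a corner case in Postgres. The manual:

However, PostgreSQL's planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns needed by the query are available from the index.

The manual has more details. One workaround is to "include" the column itself in the index (replacing the old one):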

CREATE INDEX inquiries_date_plus_idx ON inquiries
   (date(received_at), location_id, purpose_id, identity_id) INCLUDE (received_at);

This allows index-only scans for your original query. But it also increases the size of the index: by 8 bytes per row in your case.
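
One way to check both claims on your own data is sketched below. It assumes both indexes exist side by side for the comparison, and it uses the literal dates from the EXPLAIN output in the question in place of $1 and $2:

```sql
-- Index-only scans rely on the visibility map, so vacuum first.
VACUUM (ANALYZE) inquiries;

-- Look for "Index Only Scan" and a low "Heap Fetches" count in the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date(received_at), location_id, purpose_id, count(DISTINCT identity_id)
FROM   inquiries
WHERE  date(received_at) >= date '2023-11-01'
AND    date(received_at) <= date '2023-11-30'
GROUP  BY 1, 2, 3;

-- Compare the on-disk size of the old and the new index.
SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS index_size
FROM   pg_class
WHERE  relname IN ('inquiries_DATE_index', 'inquiries_date_plus_idx');
```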


Better

Create the index on the bare column, without an expression:

CREATE INDEX inquiries_received_at_plus_idx ON inquiries
   (received_at, location_id, purpose_id, identity_id);

And adapt your query slightly, so that it is exactly equivalent:

SELECT received_at::date, location_id, purpose_id, COUNT(DISTINCT identity_id)
FROM   inquiries
WHERE  received_at >= $1
AND    received_at <  $2 + 1  -- !
GROUP  BY 1, 2, 3;

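The inputs $1 and $2 must be of type date, and received_at of type timestamp, as in the question. Adding the integer 1 to a date yields the next day, so the half-open range covers all of the last day.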
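
To see why the rewritten range matches the original one, the boundary cases can be checked directly. This is a small sketch that uses 2023-11-30 from the question's EXPLAIN output as a stand-in for $2:

```sql
-- date + integer adds whole days and returns a date.
SELECT date '2023-11-30' + 1 AS upper_bound;                       -- 2023-12-01

-- The last second of the final day is still inside the range ...
SELECT timestamp '2023-11-30 23:59:59' < date '2023-11-30' + 1;    -- true

-- ... while midnight of the following day is excluded.
SELECT timestamp '2023-12-01 00:00:00' < date '2023-11-30' + 1;    -- false
```

If $2 were passed as a timestamp instead, the expression $2 + 1 would fail, since there is no timestamp + integer operator.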

In my experience, count(DISTINCT col) is typically slow. This may be faster:

EXPLAIN
SELECT received_at::date, location_id, purpose_id, count(*) AS dist_identities
FROM  (
   SELECT DISTINCT ON (1,2,3,4)
          received_at::date, location_id, purpose_id, identity_id
   FROM   inquiries
   WHERE  received_at >= $1
   AND    received_at <  $2 + 1
   ) sub
GROUP  BY 1, 2, 3;


If there are many duplicates per (received_at::date, location_id, purpose_id, identity_id), an emulated index skip scan may be much faster.
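
The answer does not spell out the emulation. A common way to do it in Postgres is a recursive CTE that repeatedly jumps to the next distinct combination through the index. The following is only a sketch under assumptions: it relies on the expression index on (date(received_at), location_id, purpose_id, identity_id) from the question, uses literal dates in place of $1 and $2, and has not been benchmarked against the DISTINCT ON variant:

```sql
WITH RECURSIVE cte AS (
   (  -- anchor: smallest combination inside the date range
   SELECT date(received_at) AS day, location_id, purpose_id, identity_id
   FROM   inquiries
   WHERE  date(received_at) >= date '2023-11-01'
   AND    date(received_at) <= date '2023-11-30'
   ORDER  BY 1, 2, 3, 4
   LIMIT  1
   )
   UNION ALL
   SELECT n.*
   FROM   cte c
   CROSS  JOIN LATERAL (  -- step: next distinct combination after the current one
      SELECT date(i.received_at), i.location_id, i.purpose_id, i.identity_id
      FROM   inquiries i
      WHERE  (date(i.received_at), i.location_id, i.purpose_id, i.identity_id)
           > (c.day, c.location_id, c.purpose_id, c.identity_id)
      AND    date(i.received_at) <= date '2023-11-30'
      ORDER  BY 1, 2, 3, 4
      LIMIT  1
      ) n
   )
-- Every row in cte is one distinct combination, so count(*) counts identities.
SELECT day, location_id, purpose_id, count(*) AS dist_identities
FROM   cte
GROUP  BY 1, 2, 3;
```

Each step costs one index lookup, so this only pays off when the number of distinct combinations is small compared to the number of rows in the range.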

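Upgrade

Over the past few years, every major version of Postgres has improved performance for big data. Consider upgrading to the latest version, Postgres 16 at the time of writing. That should give you an immediate, additional boost.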
