I'm trying to use a composite (multicolumn) index on a table to help build daily report counts. I'm using Postgres 13, and my table looks like this:
CREATE TABLE inquiries (
id bigint NOT NULL,
identity_id bigint NOT NULL,
received_at timestamp(0) without time zone NOT NULL,
purpose_id bigint NOT NULL,
location_id bigint NOT NULL
);
CREATE INDEX "inquiries_DATE_index" ON inquiries USING btree
(date(received_at), location_id, purpose_id, identity_id);
My query looks like this:
SELECT DATE(received_at), location_id, purpose_id, COUNT(DISTINCT identity_id)
FROM inquiries
WHERE (DATE(received_at) >= $1)
AND (DATE(received_at) <= $2)
GROUP BY 1, 2, 3
The EXPLAIN output looks like this:
GroupAggregate (cost=43703.28..45785.49 rows=10950 width=19)
Group Key: (date(received_at)), location_id, purpose_id
-> Sort (cost=43703.28..44092.34 rows=155627 width=16)
Sort Key: (date(received_at)), location_id, purpose_id
-> Bitmap Heap Scan on inquiries (cost=5243.60..27622.21 rows=155627 width=16)
Recheck Cond: ((date(received_at) >= '2023-11-01'::date) AND (date(received_at) <= '2023-11-30'::date))
-> Bitmap Index Scan on "inquiries_DATE_index" (cost=0.00..5204.70 rows=155627 width=0)
Index Cond: ((date(received_at) >= '2023-11-01'::date) AND (date(received_at) <= '2023-11-30'::date))
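The index doesn't seem to help much, and the query takes a long time to execute. If I add a date column to the table and use it instead of date(received_at), the query performs much better and the query plan changes to: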
GroupAggregate (cost=0.43..85199.58 rows=10980 width=19)
Group Key: pacific_date, location_id, purpose_id
-> Index Only Scan using inquiries_pacific_date_index on inquiries (cost=0.43..77813.12 rows=727666 width=16)
Index Cond: ((pacific_date >= '2023-11-01'::date) AND (pacific_date <= '2023-11-30'::date))
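I suppose I could do that if I can't find a better way, but it seems a bit redundant. Is there a way to write the original query so that it makes better use of the index?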
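The problem is that, due to a PostgreSQL limitation, it won't use an index-only scan on this index because the index contains an expression. You have to add received_at to the index (in addition to date(received_at)) to make it work.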
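Like Laurenz explained, index-only scans are currently (Postgres 16) limited for this corner case. The manual:
However, PostgreSQL's planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns needed by the query are available from the index.
The manual has more details. One workaround is to add the column itself to the index with INCLUDE (replacing the old index):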
CREATE INDEX inquiries_date_plus_idx ON inquiries
(date(received_at), location_id, purpose_id, identity_id) INCLUDE (received_at);
This allows index-only scans for the original query. But it also increases the size of the index, by 8 bytes per row in your case.
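To check whether the planner actually picks an index-only scan once the new index is in place, something like the following could be run. It's only a sketch: the date literals are the ones from the plan in the question, substituted for $1 and $2, and the VACUUM step is an assumption about what is acceptable on your table. It is there because an index-only scan can only skip heap visits for pages marked all-visible in the visibility map.

```sql
-- Refresh the visibility map and planner statistics first
-- (assumption: running VACUUM on this table is acceptable).
VACUUM (ANALYZE) inquiries;

-- Original query, with literals in place of $1 and $2.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date(received_at), location_id, purpose_id, count(DISTINCT identity_id)
FROM   inquiries
WHERE  date(received_at) >= DATE '2023-11-01'
AND    date(received_at) <= DATE '2023-11-30'
GROUP  BY 1, 2, 3;
```

If the workaround takes effect, the plan should contain an Index Only Scan node on inquiries_date_plus_idx. A large "Heap Fetches" number in that node would suggest the visibility map is stale. Whether the planner chooses this plan still depends on its cost estimates.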
Alternatively, create the index on the bare column, without an expression:
CREATE INDEX inquiries_received_at_plus_idx ON inquiries
(received_at, location_id, purpose_id, identity_id);
And adjust your query slightly, so that it is exactly equivalent:
SELECT received_at::date, location_id, purpose_id, COUNT(DISTINCT identity_id)
FROM inquiries
WHERE received_at >= $1
AND received_at < $2 + 1 -- !
GROUP BY 1, 2, 3;
The inputs $1 and $2 must be of type date, and received_at of type timestamp, as in the question.
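To see why the upper bound becomes received_at < $2 + 1, here is a small check on two boundary values. It's a sketch with a literal date standing in for $2 and two made-up timestamps. It relies on date + integer yielding a date in Postgres, and on the date being compared to the timestamp as midnight of that day.

```sql
-- Compare the original filter with the rewritten one at the month boundary.
-- DATE '2023-11-30' stands in for $2; the two timestamps are illustrative values.
SELECT ts,
       date(ts) <= DATE '2023-11-30' AS original_filter,
       ts < DATE '2023-11-30' + 1    AS rewritten_filter
FROM  (VALUES (TIMESTAMP '2023-11-30 23:59:59'),
              (TIMESTAMP '2023-12-01 00:00:00')) AS t(ts);
```

Both columns should agree for each row: the last second of November 30 passes both filters, and midnight of December 1 fails both. The rewritten form compares the bare column, so it can use the plain index on received_at.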
In my experience, count(DISTINCT col) is typically slow. This may be faster:
EXPLAIN
SELECT received_at::date, location_id, purpose_id, count(*) AS dist_identities
FROM (
SELECT DISTINCT ON (1,2,3,4)
received_at::date, location_id, purpose_id, identity_id
FROM inquiries
WHERE received_at >= $1
AND received_at < $2 + 1
) sub
GROUP BY 1, 2, 3;
If there are many duplicates per (received_at::date, location_id, purpose_id, identity_id), an emulated index skip scan may be much faster.
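Every major Postgres release in the past few years has improved performance for big data. Consider upgrading to the latest version, Postgres 16 (at the time of writing). It should give you an immediate, additional boost.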