I have tables for users, posts, and subscriptions, where a user can subscribe to other users and see their recent posts.
Here's the query:
SELECT "blog_posts".*
FROM "blog_posts"
WHERE (user_id IN (SELECT subscribed_to_id FROM subscriptions WHERE subscriber_id = ?))
ORDER BY "blog_posts"."created_at" LIMIT 50
The query fetches the users the current user subscribes to (subscribed_to_id), then loads those users' 50 most recent posts.
This seems like a variant of the greatest-n-per-group problem, but I haven't had much luck with it, and I feel like I must be missing something obvious.
Schema:
Posts:
CREATE TABLE blog_posts (
id bigint DEFAULT nextval('blog_posts_id_seq'::regclass) PRIMARY KEY,
message character varying(280) NOT NULL,
user_id bigint NOT NULL REFERENCES users(id),
created_at timestamp(6) without time zone NOT NULL,
updated_at timestamp(6) without time zone NOT NULL
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX blog_posts_pkey ON blog_posts(id int8_ops);
CREATE INDEX index_posts_on_user_id ON blog_posts(user_id int8_ops);
CREATE INDEX idx_posts_user_created ON blog_posts(user_id int8_ops, created_at timestamp_ops DESC);
Users:
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email text NOT NULL,
created_at timestamp(6) without time zone NOT NULL,
updated_at timestamp(6) without time zone NOT NULL
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX users_pkey ON users(id int8_ops);
Subscriptions:
CREATE TABLE subscriptions (
id BIGSERIAL PRIMARY KEY,
subscriber_id bigint REFERENCES users(id),
subscribed_to_id bigint REFERENCES users(id),
created_at timestamp(6) without time zone NOT NULL,
updated_at timestamp(6) without time zone NOT NULL
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX subscriptions_pkey ON subscriptions(id int8_ops);
CREATE INDEX index_subscriptions_on_subscribed_to_id ON subscriptions(subscribed_to_id int8_ops);
CREATE UNIQUE INDEX index_subscriptions_on_subscriber_id_and_subscribed_to_id ON subscriptions(subscriber_id int8_ops,subscribed_to_id int8_ops);
CREATE INDEX index_subscriptions_on_subscriber_id ON subscriptions(subscriber_id int8_ops);
This query gets slow as the number of posts grows. Right now my test database has 10,000 users with 4,000 posts each. When a user subscribes to 5 other users, the query planner uses my
index_posts_on_user_id
index to load every post from each of those users, then performs the ORDER BY in memory to get the latest 50 results.
As the number of posts per user grows, or the number of subscriptions grows, this will only keep getting slower.
Here's a sample query plan: https://explain.dalibo.com/plan/gad3g247adbc7gf5
Limit (cost=74084.78..74084.90 rows=50 width=57) (actual time=119.681..119.696 rows=50 loops=1)
Buffers: shared hit=16 read=28029 written=3
-> Sort (cost=74084.78..74134.70 rows=19971 width=57) (actual time=119.679..119.687 rows=50 loops=1)
Sort Key: blog_posts.created_at
Sort Method: top-N heapsort Memory: 37kB
Buffers: shared hit=16 read=28029 written=3
-> Nested Loop (cost=51.68..73421.35 rows=19971 width=57) (actual time=1.389..112.820 rows=28000 loops=1)
Buffers: shared hit=16 read=28029 written=3
-> Index Only Scan using index_subscriptions_on_subscriber_id_and_subscribed_to_id on subscriptions subscriptions (cost=0.29..4.38 rows=5 width=8) (actual time=0.535..0.549 rows=7 loops=1)
Index Cond: (subscriptions.subscriber_id = 9999)
Buffers: shared hit=1 read=2
-> Bitmap Heap Scan on blog_posts blog_posts (cost=51.39..14643.46 rows=3994 width=57) (actual time=1.295..15.097 rows=4000 loops=7)
Recheck Cond: (blog_posts.user_id = subscriptions.subscribed_to_id)
Heap Blocks: exact=28000
Buffers: shared hit=15 read=28027 written=3
-> Bitmap Index Scan on index_posts_on_user_id (cost=0.00..50.39 rows=3994 width=0) (actual time=0.693..0.693 rows=4000 loops=7)
Index Cond: (blog_posts.user_id = subscriptions.subscribed_to_id)
Buffers: shared hit=7 read=35
Planning:
Buffers: shared hit=24 read=14
Execution time: 119.728 ms
The user in question has 7 subscriptions, so the bitmap index scan step loads 28,000 rows.
I tried extending the index to include
created_at
, but it wasn't used, since Postgres still has to load every row for each user to determine which ones are the most recent.
Then I tried a lateral cross join:
SELECT b.*
FROM subscriptions s
CROSS JOIN LATERAL (
SELECT *
FROM blog_posts b
WHERE s.subscriber_id = 100
AND b.user_id = s.subscribed_to_id
ORDER BY created_at DESC LIMIT 50
) b
ORDER BY created_at DESC LIMIT 50
Plan: https://explain.dalibo.com/plan/d0c7789a11hbh434
Limit (cost=10163257.73..10163257.85 rows=50 width=57) (actual time=31.224..31.238 rows=50 loops=1)
Buffers: shared hit=741
-> Sort (cost=10163257.73..10169483.10 rows=2490150 width=57) (actual time=31.222..31.230 rows=50 loops=1)
Sort Key: b.created_at DESC
Sort Method: top-N heapsort Memory: 35kB
Buffers: shared hit=741
-> Nested Loop (cost=0.56..10080536.74 rows=2490150 width=57) (actual time=0.296..31.139 rows=300 loops=1)
Buffers: shared hit=741
-> Seq Scan on subscriptions s (cost=0.00..914.03 rows=49803 width=16) (actual time=0.007..4.748 rows=49803 loops=1)
Buffers: shared hit=416
-> Limit (cost=0.56..201.39 rows=50 width=57) (actual time=0.000..0.000 rows=0 loops=49803)
Buffers: shared hit=325
-> Result (cost=0.56..16042.46 rows=3994 width=57) (actual time=0.000..0.000 rows=0 loops=49803)
Buffers: shared hit=325
-> Index Scan using idx_posts_user_created on blog_posts b (cost=0.56..16042.46 rows=3994 width=57) (actual time=0.007..0.041 rows=50 loops=6)
Index Cond: (b.user_id = s.subscribed_to_id)
Buffers: shared hit=325
Execution time: 31.267 ms
This cuts down the number of rows read from the blog posts table and appears to be faster, but it loads ~50k rows from the subscriptions table, and the estimated cost attached to that is very high.
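Presumably the subscriptions seq scan could be narrowed by filtering in the outer query rather than inside the lateral subquery, so the planner can use index_subscriptions_on_subscriber_id (a sketch, not yet profiled):

```sql
-- Sketch: the subscriber filter moved to the outer WHERE clause, so the
-- scan of subscriptions can be an index scan instead of a seq scan.
SELECT b.*
FROM subscriptions s
CROSS JOIN LATERAL (
  SELECT *
  FROM blog_posts b
  WHERE b.user_id = s.subscribed_to_id
  ORDER BY created_at DESC LIMIT 50
) b
WHERE s.subscriber_id = 100
ORDER BY created_at DESC LIMIT 50;
```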
Is there a way, using SQL only, to optimize this so that the number of rows read stays reasonably bounded while still fetching the latest 50 posts?
You specified "using SQL only", and I think the answer may be "no". There is a viable optimization, but a plain lateral join can't reach it, as it requires the query to be written in a more verbose way. You would need to run a query against the subscriptions table first, then use the results to build something like this:
SELECT "blog_posts".* FROM (
(select * from "blog_posts" where user_id=17 order by created_at limit 50)
union all
(select * from "blog_posts" where user_id=23 order by created_at limit 50)
union all
(select * from "blog_posts" where user_id=42 order by created_at limit 50)
-- union all ... and so on
) t order by created_at limit 50
This requires an index made up of (or starting with) the columns
(user_id, created_at)
, and it will execute each branch of the UNION ALL in an interleaved fashion, "merge appending" the results and stopping as soon as it has fetched enough rows.
You could define a set-returning function that runs the "inner" query, uses the results to dynamically assemble this monster query (the
format()
function can help do that safely), then executes it with dynamic SQL and returns the results. That's probably your best option, but calling it just SQL is arguably a bit of a cheat.
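A rough sketch of such a function, assuming the schema above (the function name get_latest_subscribed_posts is made up, and the branch shape mirrors the UNION ALL query above):

```sql
-- Hypothetical set-returning function: builds one UNION ALL branch per
-- subscription with format(), then runs the whole thing as dynamic SQL.
CREATE OR REPLACE FUNCTION get_latest_subscribed_posts(p_subscriber_id bigint)
RETURNS SETOF blog_posts
LANGUAGE plpgsql STABLE AS $$
DECLARE
  branches text;
BEGIN
  -- %s is safe here because subscribed_to_id is a bigint; use %L for
  -- anything less constrained.
  SELECT string_agg(
           format('(SELECT * FROM blog_posts WHERE user_id = %s ORDER BY created_at LIMIT 50)',
                  subscribed_to_id),
           ' UNION ALL ')
    INTO branches
    FROM subscriptions
   WHERE subscriber_id = p_subscriber_id;

  IF branches IS NULL THEN
    RETURN;  -- the user subscribes to no one
  END IF;

  RETURN QUERY EXECUTE
    'SELECT * FROM (' || branches || ') t ORDER BY created_at LIMIT 50';
END;
$$;
```

Called as `SELECT * FROM get_latest_subscribed_posts(9999);` each branch can then walk the (user_id, created_at) index independently and stop early.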
For the last plan you linked to in the comments, you could probably improve things further by turning the final index scan into an index-only scan. That requires selecting only the columns you need (rather than
*
), then including all of those columns in the index. It still won't scale the way the more elaborate dynamically constructed query does, though, and having that many columns in the index, while keeping the table's visibility map well-enough vacuumed for it to stay effective, could be awkward.
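For example, on Postgres 11 or later an INCLUDE clause can carry the payload columns without making them part of the key (a sketch; the column list is illustrative, and it must cover every column the query actually selects):

```sql
-- Sketch: a covering index so the per-user scans can be index-only.
-- (id, message) in the INCLUDE list is illustrative, not exhaustive.
CREATE INDEX idx_posts_user_created_covering
  ON blog_posts (user_id, created_at DESC)
  INCLUDE (id, message);

-- Index-only scans also rely on an up-to-date visibility map:
VACUUM (ANALYZE) blog_posts;
```

The query would then select b.id, b.user_id, b.created_at, b.message instead of *.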