The query:
SELECT COUNT(*) as count_all,
posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id;
returns n records in PostgreSQL:
count_all | post_id
-----------+---------
1 | 6
3 | 4
3 | 5
3 | 1
1 | 9
1 | 10
(6 rows)
I only want to retrieve the number of records returned: 6.
I used a subquery to get what I want, but this doesn't seem optimal:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) as x;
How can I get the number of records in this context in PostgreSQL?
I think you just need COUNT(DISTINCT post_id) FROM votes.
See the section "4.2.7. Aggregate Expressions" in https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES.
EDIT: Corrected my careless mistake per Erwin's comment.
There is also EXISTS:
SELECT count(*) AS post_ct
FROM posts p
WHERE EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);
In Postgres, and with multiple entries on the n-side like you probably have, it's generally faster than count(DISTINCT post_id):
SELECT count(DISTINCT p.id) AS post_ct
FROM posts p
JOIN votes v ON v.post_id = p.id;
The more rows per post there are in votes, the bigger the difference in performance. Test with EXPLAIN ANALYZE.
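count(DISTINCT post_id) has to read all rows, sort or hash them, and then consider only the first row of each identical set. EXISTS only scans votes (or, preferably, an index on post_id) until the first match is found.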
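To see the difference described above on your own data, the two variants can be run side by side under EXPLAIN ANALYZE. This is a minimal sketch, assuming the posts and votes tables from the question; SELECT 1 is used inside EXISTS only so that the statement also parses on older versions, since the select list is irrelevant there.

```sql
-- Variant 1: semi-join, can stop at the first matching vote per post
EXPLAIN ANALYZE
SELECT count(*) AS post_ct
FROM   posts p
WHERE  EXISTS (SELECT 1 FROM votes v WHERE v.post_id = p.id);

-- Variant 2: join first, then de-duplicate all joined rows
EXPLAIN ANALYZE
SELECT count(DISTINCT p.id) AS post_ct
FROM   posts p
JOIN   votes v ON v.post_id = p.id;
```

Compare the reported total runtimes over several runs rather than a single one, since caching affects the first execution.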
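If every post_id in votes is guaranteed to be present in the table posts (referential integrity enforced with a foreign key constraint), this short form is equivalent to the longer form: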
SELECT count(DISTINCT post_id) AS post_ct
FROM votes;
It may actually be faster than the EXISTS query when there are no or few entries per post.
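Whether the short form is safe depends on the referential-integrity assumption above. If no foreign key is declared, a check along these lines can confirm it; this is a sketch, and the column alias orphan_votes is made up for illustration.

```sql
-- Votes whose post_id has no matching row in posts.
-- A result of 0 means the short form counts the same posts as the long form.
SELECT count(*) AS orphan_votes
FROM   votes v
WHERE  v.post_id IS NOT NULL   -- count(DISTINCT ...) ignores NULLs anyway
AND    NOT EXISTS (SELECT 1 FROM posts p WHERE p.id = v.post_id);
```

With a foreign key constraint from votes.post_id to posts.id in place, this check is redundant because the database already enforces it.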
The query you had works in a simpler form, too:
SELECT count(*) AS post_ct
FROM (
SELECT FROM posts
JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) sub;
To verify my claims I ran a benchmark on my test server with limited resources. All in a separate schema:
Fake a typical post / vote situation:
CREATE SCHEMA y;
SET search_path = y;
CREATE TABLE posts (
id int PRIMARY KEY
, post text
);
INSERT INTO posts
SELECT g, repeat(chr(g%100 + 32), (random()* 500)::int) -- random text
FROM generate_series(1,10000) g;
DELETE FROM posts WHERE random() > 0.9; -- create ~ 10 % dead tuples
CREATE TABLE votes (
vote_id serial PRIMARY KEY
, post_id int REFERENCES posts(id)
, up_down bool
);
INSERT INTO votes (post_id, up_down)
SELECT g.*
FROM (
SELECT ((random()* 21)^3)::int + 1111 AS post_id -- uneven distribution
, random()::int::bool AS up_down
FROM generate_series(1,70000)
) g
JOIN posts p ON p.id = g.post_id;
All of the following queries returned the same result (8093 of 9107 posts had votes).
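One way to check that the three forms agree is to evaluate them in a single statement. This is a sketch against the test schema above; the column aliases are made up for illustration, and the exact counts will differ from run to run because the setup uses random().

```sql
SELECT (SELECT count(*)
        FROM   posts p
        WHERE  EXISTS (SELECT 1 FROM votes v WHERE v.post_id = p.id)) AS exists_ct
     , (SELECT count(DISTINCT p.id)
        FROM   posts p
        JOIN   votes v ON v.post_id = p.id)                           AS distinct_join_ct
     , (SELECT count(DISTINCT post_id)
        FROM   votes)                                                 AS distinct_short_ct;
```

All three columns should show the same number as long as the foreign key on votes.post_id holds.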
I ran 4 tests with EXPLAIN ANALYZE and took the best of five on Postgres 9.1.4 for each of the three queries, appending the resulting total runtimes.
1. As is.
2. After:
ANALYZE posts;
ANALYZE votes;
3. After:
CREATE INDEX foo on votes(post_id);
4. After:
VACUUM FULL ANALYZE posts;
CLUSTER votes using foo;
count(*) ... WHERE EXISTS
count(DISTINCT x) - long form with join
count(DISTINCT x) - short form without join
Best time for the original query in the question:
For the simplified version:
@wildplasser's query with a CTE uses the same plan as the long form (index scan on posts, index scan on votes, merge join) plus a little overhead for the CTE. Best time:
Index-only scans in the upcoming PostgreSQL 9.2 can improve the result for each of these queries, most of all for EXISTS.
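To check whether an index-only scan is actually chosen, the plan can be inspected after vacuuming. This is a sketch, assuming PostgreSQL 9.2 or later and the index foo on votes(post_id) created in the benchmark above.

```sql
-- Index-only scans rely on the visibility map, which VACUUM keeps up to date.
VACUUM ANALYZE votes;

EXPLAIN
SELECT count(*) AS post_ct
FROM   posts p
WHERE  EXISTS (SELECT 1 FROM votes v WHERE v.post_id = p.id);
-- Look for an "Index Only Scan" node on votes in the plan output.
```

The planner may still prefer a different plan depending on table size and statistics, so the absence of that node is not necessarily a problem.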
Related, more detailed benchmark for Postgres 9.5 (actually retrieving distinct rows, not just counting):
Using OVER() and LIMIT 1:
SELECT COUNT(1) OVER()
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
LIMIT 1;
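One difference from the subquery form is worth knowing: when no post has any votes, the query above returns no row at all, whereas count(*) over a subquery returns 0. If a 0 is needed in that case, a possible workaround (a sketch, not part of the original answer) is to wrap it in a scalar subquery with COALESCE:

```sql
-- A scalar subquery that yields no row evaluates to NULL,
-- which COALESCE then turns into 0.
SELECT COALESCE(
         (SELECT count(*) OVER ()
          FROM   posts
          JOIN   votes ON votes.post_id = posts.id
          GROUP  BY posts.id
          LIMIT  1)
       , 0) AS post_ct;
```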
WITH uniq AS (
SELECT DISTINCT posts.id as post_id
FROM posts
JOIN votes ON votes.post_id = posts.id
-- GROUP BY not needed anymore
-- GROUP BY posts.id
)
SELECT COUNT(*)
FROM uniq;
For followers, I like the OP's inner-query approach:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) as x;
Because then you can also use HAVING in there:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id HAVING count(*) > 1
) as x;
Or the equivalent CTE:
with posts_coalesced as (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id )
select count(*) from posts_coalesced;
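The CTE above mirrors the plain inner query. The HAVING filter from the earlier variant carries over in the same way; the following is a sketch that reuses the CTE name posts_coalesced and counts only posts with more than one vote.

```sql
WITH posts_coalesced AS (
   SELECT count(*) AS count_all, posts.id AS post_id
   FROM   posts
   JOIN   votes ON votes.post_id = posts.id
   GROUP  BY posts.id
   HAVING count(*) > 1   -- keep only posts with more than one vote
   )
SELECT count(*)
FROM   posts_coalesced;
```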