GROUP BY and COUNT in PostgreSQL

Question

The query:

SELECT COUNT(*) as count_all, 
       posts.id as post_id 
FROM posts 
  INNER JOIN votes ON votes.post_id = posts.id 
GROUP BY posts.id;

returns the following records in PostgreSQL:

 count_all | post_id
-----------+---------
 1         | 6
 3         | 4
 3         | 5
 3         | 1
 1         | 9
 1         | 10
(6 rows)

I only want to retrieve the number of records returned: 6.

I used a subquery to get what I want, but it does not seem optimal:

SELECT COUNT(*) FROM (
    SELECT COUNT(*) as count_all, posts.id as post_id 
    FROM posts 
    INNER JOIN votes ON votes.post_id = posts.id 
    GROUP BY posts.id
) as x;

How can I get the number of records in this context in PostgreSQL?

Answer 1

I think you just need `COUNT(DISTINCT post_id) FROM votes`.

See the section "4.2.7. Aggregate Expressions" in https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES.

EDIT: Corrected my careless mistake per Erwin's comment.
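
As a minimal sketch of the complete statement: the inline `VALUES` rows below are made up to reproduce the per-post vote counts shown in the question and stand in for the real `votes` table.

```sql
-- Hypothetical sample data: posts 1, 4 and 5 have three votes each,
-- posts 6, 9 and 10 have one vote each (as in the question's output).
WITH votes(post_id) AS (
   VALUES (1), (1), (1), (4), (4), (4), (5), (5), (5), (6), (9), (10)
)
SELECT count(DISTINCT post_id) AS post_ct  -- one per distinct post that has votes
FROM   votes;
-- post_ct = 6
```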

Answer 2

There is also `EXISTS`:

SELECT count(*) AS post_ct
FROM   posts p
WHERE  EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);

In Postgres, and with multiple entries on the n-side as you probably have, this is generally faster than `count(DISTINCT post_id)`:

SELECT count(DISTINCT p.id) AS post_ct
FROM   posts p
JOIN   votes v ON v.post_id = p.id;

The more rows per post there are in `votes`, the bigger the difference in performance. Test with `EXPLAIN ANALYZE`.

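`count(DISTINCT post_id)` has to read all rows, sort or hash them, and then only consider the first row per identical set. `EXISTS` only scans `votes` (or, preferably, an index on `post_id`) until the first match is found.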
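
One way to run that comparison, sketched on made-up data. The table sizes, the modulo distribution, and the index name `votes_post_id_idx` are assumptions for illustration and are not the benchmark setup used further down. The temporary tables shadow any permanent `posts` / `votes` tables for the rest of the session, so run this in a scratch session.

```sql
-- Hypothetical test data: 1000 posts, 50000 votes spread over posts 1..800.
CREATE TEMP TABLE posts (id int PRIMARY KEY);
CREATE TEMP TABLE votes (post_id int REFERENCES posts(id));

INSERT INTO posts SELECT generate_series(1, 1000);
INSERT INTO votes SELECT (g % 800) + 1 FROM generate_series(1, 50000) g;

CREATE INDEX votes_post_id_idx ON votes (post_id);
ANALYZE posts;
ANALYZE votes;

-- EXISTS variant: can stop at the first matching vote per post
EXPLAIN ANALYZE
SELECT count(*) AS post_ct
FROM   posts p
WHERE  EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);

-- count(DISTINCT) variant: has to process every joined row
EXPLAIN ANALYZE
SELECT count(DISTINCT p.id) AS post_ct
FROM   posts p
JOIN   votes v ON v.post_id = p.id;
```

Compare the reported execution times and the plan nodes of the two statements. Timings depend on hardware, data distribution, and Postgres version, so none are shown here.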

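If every `post_id` in `votes` is guaranteed to be present in the table `posts` (referential integrity enforced with a foreign key constraint), this short form is equivalent to the longer form: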

SELECT count(DISTINCT post_id) AS post_ct
FROM   votes;

It may actually be faster than the `EXISTS` query when there are no or few entries per post.
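
The foreign key condition matters because the short form never looks at `posts`. A small sketch with made-up rows and no foreign key, where `post_id` 99 is a hypothetical orphan, shows the two forms diverging:

```sql
-- Hypothetical data without referential integrity:
-- post 99 has a vote but no row in posts.
WITH posts(id) AS (
   VALUES (1), (2), (3)
), votes(post_id) AS (
   VALUES (1), (1), (2), (99)
)
SELECT (SELECT count(DISTINCT post_id)
        FROM   votes)                       AS short_form  -- 3: counts 1, 2 and orphan 99
     , (SELECT count(DISTINCT p.id)
        FROM   posts p
        JOIN   votes v ON v.post_id = p.id) AS long_form;  -- 2: only posts 1 and 2
```

With a foreign key constraint in place, a row like the one for post 99 cannot exist, and both forms return the same number.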

Your query also works in a simpler form:

SELECT count(*) AS post_ct
FROM  (
    SELECT FROM posts 
    JOIN   votes ON votes.post_id = posts.id 
    GROUP  BY posts.id
    ) sub;

Benchmark

To verify my claims I ran a benchmark on my test server with limited resources. All in a separate schema:

Test setup

Fake a typical post / vote situation:

CREATE SCHEMA y;
SET search_path = y;

CREATE TABLE posts (
  id   int PRIMARY KEY
, post text
);

INSERT INTO posts
SELECT g, repeat(chr(g%100 + 32), (random()* 500)::int)  -- random text
FROM   generate_series(1,10000) g;

DELETE FROM posts WHERE random() > 0.9;  -- create ~ 10 % dead tuples

CREATE TABLE votes (
  vote_id serial PRIMARY KEY
, post_id int REFERENCES posts(id)
, up_down bool
);

INSERT INTO votes (post_id, up_down)
SELECT g.* 
FROM  (
   SELECT ((random()* 21)^3)::int + 1111 AS post_id  -- uneven distribution
        , random()::int::bool AS up_down
   FROM   generate_series(1,70000)
   ) g
JOIN   posts p ON p.id = g.post_id;

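All of the following queries returned the same result (8093 of 9107 posts had votes).
I ran 4 tests with `EXPLAIN ANALYZE` and took the best of five on Postgres 9.1.4 for each of the three queries, recording the resulting total runtimes.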

1. As is.
2. After ..
```
ANALYZE posts;
ANALYZE votes;
```
3. After ..
```
CREATE INDEX foo on votes(post_id);
```

4. After ..

VACUUM FULL ANALYZE posts;
CLUSTER votes using foo;

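`count(*) ... WHERE EXISTS`

1. 253 ms
2. 220 ms
3. 85 ms -- winner (seq scan on posts, index scan on votes, nested loop)
4. 85 ms

`count(DISTINCT x)` - long form with join

1. 354 ms
2. 358 ms
3. 373 ms -- (index scan on posts, index scan on votes, merge join)
4. 330 ms

`count(DISTINCT x)` - short form without join

1. 164 ms
2. 164 ms
3. 164 ms -- (always seq scan)
4. 142 ms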

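Best time for the original query in the question:

353 ms

For the simplified version:

348 ms

@wildplasser's query with a CTE uses the same plan as the long form (index scan on posts, index scan on votes, merge join) plus a little overhead for the CTE. Best time:

366 ms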

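Index-only scans in the upcoming PostgreSQL 9.2 can improve the result for each of these queries, most of all for `EXISTS`.

Related, more detailed benchmark for Postgres 9.5 (actually retrieving distinct rows, not just counting):

Select first row in each GROUP BY group?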

Answer 3

Using `OVER()` and `LIMIT 1`:

SELECT COUNT(1) OVER()
FROM posts 
   INNER JOIN votes ON votes.post_id = posts.id 
GROUP BY posts.id
LIMIT 1;
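
To see why this works: window functions are evaluated after `GROUP BY`, so `count(*) OVER()` counts the grouped rows and attaches that same total to each of them. The sketch below uses made-up `VALUES` rows that reproduce the question's data and drops `LIMIT 1` so every row is visible; the column aliases are only illustrative.

```sql
-- Hypothetical sample data matching the question's result.
WITH posts(id) AS (
   VALUES (1), (4), (5), (6), (9), (10)
), votes(post_id) AS (
   VALUES (1), (1), (1), (4), (4), (4), (5), (5), (5), (6), (9), (10)
)
SELECT posts.id        AS post_id
     , count(*)        AS votes_per_post  -- plain aggregate: one value per group
     , count(*) OVER() AS group_ct        -- window over all grouped rows: same total on every row
FROM   posts
JOIN   votes ON votes.post_id = posts.id
GROUP  BY posts.id
ORDER  BY posts.id;
-- six rows, each with group_ct = 6
```

One difference from the subquery variants: if the join produces no rows at all, the `LIMIT 1` query returns no row instead of a row containing 0.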

Answer 4

WITH uniq AS (
        SELECT DISTINCT posts.id as post_id
        FROM posts
        JOIN votes ON votes.post_id = posts.id
        -- GROUP BY not needed anymore
        -- GROUP BY posts.id
        )
SELECT COUNT(*)
FROM uniq;

Answer 5

For followers: I like the OP's inner query approach:

SELECT COUNT(*) FROM (
    SELECT COUNT(*) as count_all, posts.id as post_id 
    FROM posts 
    INNER JOIN votes ON votes.post_id = posts.id 
    GROUP BY posts.id
) as x;

Because then you can also use HAVING in there:

SELECT COUNT(*) FROM (
    SELECT COUNT(*) as count_all, posts.id as post_id 
    FROM posts 
    INNER JOIN votes ON votes.post_id = posts.id 
    GROUP BY posts.id HAVING count(*) > 1
) as x;

Or the equivalent CTE:

with posts_coalesced as (
     SELECT COUNT(*) as count_all, posts.id as post_id 
        FROM posts 
        INNER JOIN votes ON votes.post_id = posts.id 
        GROUP BY posts.id )

select count(*) from posts_coalesced;
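
As a worked example of the HAVING variant above, here it is run against made-up `VALUES` rows that reproduce the question's data. Only posts 1, 4 and 5 have more than one vote, so the count drops from 6 to 3.

```sql
-- Hypothetical sample data matching the question's result.
WITH posts(id) AS (
   VALUES (1), (4), (5), (6), (9), (10)
), votes(post_id) AS (
   VALUES (1), (1), (1), (4), (4), (4), (5), (5), (5), (6), (9), (10)
)
SELECT count(*) AS post_ct
FROM  (
   SELECT posts.id
   FROM   posts
   JOIN   votes ON votes.post_id = posts.id
   GROUP  BY posts.id
   HAVING count(*) > 1   -- keep only posts with more than one vote
   ) x;
-- post_ct = 3
```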
