慢计数终于解决了

Question

tl; dr：|> Repo.aggregate(:count, :id)很慢，请使用|> Repo.aggregate(:count)

我正在运行一个播客数据库，其中包含超过500万集。存储新剧集后，我将给定播客的剧集计数为这样的计数器缓存：

episodes_count = where(Episode, podcast_id: ^podcast_id)
                 |> Repo.aggregate(:count, :id)

事实证明速度越来越慢。所以我开始更深入地了解，我意识到在Postgres 12中，仅SELECT COUNT(*)不执行索引扫描，而SELECT COUNT(e0.id)不进行索引。

对于冷数据库（刚刚重新启动），即使第一次索引扫描也相当快：

postgres=# \c pan_prod
Sie sind jetzt verbunden mit der Datenbank »pan_prod« als Benutzer »postgres«.
pan_prod=# EXPLAIN ANALYZE SELECT count(*) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
                                                                         QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=348.51..348.52 rows=1 width=8) (actual time=15.823..15.823 rows=1 loops=1)
   ->  Index Only Scan using episodes_podcast_id_index on episodes e0  (cost=0.43..323.00 rows=10204 width=0) (actual time=1.331..14.832 rows=10613 loops=1)
         Index Cond: (podcast_id = 35202)
         Heap Fetches: 0
 Planning Time: 2.994 ms
 Execution Time: 16.017 ms

第二次扫描甚至更快：

pan_prod=# EXPLAIN ANALYZE SELECT count(*) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=348.51..348.52 rows=1 width=8) (actual time=5.007..5.008 rows=1 loops=1)
   ->  Index Only Scan using episodes_podcast_id_index on episodes e0  (cost=0.43..323.00 rows=10204 width=0) (actual time=0.042..3.548 rows=10613 loops=1)
         Index Cond: (podcast_id = 35202)
         Heap Fetches: 0
 Planning Time: 0.304 ms
 Execution Time: 5.074 ms

虽然第一次位图堆扫描非常慢：

pan_prod=# EXPLAIN ANALYZE SELECT count(e0.id) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=37181.71..37181.72 rows=1 width=8) (actual time=4098.525..4098.526 rows=1 loops=1)
   ->  Bitmap Heap Scan on episodes e0  (cost=219.51..37156.20 rows=10204 width=4) (actual time=6.508..4082.558 rows=10613 loops=1)
         Recheck Cond: (podcast_id = 35202)
         Heap Blocks: exact=6516
         ->  Bitmap Index Scan on episodes_podcast_id_index  (cost=0.00..216.96 rows=10204 width=0) (actual time=3.657..3.658 rows=10613 loops=1)
               Index Cond: (podcast_id = 35202)
 Planning Time: 0.412 ms
 Execution Time: 4098.719 ms

第二个通常更快：

pan_prod=# EXPLAIN ANALYZE SELECT count(e0.id) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=37181.71..37181.72 rows=1 width=8) (actual time=18.857..18.857 rows=1 loops=1)
   ->  Bitmap Heap Scan on episodes e0  (cost=219.51..37156.20 rows=10204 width=4) (actual time=6.047..17.152 rows=10613 loops=1)
         Recheck Cond: (podcast_id = 35202)
         Heap Blocks: exact=6516
         ->  Bitmap Index Scan on episodes_podcast_id_index  (cost=0.00..216.96 rows=10204 width=0) (actual time=3.738..3.738 rows=10613 loops=1)
               Index Cond: (podcast_id = 35202)
 Planning Time: 0.322 ms
 Execution Time: 18.999 ms

我不明白，为什么SELECT count(e0.id)不使用索引，我想知道为什么。我一直认为，我应该更喜欢它，因为只查看一列，但事实并非如此。

Answer 1

我不知道，为什么SELECT count（e0.id）不使用索引，我想知道为什么。我一直认为，我应该更喜欢它，因为只查看一列，但事实并非如此。

它确实都使用索引episodes_podcast_id_index。

一旦它可以不用另外查看表，而另一次它确实可以查询表。

您未提供索引定义，但似乎id不包含在其中。我还要假设它不是not null able列（它可能包含null值）。因此，数据库必须获取id列以查看其是否为null。

为什么？

因为count(*)对行进行计数，而count(<expression>)对非空值进行计数。

参见：https://modern-sql.com/concept/null#aggregates

慢计数终于解决了

问题描述投票：1回答：1

1个回答

最新问题

慢计数终于解决了

问题描述 投票：1回答：1

1个回答

最新问题

问题描述投票：1回答：1