Performance of max() vs ORDER BY DESC + LIMIT 1


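I was troubleshooting a few slow SQL queries today and don't quite understand the performance difference below:

When trying to extract the max(timestamp) from a data table based on some condition, using MAX() is slower than ORDER BY timestamp DESC LIMIT 1 if a matching row exists, but considerably faster if no matching row is found.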

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)  
Time: 1314.544 ms

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)  
Time: 10.890 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 4;
(0 rows)
Time: 0.869 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 5;
(1 row)
Time: 84.087 ms 

There are indexes on (timestamp) and (sensor_id, timestamp), and I noticed that Postgres uses very different query plans and indexes for the two cases:

QUERY PLAN (ORDER BY)                                              
--------------------------------------------------------------------------------------------------------
Limit  (cost=0.43..9.47 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..396254.63 rows=43823 width=8)
          Join Filter: (data.sensor_id = sensors.id)
          ->  Index Scan using timestamp_ind on data  (cost=0.43..254918.66 rows=4710976 width=12)
          ->  Materialize  (cost=0.00..6.70 rows=2 width=4)
              ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
                  Filter: (station_id = 4)
(7 rows)

QUERY PLAN (MAX)                                               
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3680.59..3680.60 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..3571.03 rows=43823 width=8)
        ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
              Filter: (station_id = 4)
        ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12)
              Index Cond: (sensor_id = sensors.id)
(6 rows)

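So my two questions are:

  1. Where does this performance difference come from? I've seen the accepted answer to "MIN/MAX vs ORDER BY and LIMIT", but that doesn't quite seem to apply here. Any good resources would be appreciated.
  2. Is there a better way to improve performance in all cases (matching row vs no matching row) than adding an EXISTS check?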

EDIT to address the questions in the comments below. I kept the initial query plans above for future reference:

Table definitions:

                                  Table "public.sensors"
        Column        |          Type          |                            Modifiers                            
----------------------+------------------------+-----------------------------------------------------------------
id                    | integer                | not null default nextval('sensors_id_seq'::regclass)
station_id            | integer                | not null
....

Indexes:
    "sensor_primary" PRIMARY KEY, btree (id)
    "ind_station_id" btree (station_id, id)
    "ind_station" btree (station_id)

                                  Table "public.data"
  Column   |           Type           |                            Modifiers                             
-----------+--------------------------+------------------------------------------------------------------
 id        | integer                  | not null default nextval('data_id_seq'::regclass)
 timestamp | timestamp with time zone | not null
 sensor_id | integer                  | not null
 avg       | integer                  |

Indexes:
    "timestamp_ind" btree ("timestamp" DESC)
    "sensor_ind" btree (sensor_id)
    "sensor_ind_timestamp" btree (sensor_id, "timestamp")
    "sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)

Note that I just added the index ind_station_id on sensors, following @Erwin's suggestion below. The timings haven't really changed much: still >1200ms in the ORDER BY DESC + LIMIT 1 case and ~0.9ms in the MAX case.

Query plans:

QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit  (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
  Buffers: shared hit=3418066 read=47326
  ->  Nested Loop  (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
        Join Filter: (data.sensor_id = sensors.id)
        Buffers: shared hit=3418066 read=47326
        ->  Index Scan using timestamp_ind on data  (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
              Buffers: shared hit=3418065 read=47326
        ->  Materialize  (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
              Buffers: shared hit=1
              ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
                    Index Cond: (station_id = 4)
                    Heap Fetches: 0
                    Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)

QUERY PLAN (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
   Buffers: shared hit=1
   ->  Nested Loop  (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
         Buffers: shared hit=1
         ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
               Index Cond: (station_id = 4)
               Heap Fetches: 0
               Buffers: shared hit=1
         ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12) (never executed)
               Index Cond: (sensor_id = sensors.id)
               Heap Fetches: 0
 Planning time: 0.435 ms
 Execution time: 0.048 ms
 (13 rows)

So, just as explained earlier, the ORDER BY plan does an Index Scan using timestamp_ind on data, which is not done in the MAX case.

Postgres version: Postgres from the Ubuntu repositories:

PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit

Note that there are NOT NULL constraints in place, so ORDER BY won't have to sort over NULL values.

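Note also that I'm mostly interested in where the difference comes from. While not ideal, I can retrieve the data relatively quickly using an EXISTS (<1ms) followed by a SELECT (~11ms).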
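
For reference, that two-step workaround might look roughly like this. The exact statements were not shown, so this is only a sketch that assumes the same join and filter as the queries above:

```sql
-- Step 1: cheap existence check; returns false quickly when the
-- station has no matching rows in data.
SELECT EXISTS (
   SELECT 1
   FROM   data
   JOIN   sensors ON sensors.id = data.sensor_id
   WHERE  sensors.station_id = 4
);

-- Step 2: run only when step 1 returned true.
SELECT timestamp
FROM   data
JOIN   sensors ON sensors.id = data.sensor_id
WHERE  sensors.station_id = 4
ORDER  BY timestamp DESC
LIMIT  1;
```

The application has to issue both statements and branch on the first result, which is why this is not ideal.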

sql postgresql max aggregate sql-limit
2 Answers
25
votes

A matching index on sensors.station_id makes a huge difference for performance.

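There is an actual difference between max() and ORDER BY DESC + LIMIT 1. NULL values sort last in the default ascending sort order, so they sort first in descending order. If a NULL exists, ORDER BY timestamp DESC LIMIT 1 returns NULL, while the aggregate function max() ignores NULL values and returns the latest non-null timestamp. ORDER BY timestamp DESC NULLS LAST LIMIT 1 is the equivalent.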
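
A minimal sketch of that difference, using made-up inline data (one real timestamp and one NULL) rather than the tables from the question:

```sql
-- Hypothetical data for illustration only: t(ts) is not part of the question's schema.
WITH t(ts) AS (
   VALUES (timestamptz '2024-01-01 00:00+00'), (NULL)
)
SELECT (SELECT max(ts) FROM t)                                AS max_ts,           -- the timestamp: max() ignores NULL
       (SELECT ts FROM t ORDER BY ts DESC LIMIT 1)            AS desc_limit,       -- NULL: NULLs sort first with DESC
       (SELECT ts FROM t ORDER BY ts DESC NULLS LAST LIMIT 1) AS desc_nulls_last;  -- the timestamp: same as max()
```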

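Since your column d.timestamp is defined NOT NULL (as your update revealed), there is no effective difference. An index with DESC NULLS LAST, together with the same clause in the ORDER BY of the LIMIT query, should still serve you best. I suggest these indexes: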

sensors (station_id, id)
data (sensor_id, timestamp DESC NULLS LAST)

My queries below build on the second index. Drop the indexes sensor_ind_timestamp and sensor_ind_timestamp_desc unless they have other uses.
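
As DDL, that indexing advice could be sketched as follows. The index names are invented for this example, and the DROP statements are left commented out because they only make sense once you have confirmed that nothing else relies on those indexes:

```sql
-- Index names are made up for this sketch.
-- sensors (station_id, id): the question's existing ind_station_id
-- already has this shape, so this one may be redundant.
CREATE INDEX sensors_station_id_id_idx
    ON sensors (station_id, id);

-- data (sensor_id, timestamp DESC NULLS LAST): matches the ORDER BY
-- used by the per-sensor LIMIT 1 subqueries.
CREATE INDEX data_sensor_id_timestamp_idx
    ON data (sensor_id, "timestamp" DESC NULLS LAST);

-- Only if nothing else depends on them:
-- DROP INDEX sensor_ind_timestamp;
-- DROP INDEX sensor_ind_timestamp_desc;
```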

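More importantly, the filter on the first table sensors returns few, but still (possibly) multiple rows. According to the query plan you added, Postgres expects to find 2 rows (rows=2).
The perfect technique would be an index skip scan (a.k.a. loose index scan) on the second table data, but that is not implemented yet (as of Postgres 16). There are workarounds.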

This should perform best:

SELECT d.timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;

The choice between max() and ORDER BY / LIMIT hardly matters. You might as well use:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4;

Or:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT max(timestamp) AS timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ) d
WHERE  s.station_id = 4;

Or even with a correlated subquery, which is the shortest form:

SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM   sensors s
WHERE  station_id = 4;

Note the double parentheses!

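The advantage of the LATERAL subquery is that you can retrieve arbitrary columns of the selected row, not just the latest timestamp (one value).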
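
For instance, a variant of the first LATERAL query that also returns the sensor and the reading stored with the latest timestamp might look like this (a sketch; the extra columns are taken from the data table definition in the question):

```sql
SELECT d.sensor_id, d.timestamp, d.avg
FROM   sensors s
CROSS  JOIN LATERAL (
   SELECT sensor_id, timestamp, avg   -- any columns of the selected row
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1                           -- latest row per sensor
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;                             -- latest row across the station's sensors
```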



2
votes

The query plan shows the index names timestamp_ind and timestamp_sensor_ind. But an index like that does not help with a search for a particular sensor.

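To resolve an equality condition (like sensors.id = data.sensor_id), the column has to be the first column in the index. Try adding an index that allows searching on sensor_id and is sorted by timestamp within each sensor: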

create index sensor_timestamp_ind on data(sensor_id, timestamp);

Does adding that index speed up the query?
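
One way to check is to look at the plan again after creating the index. A sketch, assuming the MAX query from the question and station 5 (the station that has matching rows); the plans in the question were presumably produced with the same options:

```sql
-- ANALYZE actually executes the query and reports real timings;
-- BUFFERS adds the shared-buffer hit/read counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT max(timestamp)
FROM   data
JOIN   sensors ON sensors.id = data.sensor_id
WHERE  sensors.station_id = 5;
```

If the index is used, it should appear by name in an Index Scan or Index Only Scan node on data.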
