Optimizing a join query on large tables


I have a project that analyzes event logs. The goal is to compare timestamps against a parent table (Table1) within a 30-second interval.

Table1

id                            command                           datetime
---------------------       -------------------               ----------------------
 1                              cat                            2018-11-03 23:29:31
 2                              nmap                           2018-11-03 23:22:32
 3                              ssh                            2018-11-03 23:22:40

Event log table

    id                              raw                              datetime
---------------                  ------------                  --------------------
     1                              text                         2018-11-03 23:23:10
     2                              text                         2018-11-03 23:23:20

So, based on the Table1 datetime, I want to output all event log entries that were triggered within a time interval (for example, 30 seconds) of it.

Right now I use this left join statement, which works fine for small tables (less than 1 MB):

    SELECT table1.command as Command,table2.raw as Nginx,Table3.raw as Apache
    FROM Table1
    left join Table2  
    on Table1.datetime::timestamp>= Table2.datetime::timestamp - interval '30 seconds'
    and Table1.datetime::timestamp<= Table2.datetime::timestamp + interval '30 seconds'
    left join 
    Table3  on table1.datetime::timestamp>= Table3.datetime::timestamp - interval '1   seconds'
    and Table1.datetime::timestamp<= Table3.datetime::timestamp + interval '30 seconds'

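It works and gives me the output I want. The problem is that my tables have 200K+ rows, and the query takes a long time to run. It is not critical that it runs really fast, but if I join 3 tables (Table1 from the example plus 2 other tables with 200K+ rows each), the query takes more than 5 hours.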

Below is the EXPLAIN output, to help you understand:

     Nested Loop Left Join  (cost=0.00..930881273799.88 rows=1202913100267 width=1819)
       Join Filter: ((b1.datetime >= (s1.datetime - '00:00:30'::interval)) AND (b1.datetime <= (s1.datetime + '00:00:30'::interval)))
       ->  Nested Loop Left Join  (cost=0.00..60290628.13 rows=36384533 width=1343)
             Join Filter: ((b1.datetime <= s2.datetime) AND (b1.datetime >= (s2.datetime - '00:00:30'::interval)))
             ->  Seq Scan on bash b1  (cost=0.00..75.13 rows=4013 width=34)
             ->  Materialize  (cost=0.00..28885.00 rows=81600 width=1317)
                   ->  Seq Scan on suricata__alert s2  (cost=0.00..15089.00 rows=81600 width=1317)
       ->  Materialize  (cost=0.00..43131.25 rows=297550 width=492)
             ->  Seq Scan on suricata__http s1  (cost=0.00..22755.50 rows=297550 width=492) 

Can I optimize the join statement? Should I use a different approach (views, indexes)?

sql postgresql query-performance
1 Answer

An index can only be used if the WHERE condition has the following form:

<indexed expression> <operator> <constant>

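where <operator> must belong to the operator class the index was defined with, and <constant> does not have to be a literal constant; it only needs to have a fixed value for the duration of the index scan.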
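To make the rule concrete, here is a minimal sketch for a single Table1 row, assuming the datetime columns are of type timestamp (the EXPLAIN output shows no casts, which suggests this) and using the sample timestamp from the question as the fixed value. The condition in the question wraps the table2 column in arithmetic, so it is not a bare indexed expression; moving the interval to the other side gives an equivalent condition that matches the required form.

```sql
-- Form used in the question: the table2 column sits inside an expression,
-- so a plain index on table2 (datetime) does not match the condition.
--   TIMESTAMP '2018-11-03 23:22:40' >= table2.datetime - interval '30 seconds'

-- Equivalent form: bare column on the left, fixed value on the right.
SELECT raw
FROM table2
WHERE datetime >= TIMESTAMP '2018-11-03 23:22:40' - interval '30 seconds'
  AND datetime <= TIMESTAMP '2018-11-03 23:22:40' + interval '30 seconds';
```

In the join, the current Table1 row plays the role of the literal timestamp here: its value is fixed while the inner table is scanned.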

So you should rewrite the query as

SELECT table1.command AS Command,
       table2.raw AS Nginx,
       table3.raw AS Apache
FROM Table1
   LEFT JOIN table2  
      ON table2.datetime::timestamp
         BETWEEN table1.datetime::timestamp - interval '30 seconds'
             AND table1.datetime::timestamp + interval '30 seconds'
   LEFT JOIN table3
      ON table3.datetime::timestamp
         BETWEEN table1.datetime::timestamp - interval '30 seconds'
             AND table1.datetime::timestamp + interval '30 seconds';

Make sure that there are indexes on the datetime columns of table2 and table3.
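A minimal sketch of those indexes, assuming the datetime columns are already stored as timestamp, so that the ::timestamp casts in the query are no-ops. The index names are made up for illustration. If the columns were stored as text instead, a plain index would not match the cast expression, and converting the column type first would likely be the cleaner fix.

```sql
-- Hypothetical index names; assumes datetime is a timestamp column.
CREATE INDEX table2_datetime_idx ON table2 (datetime);
CREATE INDEX table3_datetime_idx ON table3 (datetime);

-- Refresh planner statistics so the new indexes are costed sensibly.
ANALYZE table2;
ANALYZE table3;
```

Running EXPLAIN on the rewritten query afterwards should show whether the planner replaced the sequential scans on the inner tables with index scans.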

This may still be slow unless you limit the number of rows retrieved from table1 with a WHERE condition.
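One way to do that, sketched here with an arbitrary one-hour window chosen only for illustration, is to filter table1 by time so that fewer outer rows drive the joins. The casts are omitted on the assumption that the columns are already timestamp.

```sql
SELECT table1.command AS Command,
       table2.raw AS Nginx,
       table3.raw AS Apache
FROM table1
   LEFT JOIN table2
      ON table2.datetime
         BETWEEN table1.datetime - interval '30 seconds'
             AND table1.datetime + interval '30 seconds'
   LEFT JOIN table3
      ON table3.datetime
         BETWEEN table1.datetime - interval '30 seconds'
             AND table1.datetime + interval '30 seconds'
-- Hypothetical window: only commands run during one hour.
WHERE table1.datetime >= TIMESTAMP '2018-11-03 23:00:00'
  AND table1.datetime <  TIMESTAMP '2018-11-04 00:00:00';
```

Because the filter is on table1 only, it belongs in WHERE rather than in the ON clauses, and the left joins still return commands with no matching log entries.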
