MariaDB / MySQL: performance of COUNT(DISTINCT) with GROUP BY

Question

MariaDB version:

select version();
version()                                |
-----------------------------------------+
10.4.24-MariaDB-1:10.4.24+maria~focal-log|

I have a table like the one in the SQL below. Column a represents a version; columns b, c, d, e are like database, schema, table, and field.

CREATE TABLE tt (a int, b varchar(32), c varchar(64), d varchar(64), e varchar(64), f int);
CREATE INDEX tt_a_IDX USING BTREE ON tt (a,b,c,d,e);

The table has 510114 rows.

The first question is about the difference between two queries that count col e for a specific version.

sql1. select count(DISTINCT c,d,e),b from tt where a = 1 group by b;
sql2. select sum(count), b from (
                select b,COUNT(DISTINCT e) as count from tt 
                where a = 1
                GROUP BY  b,c,d) tt group by b;

id|select_type|table|type|possible_keys|key     |key_len|ref  |rows  |Extra                   |
--+-----------+-----+----+-------------+--------+-------+-----+------+------------------------+
 1|SIMPLE     |tt   |ref |tt_a_IDX     |tt_a_IDX|5      |const|253768|Using where; Using index|

id|select_type|table     |type|possible_keys|key     |key_len|ref  |rows  |Extra                          |
--+-----------+----------+----+-------------+--------+-------+-----+------+-------------------------------+
 1|PRIMARY    |<derived2>|ALL |             |        |       |     |253768|Using temporary; Using filesort|
 2|DERIVED    |tt        |ref |tt_a_IDX     |tt_a_IDX|5      |const|253768|Using where; Using index       |

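sql1 takes about 4 s on average; sql2 takes only a few hundred ms.

q1. Why is the second query faster? Does this mean that count(distinct multiple cols...) can be replaced by "group by multiple cols and sum" for better performance?

The second question: when I add column a to the GROUP BY, the plan shows a range access type, but the query actually takes more time.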

sql3. select sum(count), b from (
                select b,COUNT(DISTINCT e) as count from tt 
                where a = 1
                GROUP BY  a,b,c,d) temp group by b;

id|select_type|table     |type |possible_keys|key     |key_len|ref|rows  |Extra                                           |
--+-----------+----------+-----+-------------+--------+-------+---+------+------------------------------------------------+
 1|PRIMARY    |<derived2>|ALL  |             |        |       |   |253768|Using temporary; Using filesort                 |
 2|DERIVED    |tt        |range|tt_a_IDX     |tt_a_IDX|689    |   |253768|Using where; Using index for group-by (scanning)|

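It takes about 3 s on average.

q2. What is happening here?

Because of the key_len shown in the plan, I dropped the index and created a new index on col a only: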

explain select count(DISTINCT c,d,e),b from tt where a = 1 group by b  ;
explain select sum(count), b from (
                select b,COUNT(DISTINCT e) as count from tt 
                where a = 1
                GROUP BY  b,c,d) temp group by b ;
id|select_type|table|type|possible_keys|key     |key_len|ref  |rows  |Extra                      |
--+-----------+-----+----+-------------+--------+-------+-----+------+---------------------------+
 1|SIMPLE     |tt   |ref |tt_a_IDX     |tt_a_IDX|5      |const|253768|Using where; Using filesort|
id|select_type|table     |type|possible_keys|key     |key_len|ref  |rows  |Extra                          |
--+-----------+----------+----+-------------+--------+-------+-----+------+-------------------------------+
 1|PRIMARY    |<derived2>|ALL |             |        |       |     |253768|Using temporary; Using filesort|
 2|DERIVED    |tt        |ref |tt_a_IDX     |tt_a_IDX|5      |const|253768|Using where; Using filesort    |

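The EXPLAIN output differs little, but both queries now take longer, roughly 4 to 5 s on average.

q3. Does this mean the whole index (a,b,c,d,e) was actually being used, contrary to what the plan shows (key_len)?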
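
As a rough aid for reading the key_len values in the plans, here is a minimal sketch of how such a value can be added up. It assumes 3 bytes per character for the varchar columns (the question does not state the character set) and relies on the columns being nullable, since the CREATE TABLE has no NOT NULL. The helper names are hypothetical.

```python
# Assumption (not stated in the question): 3 bytes per character.
BYTES_PER_CHAR = 3
NULL_FLAG = 1    # one extra byte for a nullable key part
VARCHAR_LEN = 2  # two bytes that store the VARCHAR length


def int_key_len() -> int:
    """Key length of a nullable INT key part."""
    return 4 + NULL_FLAG


def varchar_key_len(chars: int) -> int:
    """Key length of a nullable VARCHAR(chars) key part."""
    return chars * BYTES_PER_CHAR + VARCHAR_LEN + NULL_FLAG


key_len_a = int_key_len()
# a + b varchar(32) + c, d, e varchar(64)
key_len_all = key_len_a + varchar_key_len(32) + 3 * varchar_key_len(64)
print(key_len_a, key_len_all)  # 5 689
```

Under these assumptions, key_len 5 corresponds to a lookup on column a alone, and 689 corresponds to all five key parts, which matches the used_key_parts lists in the ANALYZE output. With "Using index", the remaining columns are still read from the covering index even when key_len is 5.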

ANALYZE results:

ANALYZE FORMAT=JSON select count(DISTINCT c,d,e),b 
from tt where a = 1 group by b; 

{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 3222.2,
    "table": {
      "table_name": "tt",
      "access_type": "ref",
      "possible_keys": ["tt_a_IDX"],
      "key": "tt_a_IDX",
      "key_length": "5",
      "used_key_parts": ["a"],
      "ref": ["const"],
      "r_loops": 1,
      "rows": 253768,
      "r_rows": 510114,
      "r_total_time_ms": 395.73,
      "filtered": 100,
      "r_filtered": 100,
      "attached_condition": "tt.a <=> 1",
      "using_index": true
    }
  }
}

ANALYZE FORMAT=JSON select sum(count), b
from
    (
    select b, COUNT(DISTINCT e) as count
    from tt where a = 1 GROUP BY b, c, d
    ) temp
group by b;

{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 662.82,
    "filesort": {
      "sort_key": "tt.b",
      "r_loops": 1,
      "r_total_time_ms": 0.0086,
      "r_limit": 200,
      "r_used_priority_queue": false,
      "r_output_rows": 16,
      "r_buffer_size": "2Kb",
      "temporary_table": {
        "table": {
          "table_name": "<derived2>",
          "access_type": "ALL",
          "r_loops": 1,
          "rows": 253768,
          "r_rows": 2652,
          "r_total_time_ms": 0.1939,
          "filtered": 100,
          "r_filtered": 100,
          "materialized": {
            "query_block": {
              "select_id": 2,
              "r_loops": 1,
              "r_total_time_ms": 661.61,
              "table": {
                "table_name": "tt",
                "access_type": "ref",
                "possible_keys": ["tt_a_IDX"],
                "key": "tt_a_IDX",
                "key_length": "5",
                "used_key_parts": ["a"],
                "ref": ["const"],
                "r_loops": 1,
                "rows": 253768,
                "r_rows": 510114,
                "r_total_time_ms": 245.18,
                "filtered": 100,
                "r_filtered": 100,
                "attached_condition": "tt.a <=> 1",
                "using_index": true
              }
            }
          }
        }
      }
    }
  }
}

ANALYZE FORMAT=JSON select sum(count), b
from
    (
    select b, COUNT(DISTINCT e) as count
    from tt where a = 1 GROUP BY a,b, c, d
    ) temp
group by b;

{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 3401.7,
    "filesort": {
      "sort_key": "tt.b",
      "r_loops": 1,
      "r_total_time_ms": 0.0093,
      "r_limit": 200,
      "r_used_priority_queue": false,
      "r_output_rows": 16,
      "r_buffer_size": "2Kb",
      "temporary_table": {
        "table": {
          "table_name": "<derived2>",
          "access_type": "ALL",
          "r_loops": 1,
          "rows": 253768,
          "r_rows": 2652,
          "r_total_time_ms": 0.406,
          "filtered": 100,
          "r_filtered": 100,
          "materialized": {
            "query_block": {
              "select_id": 2,
              "r_loops": 1,
              "r_total_time_ms": 3400.5,
              "table": {
                "table_name": "tt",
                "access_type": "range",
                "possible_keys": ["tt_a_IDX"],
                "key": "tt_a_IDX",
                "key_length": "689",
                "used_key_parts": ["a", "b", "c", "d", "e"],
                "r_loops": 1,
                "rows": 253768,
                "r_rows": 510071,
                "r_total_time_ms": 3091.2,
                "filtered": 100,
                "r_filtered": 100,
                "attached_condition": "tt.a = 1",
                "using_index_for_group_by": "scanning"
              }
            }
          }
        }
      }
    }
  }
}

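Update 1: I tried importing the data from MariaDB into MySQL 8. There was no obvious difference in how the three queries behave.

Update 2: EXPLAIN ANALYZE results in MySQL 8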

EXPLAIN ANALYZE select count(DISTINCT c,d,e),b 
from tt where a = 1 group by b; 
-> Group aggregate: count(distinct tt.c,tt.d,tt.e)  (cost=71893.42 rows=253768) (actual time=0.882..3453.691 rows=16 loops=1)
    -> Covering index lookup on tt using tt_a_IDX (a=1)  (cost=46516.62 rows=253768) (actual time=0.030..411.063 rows=510114 loops=1)

EXPLAIN ANALYZE select sum(count), b
from
    (
    select b, COUNT(DISTINCT e) as count
    from tt where a = 1 GROUP BY b, c, d
    ) temp
group by b;
-> Table scan on <temporary>  (actual time=1113.183..1113.185 rows=16 loops=1)
    -> Aggregate using temporary table  (actual time=1113.182..1113.182 rows=16 loops=1)
        -> Table scan on temp  (cost=97270.23..100444.82 rows=253768) (actual time=1111.180..1111.563 rows=2652 loops=1)
            -> Materialize  (cost=97270.22..97270.22 rows=253768) (actual time=1111.177..1111.177 rows=2652 loops=1)
                -> Group aggregate: count(distinct tt.e)  (cost=71893.42 rows=253768) (actual time=0.153..1109.965 rows=2652 loops=1)
                    -> Covering index lookup on tt using tt_a_IDX (a=1)  (cost=46516.62 rows=253768) (actual time=0.043..398.208 rows=510114 loops=1)

EXPLAIN ANALYZE select sum(count), b
from
    (
    select b, COUNT(DISTINCT e) as count
    from tt where a = 1 GROUP BY a,b, c, d
    ) temp
group by b;
-> Table scan on <temporary>  (actual time=4762.307..4762.309 rows=16 loops=1)
    -> Aggregate using temporary table  (actual time=4762.306..4762.306 rows=16 loops=1)
        -> Table scan on temp  (cost=76130.41..79305.00 rows=253768) (actual time=4760.310..4760.708 rows=2652 loops=1)
            -> Materialize  (cost=76130.40..76130.40 rows=253768) (actual time=4760.308..4760.308 rows=2652 loops=1)
                -> Group aggregate: count(distinct tt.e)  (cost=50753.60 rows=253768) (actual time=0.388..4757.575 rows=2652 loops=1)
                    -> Filter: (tt.a = 1)  (cost=25376.80 rows=253768) (actual time=0.055..4593.786 rows=510071 loops=1)
                        -> Covering index skip scan for deduplication on tt using tt_a_IDX over (a = 1)  (cost=25376.80 rows=253768) (actual time=0.051..4535.554 rows=510071 loops=1)

Update 3: data distribution

select count(distinct a) from tt;
count(distinct a)|
-----------------+
                1|
Note: col a has only one value, which is 1.

select count(distinct a,b) from tt;
count(distinct a,b)|
-------------------+
                 16|

select count(distinct a,b,c) from tt;
count(distinct a,b,c)|
---------------------+
                   28|

select count(distinct a,b,c,d) from tt;
count(distinct a,b,c,d)|
-----------------------+
                   2652|

select count(distinct a,b,c,d,e) from tt;
count(distinct a,b,c,d,e)|
-------------------------+
                   510071|

select count(distinct a,b,c,d,e,f) from tt;
count(distinct a,b,c,d,e,f)|
---------------------------+
                      49680|

Answer 1

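In the q1 query, the implementation of COUNT(DISTINCT ..) maintains a tree of the elements encountered while traversing the index and computes the result at the end. When such a list is traversed in order, only the difference from the previous element needs to be considered.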
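
The difference between the two strategies can be sketched as follows. This is only an illustration, not MariaDB's code: a Python set stands in for the tree, and the function names and sample rows are hypothetical.

```python
from typing import Iterable, Tuple

Row = Tuple[str, str, str]  # hypothetical (c, d, e) values
_NO_ROW = object()          # sentinel meaning "no previous row yet"


def count_distinct_remember_all(rows: Iterable[Row]) -> int:
    """Keep every value seen and count at the end (stand-in for the tree)."""
    seen = set()
    for row in rows:
        seen.add(row)
    return len(seen)


def count_distinct_sorted(rows: Iterable[Row]) -> int:
    """Rows arrive in index order, so compare with the previous row only."""
    count = 0
    previous = _NO_ROW
    for row in rows:
        if row != previous:
            count += 1
        previous = row
    return count


# Already sorted, as rows read in index order would be.
sample = [
    ("s1", "t1", "f1"),
    ("s1", "t1", "f1"),
    ("s1", "t1", "f2"),
    ("s1", "t2", "f1"),
]
print(count_distinct_remember_all(sample), count_distinct_sorted(sample))  # 3 3
```

The second function needs constant memory and one comparison per row, but it is only correct when equal rows are adjacent, which is what reading in index order provides.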

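This could be implemented better, and I have filed feature request MDEV-32870. There is also an existing feature request, MDEV-10922, that applies to primary keys (where whatever is inside can be discarded and the expression treated as COUNT(*)).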

The other queries manage to produce smaller sets at a time, and so manage to be faster.
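
To illustrate the "smaller sets" point, here is a minimal sketch that mimics sql1 and sql2 on a few hypothetical rows (not taken from the question's table). It assumes none of c, d, e is NULL: COUNT(DISTINCT c,d,e) skips rows where any listed column is NULL, while GROUP BY b,c,d keeps NULL groups, so the two forms can differ on nullable data.

```python
from collections import defaultdict

# Hypothetical sample rows (b, c, d, e).
rows = [
    ("db1", "s1", "t1", "f1"),
    ("db1", "s1", "t1", "f1"),  # duplicate row
    ("db1", "s1", "t1", "f2"),
    ("db1", "s1", "t2", "f1"),
    ("db2", "s1", "t1", "f1"),
]


def distinct_per_b(rows):
    """Like sql1: one set of (c, d, e) tuples per b."""
    sets = defaultdict(set)
    for b, c, d, e in rows:
        sets[b].add((c, d, e))
    return {b: len(s) for b, s in sets.items()}


def sum_of_group_counts(rows):
    """Like sql2: one small set of e per (b, c, d), then a sum per b."""
    sets = defaultdict(set)
    for b, c, d, e in rows:
        sets[(b, c, d)].add(e)
    totals = defaultdict(int)
    for (b, _c, _d), s in sets.items():
        totals[b] += len(s)
    return dict(totals)


print(distinct_per_b(rows))       # {'db1': 3, 'db2': 1}
print(sum_of_group_counts(rows))  # {'db1': 3, 'db2': 1}
```

Both forms give the same counts here, but the second holds many small sets of single values instead of a few large sets of three-column tuples. With the question's data that would be 2652 groups instead of 16.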
