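My question is how to get the MySQL optimizer to use a composite index in the most efficient way.
I am using MySQL Server 8 and have a MyISAM table holding daily statistics for "cell" objects over a 2-year period. There are about 51,000 to 57,000 cells (rows) per day. The table is wide: about 860 counter columns. The database cannot be normalized, because all columns are equally important. The query produces about 840 columns of statistics for a user-defined list of cells. Each column is a KPI calculated from one or more raw counters. The query joins the table of cluster definitions, "clusters_cust", with the main statistics table, "h_cell". Each cell in a cluster is matched with the statistics records for the same cell in "h_cell". The user defines a time period, and the report aggregates the results per cluster for each day in that period.
The query looks like this: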
SELECT cluster,Time,
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
INNER JOIN clusters_cust ON clusters_cust.cell = h_cell.cell
WHERE cluster='cluster62' AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time
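EDIT: In response to TheImpaler's comment:
You are joining two tables and applying filters to them. The optimizer does not know whether it is better to access table #1 first and then scan table #2, or vice versa.
At the end of the question there are two modified query variants; unfortunately, they perform either worse or the same.
Table "h_cell" has the following structure: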
mysql> SHOW CREATE TABLE h_cell;
CREATE TABLE `h_cell` (
`Time` date NOT NULL,
`Cell` char(8) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
`LocalCI` tinyint NOT NULL,
`Integrity` varchar(6) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
`counter1` int DEFAULT NULL,
`counter2` int DEFAULT NULL,
`counter3` double DEFAULT NULL,
`counter4` float DEFAULT NULL,
...........
`counter860` int DEFAULT NULL,
PRIMARY KEY (`Cell`,`Time`) USING BTREE,
KEY `Time` (`Time`,`LocalCI`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC
Table "clusters_cust" has the following structure:
mysql> SHOW CREATE TABLE clusters_cust;
CREATE TABLE `clusters_cust` (
`Cell` varchar(11) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
`Cluster` varchar(80) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
`Comment` varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
PRIMARY KEY (`Cluster`,`Cell`),
KEY `Cell` (`Cell`),
KEY `Comment` (`Comment`,`Cluster`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
Indexes on table "h_cell":
mysql> SHOW INDEX FROM h_cell;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| h_cell | 0 | PRIMARY | 1 | Cell | A | 58258 | NULL | NULL | | BTREE | | | YES | NULL |
| h_cell | 0 | PRIMARY | 2 | Time | A | 39090988 | NULL | NULL | | BTREE | | | YES | NULL |
| h_cell | 1 | Time | 1 | Time | A | 730 | NULL | NULL | | BTREE | | | YES | NULL |
| h_cell | 1 | Time | 2 | LocalCI | A | 15081 | NULL | NULL | | BTREE | | | YES | NULL |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
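The primary key is meant to serve cell-level queries, which show statistics for a given cell and time period. I hoped it would also help the cluster-level join query, but it does not.
Below is the EXPLAIN output for a cluster of 62 cells and a 2-month reporting period. It appears that only the first part of the composite primary key, "Cell", is used, and the "Time" part is not: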
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| 1 | SIMPLE | clusters_cust | NULL | ref | PRIMARY,Cell | PRIMARY | 322 | const | 63 | 100.00 | Using where; Using index; Using temporary |
| 1 | SIMPLE | h_cell | NULL | ref | PRIMARY,Time | PRIMARY | 32 | ee_4g_hua.clusters_cust.Cell | 671 | 5.32 | Using index condition |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
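One way to check that reading is to decode the key_len column by hand. The sketch below assumes the usual MySQL key sizes: up to 4 bytes per utf8mb4 character, no length prefix for a NOT NULL CHAR column, a 2-byte length prefix for VARCHAR, and 3 bytes for DATE. The column aliases are made up for illustration.

```sql
-- Decode the key_len values reported by EXPLAIN (aliases are illustrative only).
SELECT
    8 * 4      AS cell_only_bytes,       -- h_cell.Cell char(8): matches key_len = 32
    8 * 4 + 3  AS cell_plus_time_bytes,  -- Cell plus Time (DATE): 35 if both key parts were used
    80 * 4 + 2 AS cluster_only_bytes;    -- clusters_cust.Cluster varchar(80): matches key_len = 322
```

A key_len of 32 rather than 35 on h_cell is consistent with the lookup being driven by "Cell" alone.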
For a larger cluster of 3,000 cells and a 2-month reporting period, the picture is the same: again only the first part, "Cell", is used:
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| 1 | SIMPLE | clusters_cust | NULL | ref | PRIMARY,Cell | PRIMARY | 322 | const | 4067 | 100.00 | Using where; Using index; Using temporary |
| 1 | SIMPLE | h_cell | NULL | ref | PRIMARY,Time | PRIMARY | 32 | ee_4g_hua.clusters_cust.Cell | 671 | 5.32 | Using index condition |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
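But for the same cluster of 3,000 cells and a shorter 1-month reporting period, the primary key is not used at all. Another index, "Time", is used instead (that index was designed for a different type of query):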
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| 1 | SIMPLE | h_cell | NULL | range | PRIMARY,Time | Time | 3 | NULL | 1056817 | 100.00 | Using index condition |
| 1 | SIMPLE | clusters_cust | NULL | eq_ref | PRIMARY,Cell | PRIMARY | 368 | const,ee_4g_hua.h_cell.Cell | 1 | 100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
For an even larger cluster of 20,000 cells and a 2-month reporting period, the primary key is again not used, and the "Time" index is used instead:
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| 1 | SIMPLE | h_cell | NULL | range | PRIMARY,Time | Time | 3 | NULL | 2080777 | 100.00 | Using index condition |
| 1 | SIMPLE | clusters_cust | NULL | eq_ref | PRIMARY,Cell | PRIMARY | 368 | const,ee_4g_hua.h_cell.Cell | 1 | 100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
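What should I change in my query or table design so that the optimizer can use both the "Cell" and "Time" parts of the primary key index? Is that possible, or should there be another, more efficient index?
EDIT:
The first variant runs a subquery in the WHERE clause to fetch the required list of cells and then applies the IN operator to its result.
SELECT Time,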
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
WHERE cell IN (SELECT cell FROM clusters_cust WHERE cluster='cluster20k') AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time
EXPLAIN shows the same plan as if there were a JOIN, and the "Time" index is again used instead of the PRIMARY KEY:
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| 1 | SIMPLE | h_cell | NULL | range | PRIMARY,Time | Time | 3 | NULL | 2080777 | 100.00 | Using index condition |
| 1 | SIMPLE | clusters_cust | NULL | eq_ref | PRIMARY,Cell | PRIMARY | 368 | const,ee_4g_hua.h_cell.Cell | 1 | 100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
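The second variant also has no JOIN; a subquery fetches the required list of cells directly from a pre-filtered table, "clust". The IN operator is used again.
SELECT Time,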
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
WHERE cell IN (SELECT * FROM clust) and Time>='2018-05-01' and Time<='2018-06-30'
GROUP BY Time
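The EXPLAIN output is below. This time the PRIMARY KEY index is used, but only its first column, 'cell', rather than both columns, 'cell' and 'Time'. The execution time was so long that I could not wait for the query to finish.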
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | h_cell | NULL | ref | PRIMARY,Time | PRIMARY | 32 | <subquery2>.cell | 671 | 5.32 | Using index condition |
| 2 | MATERIALIZED | clust | NULL | ALL | NULL | NULL | NULL | NULL | 20000 | 100.00 | NULL |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
Since the alternatives did not work, I suggest this one.
On your h_cell table, have an index on (Time, Cell).
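A minimal sketch of adding such an index follows. The index name idx_time_cell is hypothetical; pick whatever fits your naming scheme. On a MyISAM table of this size the statement can take a long time to complete, so it is probably best tried on a copy first.

```sql
-- Add a composite index with Time as the leading column and Cell second.
-- idx_time_cell is an illustrative name, not something from the original schema.
ALTER TABLE h_cell
    ADD INDEX idx_time_cell (`Time`, `Cell`);

-- Confirm that the new index is listed.
SHOW INDEX FROM h_cell;
```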
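For the query, I also added the STRAIGHT_JOIN keyword. I also qualified every column with its table alias, to make it easier to track which column comes from which table.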
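Now, even by your own description of the data, 51-57k records per day over about 60 days still means reading and aggregating roughly 3 million records. Since you only care about "cluster62" in this example, I assume the count is smaller. Also, for query clarity, you don't need backtick characters around every column. table.column or alias.column gives the engine an explicit qualification that prevents ambiguity about where the data comes from; for example, TIME may be a reserved word, but h.time is clearly a column in h (the alias for the h_cell table).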
SELECT STRAIGHT_JOIN
cc.cluster,
h.Time,
ROUND( SUM( h.counter1 ) / SUM( h.counter1 + h.counter2 ) * 100,3) AS KPI1,
SUM( h.counter1 ) AS KPI2,
.......
SUM( h.counterN ) AS KPI840
FROM
h_cell h
INNER JOIN clusters_cust cc
ON cc.cluster = 'cluster62'
AND h.cell = cc.cell
WHERE
h.Time >= '2018-05-01'
AND h.Time <= '2018-06-30'
GROUP BY
h.Time
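To see whether the optimizer actually picks up an index on (Time, Cell), it may help to run EXPLAIN on a cut-down version of the query. The sketch below keeps a single KPI so it stays short, and assumes the index has already been created; the plan you get will depend on your data and statistics, so treat it as a check rather than a guarantee.

```sql
-- Cut-down version of the query above, for inspecting the plan only.
EXPLAIN
SELECT STRAIGHT_JOIN
    cc.cluster,
    h.Time,
    SUM( h.counter1 ) AS KPI2
FROM
    h_cell h
        INNER JOIN clusters_cust cc
            ON cc.cluster = 'cluster62'
            AND h.cell = cc.cell
WHERE
        h.Time >= '2018-05-01'
    AND h.Time <= '2018-06-30'
GROUP BY
    h.Time;
```

In the output, the key and key_len columns of the h_cell row show which index was chosen and how many bytes of it are used.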