浅析MySQL 8.0直方图原理
本文分享自华为云社区《【MySQL技术专栏】MySQL8.0直方图介绍》,作者:GaussDB 数据库。
背景
数据库查询优化器负责将SQL查询转换为尽可能高效的执行计划,但因为数据环境不断变化导致优化器对查询数据了解的不够充足,可能无法生成最优的执行计划进而影响查询效率,因此MySQL8.0推出了直方图(histogram)功能来解决该问题。
直方图用于统计字段值的分布情况,向优化器提供统计信息。利用直方图,可以对一张表的一列数据做分布统计,估算where条件中过滤字段的选择率,从而帮助优化器更准确地估计查询过程中的行数,选择更高效的查询计划。
本文将对直方图概念进行介绍,借助举例描述直方图的使用方式,对创建/删除直方图的原理进行浅析,并通过例子说明其应用场景。
MySQL8.0直方图介绍
数据库中,查询优化器所生成执行计划的好坏关乎执行耗时的多少,优化器若是不清楚表中数据的分布情况,可能会导致无法生成最优的执行计划,造成执行时浪费时间。
假设一条SQL语句要查询相等间隔的两个不同时间段内出行的人数,若不知道每个时间段内的人数,优化器会假设人数在两个不同时间段内是均匀分布的。如果两个时间段内人数相差较大,这样优化器估算的统计数据就出现严重偏差,从而可能选择错误的执行计划。那么,如何使优化器比较清楚地知道数据统计情况进而生成好的执行计划呢?
一种解决方法就是,在列上建立直方图,从而近似地获取一列上的数据分布情况。利用好直方图,将会带来很多方面收益:
(1)查询优化:提供关于数据分布的统计信息,帮助优化查询计划,选择合适的索引和优化查询语句,从而提高查询性能;
(2)索引设计:通过分析数据的分布情况,帮助确定哪些列适合创建索引,以提高查询效率;
(3)数据分析:提供数据的分布情况,帮助用户了解数据的特征和趋势。
直方图分为两类:等宽直方图(singleton)和等高直方图(equi-height)。等宽直方图是每个桶保存一个值以及这个值累积频率:
SCHEMA_NAME: xxx//库名
TABLE_NAME: xxx//表名
COLUMN_NAME: xxx//列名
HISTOGRAM: {
"buckets":[
[
xxx, //桶中数值
xxx //取值频率
],
......
],
"data-type":"xxx", //数据类型
"null-values":xxx, //是否有NULL值
"collation-id":xxx,
"last-updated":"xxxx-xx-xx xx:xx:xx.xxxxxx", //更新时间
"sampling-rate":xxx, //采样率,1表示采集所有数据
"histogram-type":"singleton", //桶类型,等宽
"number-of-buckets-specified":xxx //桶数量
}
等高直方图每个桶需要保存不同值的个数,上下限以及累积频率等:
SCHEMA_NAME: xxx
TABLE_NAME: xxx
COLUMN_NAME: xxx
HISTOGRAM: {
"buckets":[
[
xxx, //最小值
xxx, //最大值
xxx, //桶值出现的频率
xxx //桶值出现的次数
],
......
],
"data-type":"xxx",
"null-values":xxx,
"collation-id":xxx,
"last-updated":"xxxx-xx-xx xx:xx:xx.xxxxxx",
"sampling-rate":xxx,
"histogram-type":"equi-height", //桶类型,等高
"number-of-buckets-specified":xxx
}
MySQL8.0直方图使用方式
创建和删除直方图时涉及analyze语句,常用语法格式为:
创建直方图:
ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] ... [WITH N BUCKETS]
删除直方图:
ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name] ...
具体示例:
mysql> create table t1(c1 int,c2 int,c3 int,c4 int,c5 int,c6 int,c7 int,c8 int,c9 int,c10 int,c11 int,c12 int,c13 datetime,c14 int,c15 int,c16 int,primary key(c1));
Query OK, 0 rows affected (0.01 sec)
mysql> insert into t1 values(1,2,3,4,5,6,7,8,9,10,11,12,'0000-01-01',14,15,16),(2,2,3,4,5,6,7,8,9,10,11,12,'0500-01-01',14,15,16),(3,2,3,4,5,6,7,8,9,10,11,12,'1000-01-01',14,15,16),(4,2,3,4,5,6,7,8,9,10,11,12,'1500-01-01',14,15,16),(5,2,3,4,5,6,7,8,9,10,11,12,'1500-01-01',14,15,16);
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0
创建直方图:
mysql> analyze table t1 update histogram on c13;
+---------+-----------+----------+------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------+-----------+----------+------------------------------------------------+
| test.t1 | histogram | status | Histogram statistics created for column 'c13'. |
+---------+-----------+----------+------------------------------------------------+
1 row in set (0.01 sec)
查看直方图信息:
mysql> select json_pretty(histogram)result from information_schema.column_statistics where table_name = 't1' and column_name = 'c13'\G
*************************** 1. row ***************************
result: {
"buckets": [
[
"0000-01-01 00:00:00.000000", //统计的列值
0.2 //统计的相对频率,下同
],
[
"0500-01-01 00:00:00.000000",
0.4
],
[
"1000-01-01 00:00:00.000000",
0.6
],
[
"1500-01-01 00:00:00.000000",
1.0
]
],
"data-type": "datetime", //统计的数据类型
"null-values": 0.0, //NULL值的比例
"collation-id": 8, //直方图数据的排序规则ID
"last-updated": "2023-09-30 16:05:28.533732", //最近更新直方图的时间
"sampling-rate": 1.0, //直方图构建采样率
"histogram-type": "singleton", //直方图类型,等宽
"number-of-buckets-specified": 100 //桶数量
}
1 row in set (0.00 sec)
删除直方图:
mysql> analyze table t1 drop histogram on c13;
+---------+-----------+----------+------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------+-----------+----------+------------------------------------------------+
| test.t1 | histogram | status | Histogram statistics removed for column 'c13'. |
+---------+-----------+----------+------------------------------------------------+
1 row in set (0.00 sec)
MySQL8.0直方图原理浅析
直方图原理整体框架可概括为下图所示:
直方图代码主要包含在sql/histograms路径下,带有equi_height前缀的相关文件涉及等高直方图,带有singleton前缀的相关文件涉及等宽直方图,带有value_map前缀的相关文件涉及保存统计值结构,histogram.h/histogram.cc涉及直方图相关调用接口。
Sql_cmd_analyze_table::handle_histogram_command为对直方图操作的整体处理入口,目前只支持在一张表上进行直方图相关操作。创建直方图的主要调用堆栈如下所示,update_histogram为创建直方图的入口。
mysql_execute_command
->Sql_cmd_analyze_table::execute
->Sql_cmd_analyze_table::handle_histogram_command
->Sql_cmd_analyze_table::update_histogram
->histograms::update_histogram
->prepare_value_maps
->fill_value_maps
->build_histogram
->store_histogram
->dd::cache::Dictionary_client::update
->dd::cache::Storage_adapter::store
->dd::Column_statistics_impl::store_attributes
->histograms::Singleton<xxx>::histogram_to_json
对于创建流程展开描述,prepare_value_maps中主要根据直方图列类型创建对应的value_map做准备,之后利用histogram_generation_max_mem_size参数值(限制生成直方图时所允许使用的最大内存大小)和单行数据大小计算后控制统计采样率,fill_value_maps将反复读取数据填充到对应类型的value_map中,key为列实际值,value为其出现的次数。调用build_histogram以完成对直方图的构建,如果桶个数(num_buckets)比不同值个数(value_map.size())要大,则自动创建一个等宽直方图,否则创建一个等高直方图。两种直方图的创建逻辑分别在Singleton<T>:: build_histogram和Equi_height<T>:: build_histogram中。
构建直方图完成后调用store_histogram,将结果以JSON的形式存储在系统表中,通过INFORMATION_SCHEMA.COLUMN_STATISTICS对用户呈现,histogram_to_json会将直方图结果转换为Json_object格式,例如last-updated使用Json_datetime格式保存、histogram-type使用Json_string格式保存、sampling rate使用Json_double格式保存等,再依次调用json_object->add_clone将各json类型字段保存。
删除直方图的主要堆栈如下所示。drop_histograms逻辑中在删除直方图前会先尝试获取以检查对应直方图是否真的存在,不存在的话就提前终止逻辑,存在则删除。
mysql_execute_command
->Sql_cmd_analyze_table::execute
->Sql_cmd_analyze_table::handle_histogram_command
->Sql_cmd_analyze_table::update_histogram
->histograms::update_histogram
MySQL8.0直方图优化场景
优化方面,如本文在前所描述的直方图作用,利用直方图信息估算where条件中各谓词的选择率,帮助选择最优的执行计划。例如,表存在如下所示数据倾斜场景。
mysql> select sys_id,order_status,count(*) from my_table_1 group by sys_id,order_status order by 1,2,3;
+--------+--------------+----------+
| sys_id | order_status | count(*) |
+--------+--------------+----------+
| 3 | 1 | 1 |
| 3 | 2 | 200766 |
| 3 | 3 | 3353 |
| 3 | 4 | 1325 |
| 5 | 1 | 13 |
| 5 | 2 | 2478373 |
| 5 | 3 | 43243 |
| 5 | 4 | 13529 |
| 6 | 2 | 171388 |
| 6 | 3 | 254 |
| 6 | 4 | 716 |
+--------+--------------+----------+
执行如下SQL语句时,因为存在数据倾斜而优化器未能准确估计导致执行计划选择错误,执行耗时约为1.35s。
mysql> explain analyze select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (1) and t1.create_time >= '2022-09-10 00:00:00' and t1.create_time <= '2022-09-16 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=4163.10 rows=20) (actual time=1350.825..1350.825 rows=0 loops=1)
-> Nested loop left join (cost=4163.10 rows=49) (actual time=1350.825..1350.825 rows=0 loops=1)
-> Filter: ((t1.order_status = 1) and (t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2022-09-10 00:00:00') and (t1.create_time <= TIMESTAMP'2022-09-16 23:59:59')) (cost=215.79 rows=49) (actual time=1350.823..1350.823 rows=0 loops=1)
-> Index scan on t1 using PRIMARY (reverse) (cost=215.79 rows=8828) (actual time=0.088..1209.201 rows=2910194 loops=1)
-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.63 rows=1) (never executed)
通过执行ANALYZE table my_table_1 UPDATE HISTOGRAM ON order_status, sys_id, create_time语句创建直方图后,再次执行上述SQL语句时,执行计划中的索引发生了变化,执行耗时为0.11s。因此可以看出,优化器利用更准确的数据分布信息选择了更优的执行计划。
mysql> explain analyze select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (1) and t1.create_time >= '2022-09-10 00:00:00' and t1.create_time <= '2022-09-16 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=38385.46 rows=20) (actual time=114.217..114.217 rows=0 loops=1)
-> Nested loop left join (cost=38385.46 rows=62764) (actual time=114.216..114.216 rows=0 loops=1)
-> Sort: t1.id DESC, limit input to 20 row(s) per chunk (cost=28200.86 rows=62668) (actual time=114.215..114.215 rows=0 loops=1)
-> Filter: (t1.order_status = 1) (cost=28200.86 rows=62668) (actual time=114.207..114.207 rows=0 loops=1)
-> Index range scan on t1 using idx_sys_id_create_time, with index condition: ((t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2022-09-10 00:00:00') and (t1.create_time <= TIMESTAMP'2022-09-16 23:59:59')) (cost=28200.86 rows=62668) (actual time=0.326..112.912 rows=31142 loops=1)
-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.62 rows=1) (never executed)
另外,当where条件中变量值不同时,优化器也根据数据分布情况选择了准确的执行计划,使得执行效率提高。
mysql> explain format=tree select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (2) and t1.create_time >= '2020-10-01 00:00:00' and t1.create_time <= '2020-10-09 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=13541.27 rows=20)
-> Nested loop left join (cost=13541.27 rows=44)
-> Filter: ((t1.order_status = 2) and (t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2020-10-01 00:00:00') and (t1.create_time <= TIMESTAMP'2020-10-09 23:59:59')) (cost=15.79 rows=44)
-> Index scan on t1 using PRIMARY (reverse) (cost=15.79 rows=338)
-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.25 rows=1)
1 row in set (0.00 sec)
mysql> explain format=tree select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (4) and t1.create_time >= '2020-10-01 00:00:00' and t1.create_time <= '2020-10-09 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=30559.31 rows=20)
-> Nested loop left join (cost=30559.31 rows=55852)
-> Sort: t1.id DESC, limit input to 20 row(s) per chunk (cost=24966.26 rows=55480)
-> Filter: (t1.order_status = 4) (cost=24966.26 rows=55480)
-> Index range scan on t1 using idx_sys_id_create_time, with index condition: ((t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2020-10-01 00:00:00') and (t1.create_time <= TIMESTAMP'2020-10-09 23:59:59')) (cost=24966.26 rows=55480)
-> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.25 rows=1)
1 row in set (0.00 sec)
所以,通过所提供的统计信息,帮助优化查询计划进而提高查询性能是如前所述应用直方图的一个收益点。