MySQL binlog超过binlog_expire_logs_seconds阈值没有删除案例
生产环境有一套3个节点的MySQL InnoDB Cluster,MySQL的版本为Server version: 8.0.35 MySQL Community Server - GPL, 早上突然收到Zabbix的告警,其中一个节点出现空间告警:"/data: Disk space is low (used > 80%)"
检查分析后发现是因为MySQL的binlog没有清理导致空间报警,如下所示(binlog太多,省略了部分内容)
$ du -sh *
78G bin_logs
12K conf
7.3G data
132K logs
4.0K mysql.pid
0 mysql_temp
4.0K slow_logs
$ cd bin_logs
$ ls -lrt
total 81783492
-rw-r----- 1 mysql mysql 909802606 Jan 9 00:10 mysql_binlog.000011
-rw-r----- 1 mysql mysql 984228474 Jan 10 00:10 mysql_binlog.000012
-rw-r----- 1 mysql mysql 980302993 Jan 11 00:10 mysql_binlog.000013
-rw-r----- 1 mysql mysql 985147519 Jan 12 00:10 mysql_binlog.000014
-rw-r----- 1 mysql mysql 962232624 Jan 13 00:10 mysql_binlog.000015
....................................................
....................................................
-rw-r----- 1 mysql mysql 891703862 Apr 7 00:10 mysql_binlog.000101
-rw-r----- 1 mysql mysql 920700543 Apr 8 00:10 mysql_binlog.000102
-rw-r----- 1 mysql mysql 921675101 Apr 9 00:10 mysql_binlog.000103
-rw-r----- 1 mysql mysql 3854 Apr 9 00:10 mysql_binlog.index
-rw-r----- 1 mysql mysql 330341220 Apr 9 09:30 mysql_binlog.000104
检查三个节点(姑且用node1, node2, node3表示),发现其它两个节点(node1,node2)的binlog都正常清理了,而唯独节点node3的binlog一直没有被清理。三个节点的MySQL系统变量是一致的,如下所示,
mysql> show variables like 'binlog_expire_logs_seconds';
+----------------------------+---------+
| Variable_name | Value |
+----------------------------+---------+
| binlog_expire_logs_seconds | 1209600 |
+----------------------------+---------+
1 row in set (0.00 sec)
mysql> show variables like 'expire_logs_days';
+------------------+-------+
| Variable_name | Value |
+------------------+-------+
| expire_logs_days | 0 |
+------------------+-------+
1 row in set (0.00 sec)
mysql>
binlog_expire_logs_seconds设置为1209600,也就是14天,超过14天的binlog应该被清理掉,但是检查发现node3上超过14天的binlog并没有被清理。那么为什么节点node3上的过期的binlog没有被清理呢?
首先,这三个节点的MySQL数据库版本是一致的,出现问题的节点node3是MySQL InnoDB集群的备库,我们在这个节点上做Xtrabackup备份。
其次,我们检查MySQL错误日志,其实每天作业做Xtrabackup备份时,都会收到类似下面告警邮件
2024-04-09T00:10:03.078285+08:00 357726 [Warning] [MY-013712] [Server] No suitable 'keyring_component_metadata_query' service implementation found to fulfill the request.
2024-04-09T00:10:12.066464+08:00 357726 [Warning] [MY-014054] [Server] Could not purge binary logs since another session is executing LOCK INSTANCE FOR BACKUP. Wait for that session to release the lock.
出现这个告警信息,之前也查过,官方文档提示,当对MySQL实例发出LOCK INSTANCE FOR BACKUP语句期间,PURGE BINARY LOGS是不允许的,因为它会从服务器中删除文件(删除binlog文件),违反了备份锁(backup lock)的规则。从MySQL 8.0.28开始,这个是不允许的。具体的官方文档[1]信息如下所示:
PURGE BINARY LOGS should not be issued while a LOCK INSTANCE FOR BACKUP statement is in effect for the instance, because it contravenes the rules of the backup lock by removing files from the server. From MySQL 8.0.28, this is disallowed.
因为Xtrabackup备份时,它会执行FLUSH NO_WRITE_TO_BINLOG BINARY LOGS命令,它会轮换二进制日志(rotate the binlog),这实质上将创建一个新的二进制日志,而在 轮换二进制日志时,就会触发PURGE BINARY LOGS,而由于当前MySQL版本为MySQL 8.0.35,在MySQL进行Xtrabackup时,由于LOCK INSTANCE FOR BACKUP锁,导致PURGE BINARY LOGS是不允许的。所以Xtrabackup期间,我们会收到这个告警,即使binlog出现了轮换,但是也不会触发清理过期的binlog。之前分析过这个告警,但是觉得它没有产生什么异常或问题,就忽略了这个告警。
另外,我们要弄清楚清理过期binlog的原理机制,即使我们设置了系统变量binlog_expire_logs_seconds,不是说一旦binlog的创建时间超过了系统变的binlog_expire_logs_seconds的阈值,就会立即触发MySQL的相关线程去清理过期的binlog。清理过期的binlog,一般发生在MySQL启动时或刷新二进制日志时进行。官方文档[2]的描述如下:
Binary log files are automatically removed after the server's binary log expiration period. Removal of the files can take place at startup and when the binary log is flushed.
我检查系统变量max_binlog_size的大小,如下所示,也就是1G,
mysql> show variables like 'max_binlog_size';
+-----------------+------------+
| Variable_name | Value |
+-----------------+------------+
| max_binlog_size | 1073741824 |
+-----------------+------------+
1 row in set (0.01 sec)
mysql>
由于这个系统一天产生的binlog小于1G(可以从上面的binlog的输出信息看出,例如,binlog的大小,binlog的创建时间),也就是说,一天24小时内,MySQL的binlog不会由于binlog过大而触发binlog的切换,也就是说只有Xtrabackup时才触发了binlog的切换,而又由于限制问题,导致过期的binlog不能被清理,久而久之,过期的binlog一直不能清理,直到触发空间告警,问题才被暴露出来。
而其它几套MySQL InnoDB Cluster也有这个问题,但是它们的过期的binlog为什么被清理了呢? 因为它们一天产生的binlog的大小,超过了系统变量max_binlog_size的大小,在其它时间点会轮换/切换binlog,从而触发了PURGE BINARY LOGS。所以就没有遇到这样的问题。
参考资料
1: https://dev.mysql.com/doc/refman/8.0/en/lock-instance-for-backup.html
[2]2: https://dev.mysql.com/doc/refman/8.0/en/purge-binary-logs.html