MHA ships with two important check tools, used respectively to verify inter-host SSH communication and the health of master-slave replication.

masterha_check_ssh
Verifies that SSH communication between the hosts works; it is also invoked automatically when the masterha_manager and masterha_check_repl scripts start.

masterha_check_repl
Verifies the health of master-slave replication, and is the subject of this post.

The Manager accesses the node instances over SSH.
Slave nodes copy relay logs between one another over SSH.
(This is why we had to establish trust relationships between the hosts during installation.)
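The non-interactive probe behind these trust checks can be sketched as follows. This is an illustrative Python sketch, not MHA's code; the ssh options mirror the ones visible in the `ps` trace later in this post.

```python
import subprocess

# Same non-interactive options seen in the ps trace of check_repl's ssh probes.
SSH_OPTS = ["-o", "StrictHostKeyChecking=no",
            "-o", "PasswordAuthentication=no",
            "-o", "BatchMode=yes",
            "-o", "ConnectTimeout=5"]

def ssh_check_cmd(user, host, port=22):
    """Build the non-interactive ssh probe command for one host."""
    return ["ssh"] + SSH_OPTS + ["-p", str(port), f"{user}@{host}", "exit", "0"]

def host_reachable(user, host):
    """True if key-based ssh to the host succeeds without a password prompt."""
    return subprocess.run(ssh_check_cmd(user, host)).returncode == 0
```

With BatchMode=yes, a host missing the trust relationship fails immediately instead of hanging on a password prompt, which is what makes the pairwise checks practical.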

Environment

manager 10.20.64.209
node1 10.20.64.202
node2 10.20.64.203
node3 10.20.64.204
node4 10.20.64.210

Here we use 1 master and 3 slaves, 4 nodes in total. Most write-ups online go with 3 nodes, but in real work we are likely to build n slaves to spread read load, so let's make it more realistic and see how MHA behaves with more than 3 nodes.

PS: we also put the Manager on a separate host, unlike the official docs and most online guides, which double up one of the nodes as the Manager. This is part of the planning too: this one Manager host can manage n clusters (and of course you can set up HA for it as well).

The repl check

[mha@n-op-209 etc]$ masterha_check_repl --conf=/usr/local/mha-manager/etc/test.conf
Fri Mar 25 12:13:19 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Fri Mar 25 12:13:19 2016 - [info] Reading application default configuration from /usr/local/mha-manager/etc/test.conf..
Fri Mar 25 12:13:19 2016 - [info] Reading server configuration from /usr/local/mha-manager/etc/test.conf..
Fri Mar 25 12:13:19 2016 - [info] MHA::MasterMonitor version 0.56.
The MHA Monitor version.
Fri Mar 25 12:13:20 2016 - [info] GTID failover mode = 0
The MySQL GTID setting; ours is disabled.
Fri Mar 25 12:13:20 2016 - [info] Dead Servers:
The list of dead hosts (none here, which is good).
Fri Mar 25 12:13:20 2016 - [info] Alive Servers:
Fri Mar 25 12:13:20 2016 - [info] 10.20.64.202(10.20.64.202:3306)
Fri Mar 25 12:13:20 2016 - [info] 10.20.64.203(10.20.64.203:3306)
Fri Mar 25 12:13:20 2016 - [info] 10.20.64.204(10.20.64.204:3306)
Fri Mar 25 12:13:20 2016 - [info] 10.20.64.210(10.20.64.210:3306)
The list of alive hosts; sure enough, all 4 of our nodes are alive.
Fri Mar 25 12:13:20 2016 - [info] Alive Slaves:
Details of the slave nodes follow.
Fri Mar 25 12:13:20 2016 - [info] 10.20.64.203(10.20.64.203:3306) Version=5.6.27-76.0-log (oldest major version between slaves) log-bin:enabled
Checks whether the binlog is enabled.
Fri Mar 25 12:13:20 2016 - [info] Replicating from 10.20.64.202(10.20.64.202:3306)
Checks which node this slave replicates from.
Fri Mar 25 12:13:20 2016 - [info] Primary candidate for the new Master (candidate_master is set)
The first candidate for the new master.
Fri Mar 25 12:13:20 2016 - [info] 10.20.64.204(10.20.64.204:3306) Version=5.6.27-76.0-log (oldest major version between slaves) log-bin:enabled
Fri Mar 25 12:13:20 2016 - [info] Replicating from 10.20.64.202(10.20.64.202:3306)
Fri Mar 25 12:13:20 2016 - [info] Primary candidate for the new Master (candidate_master is set)
The second candidate for the new master.
(It competes for the new-master role on an equal footing with the first; the order of the checks is irrelevant.)
Fri Mar 25 12:13:20 2016 - [info] 10.20.64.210(10.20.64.210:3306) Version=5.6.27-76.0-log (oldest major version between slaves) log-bin:enabled
Fri Mar 25 12:13:20 2016 - [info] Replicating from 10.20.64.202(10.20.64.202:3306)
Fri Mar 25 12:13:20 2016 - [info] Not candidate for the new Master (no_master is set)
A non-candidate; it will never become the new master.
Fri Mar 25 12:13:20 2016 - [info] Current Alive Master: 10.20.64.202(10.20.64.202:3306)
Identifies the current master.
Fri Mar 25 12:13:20 2016 - [info] Checking slave configurations..
Connects to each slave instance and checks its configuration.
Fri Mar 25 12:13:20 2016 - [info] read_only=1 is not set on slave 10.20.64.203(10.20.64.203:3306).
MHA checks the read_only parameter here; if a slave does not use it to block external writes, MHA prints an info line.
(If read_only=1, MHA prints nothing about read only.)
Fri Mar 25 12:13:20 2016 - [warning] relay_log_purge=0 is not set on slave 10.20.64.203(10.20.64.203:3306).
It also checks the relay_log_purge parameter; if a slave purges its relay logs automatically (relay_log_purge=1), MHA prints a warning.
(If relay_log_purge=0, MHA prints nothing about relay log purge.)
Fri Mar 25 12:13:20 2016 - [info] read_only=1 is not set on slave 10.20.64.204(10.20.64.204:3306).
Fri Mar 25 12:13:20 2016 - [warning] relay_log_purge=0 is not set on slave 10.20.64.204(10.20.64.204:3306).
Fri Mar 25 12:13:20 2016 - [info] read_only=1 is not set on slave 10.20.64.210(10.20.64.210:3306).
Fri Mar 25 12:13:20 2016 - [warning] relay_log_purge=0 is not set on slave 10.20.64.210(10.20.64.210:3306).
(Summary: if every slave has read_only=1 and relay_log_purge=0, this section prints nothing.)
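The behavior just summarized can be condensed into a small sketch. This is a hypothetical Python rendering of the rules described above, not MHA's actual Perl:

```python
# Sketch of the slave-settings messages: read_only != 1 yields an [info]
# line, relay_log_purge != 0 (auto-purge enabled) yields a [warning], and
# a slave with read_only=1 and relay_log_purge=0 produces no output at all.

def slave_settings_messages(host, read_only, relay_log_purge):
    """Return the log lines the check would print for one slave."""
    lines = []
    if read_only != 1:
        lines.append(f"[info] read_only=1 is not set on slave {host}.")
    if relay_log_purge != 0:
        lines.append(f"[warning] relay_log_purge=0 is not set on slave {host}.")
    return lines
```

For example, a slave with read_only=0 and relay_log_purge=1 yields both lines, while read_only=1 and relay_log_purge=0 yields an empty list, matching the summary above.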
Fri Mar 25 12:13:20 2016 - [info] Checking replication filtering settings..
Connects to each slave instance and checks its replication filtering settings.
Fri Mar 25 12:13:20 2016 - [info] binlog_do_db= , binlog_ignore_db=
Fri Mar 25 12:13:20 2016 - [info] Replication filtering check ok.
Fri Mar 25 12:13:20 2016 - [info] GTID (with auto-pos) is not supported
Checks GTID; our instances do not replicate with GTID, so it is reported as not supported.
Fri Mar 25 12:13:20 2016 - [info] Starting SSH connection tests..
Fri Mar 25 12:13:23 2016 - [info] All SSH connection tests passed successfully.
Here the pairwise SSH checks between all hosts begin; only overall success or failure is reported.
(You can also skip the SSH check with --skip_check_ssh.)
Fri Mar 25 12:13:23 2016 - [info] Checking MHA Node version..
Fri Mar 25 12:13:24 2016 - [info] Version check ok.
Checks the MHA Node version.
Fri Mar 25 12:13:24 2016 - [info] Checking SSH publickey authentication settings on the current master..
Fri Mar 25 12:13:24 2016 - [info] HealthCheck: SSH to 10.20.64.202 is reachable.
Checks SSH connectivity to the master.
Fri Mar 25 12:13:24 2016 - [info] Master MHA Node version is 0.56.
Shows the master's MHA Node version (an odd place in the output for it).
Fri Mar 25 12:13:24 2016 - [info] Checking recovery script configurations on 10.20.64.202(10.20.64.202:3306)..
Checks that the recovery script configuration works on the master.
Fri Mar 25 12:13:24 2016 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysqldata/3306/binlog --output_file=/usr/local/mha-node/apps/test/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000001
Runs the node script save_binary_logs in test mode to verify it works; its output follows.
Fri Mar 25 12:13:24 2016 - [info] Connecting to mha@10.20.64.202(10.20.64.202:22)..
Creating /usr/local/mha-node/apps/test if not exists.. ok.
Checks whether the application working directory exists; MHA creates it automatically if it does not.
Checking output directory is accessible or not..
ok.
Checks that the binlog output directory is accessible.
Binlog found at /data/mysqldata/3306/binlog, up to mysql-bin.000001
The specified binlog is found.
Fri Mar 25 12:13:24 2016 - [info] Binlog setting check done.
The binlog configuration check is complete at this point.
Fri Mar 25 12:13:24 2016 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Checks SSH connectivity to the slaves and the recovery script configuration on every alive slave; the commands below run against each slave in turn.
Fri Mar 25 12:13:24 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='mha' --slave_host=10.20.64.203 --slave_ip=10.20.64.203 --slave_port=3306 --workdir=/usr/local/mha-node/apps/test --target_version=5.6.27-76.0-log --manager_version=0.56 --client_bindir=/usr/local/mysql/bin --client_libdir=/usr/local/mysql/lib --relay_dir=/data/mysqldata/3306/binlog --current_relay_log=relay-bin.000003 --slave_pass=xxx
Runs the node script apply_diff_relay_logs in test mode to verify it works; its output follows.
Fri Mar 25 12:13:24 2016 - [info] Connecting to mha@10.20.64.203(10.20.64.203:22)..
Checking slave recovery environment settings..
Relay log found at /data/mysqldata/3306/binlog, up to relay-bin.000003
The specified relay log is found.
Temporary relay log file is /data/mysqldata/3306/binlog/relay-bin.000003
The current temporary relay log file name.
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Tests the MySQL connection and privileges; the warning just notes that supplying a password on the command line is insecure.
Testing mysqlbinlog output.. done.
Uses mysqlbinlog to verify that the relay log contents can be dumped correctly.
Cleaning up test file(s).. done.
Cleans up the temporary test files.
(node2's check, 10.20.64.203, is complete at this point.)
The remaining slaves follow the same pattern.
Fri Mar 25 12:13:25 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='mha' --slave_host=10.20.64.204 --slave_ip=10.20.64.204 --slave_port=3306 --workdir=/usr/local/mha-node/apps/test --target_version=5.6.27-76.0-log --manager_version=0.56 --client_bindir=/usr/local/mysql/bin --client_libdir=/usr/local/mysql/lib --relay_dir=/data/mysqldata/3306/binlog --current_relay_log=relay-bin.000002 --slave_pass=xxx
Fri Mar 25 12:13:25 2016 - [info] Connecting to mha@10.20.64.204(10.20.64.204:22)..
Checking slave recovery environment settings..
Relay log found at /data/mysqldata/3306/binlog, up to relay-bin.000002
Temporary relay log file is /data/mysqldata/3306/binlog/relay-bin.000002
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
(node3's check, 10.20.64.204, is complete at this point.)
Fri Mar 25 12:13:25 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='mha' --slave_host=10.20.64.210 --slave_ip=10.20.64.210 --slave_port=3306 --workdir=/usr/local/mha-node/apps/test --target_version=5.6.27-76.0-log --manager_version=0.56 --client_bindir=/usr/local/mysql/bin --client_libdir=/usr/local/mysql/lib --relay_dir=/data/mysqldata/3306/binlog --current_relay_log=relay-bin.000002 --slave_pass=xxx
Fri Mar 25 12:13:25 2016 - [info] Connecting to mha@10.20.64.210(10.20.64.210:22)..
Creating directory /usr/local/mha-node/apps/test.. done.
Checking slave recovery environment settings..
Relay log found at /data/mysqldata/3306/binlog, up to relay-bin.000002
Temporary relay log file is /data/mysqldata/3306/binlog/relay-bin.000002
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
(node4's check, 10.20.64.210, is complete at this point.)
Fri Mar 25 12:13:25 2016 - [info] Slaves settings check done.
The binlog and relay log configuration checks on the slaves are complete at this point.
Fri Mar 25 12:13:25 2016 - [info]
10.20.64.202(10.20.64.202:3306) (current master) (test master)
+--10.20.64.203(10.20.64.203:3306) (test standby)
+--10.20.64.204(10.20.64.204:3306) (test standby)
+--10.20.64.210(10.20.64.210:3306) (test slave)
Based on the configuration file, the topology is printed as a tree: 1 master, 2 candidate slaves, and 1 non-candidate slave.
Next come the replication health checks on the slaves.
Fri Mar 25 12:13:25 2016 - [info] Checking replication health on 10.20.64.203..
Fri Mar 25 12:13:25 2016 - [info] ok.
(Replication on node2, 10.20.64.203, is healthy.)
Fri Mar 25 12:13:25 2016 - [info] Checking replication health on 10.20.64.204..
Fri Mar 25 12:13:25 2016 - [info] ok.
(Replication on node3, 10.20.64.204, is healthy.)
Fri Mar 25 12:13:25 2016 - [info] Checking replication health on 10.20.64.210..
Fri Mar 25 12:13:25 2016 - [info] ok.
(Replication on node4, 10.20.64.210, is healthy.)
Fri Mar 25 12:13:25 2016 - [warning] master_ip_failover_script is not defined.
Warning: master_ip_failover_script is not defined.
(We deliberately commented this line out in our configuration; a custom script will replace the stock one later.)
Fri Mar 25 12:13:25 2016 - [warning] shutdown_script is not defined.
Warning: shutdown_script is not defined.
(We deliberately commented this line out in our configuration; a custom script will replace the stock one later.)
Fri Mar 25 12:13:25 2016 - [info] Got exit code 0 (Not master dead).
The script returns exit code 0 (the master is not dead).
(PS: the stock output is simple and clear; 0 means no error.)
MySQL Replication Health is OK.
Replication is healthy.
The check is complete.

Next, let's stop replication on node2 (10.20.64.203) and see how check_repl behaves.

Current replication status: Slave_IO_Running: Yes, Slave_SQL_Running: Yes
Stop replication:
mysql> stop slave;
Status afterwards: Slave_IO_Running: No, Slave_SQL_Running: No

Run check_repl again

[mha@n-op-209 etc]$ masterha_check_repl --conf=/usr/local/mha-manager/etc/test.conf
Tue Apr 5 12:32:02 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Apr 5 12:32:02 2016 - [info] Reading application default configuration from /usr/local/mha-manager/etc/test.conf..
Tue Apr 5 12:32:02 2016 - [info] Reading server configuration from /usr/local/mha-manager/etc/test.conf..
Tue Apr 5 12:32:02 2016 - [info] MHA::MasterMonitor version 0.56.
Tue Apr 5 12:32:03 2016 - [warning] SQL Thread is stopped(no error) on 10.20.64.203(10.20.64.203:3306)
The very first warning: the SQL thread on node2 (10.20.64.203) is stopped (and its slave status shows no error).
...
The output below is the same as the previous run, up to
...
Tue Apr 5 12:32:08 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='mha' --slave_host=10.20.64.203 --slave_ip=10.20.64.203 --slave_port=3306 --workdir=/usr/local/mha-node/apps/test --target_version=5.6.27-76.0-log --manager_version=0.56 --client_bindir=/usr/local/mysql/bin --client_libdir=/usr/local/mysql/lib --relay_dir=/data/mysqldata/3306/binlog --current_relay_log=relay-bin.000019 --slave_pass=xxx
Tue Apr 5 12:32:08 2016 - [info] Connecting to mha@10.20.64.203(10.20.64.203:22)..
Checking slave recovery environment settings..
Relay log found at /data/mysqldata/3306/binlog, up to relay-bin.000019
Temporary relay log file is /data/mysqldata/3306/binlog/relay-bin.000019
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
No errors; it only prints the current slave's relay log information.
...
The rest is the same as the previous run, up to
...
Tue Apr 5 12:32:09 2016 - [info] Slaves settings check done.
Tue Apr 5 12:32:09 2016 - [info]
10.20.64.202(10.20.64.202:3306) (current master) (test master)
+--10.20.64.203(10.20.64.203:3306) (test standby)
+--10.20.64.204(10.20.64.204:3306) (test standby)
+--10.20.64.210(10.20.64.210:3306) (test slave)
This tree only reflects the MHA configuration file; don't let it mislead you.
Next come the replication health checks on the slaves.
Tue Apr 5 12:32:09 2016 - [info] Checking replication health on 10.20.64.203..
Tue Apr 5 12:32:09 2016 - [error][/usr/local/share/perl5/MHA/Server.pm, ln485] Slave IO thread is not running on 10.20.64.203(10.20.64.203:3306)
The IO thread is detected as stopped.
Tue Apr 5 12:32:09 2016 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln1526] failed!
It only prints: failed!
Tue Apr 5 12:32:09 2016 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/share/perl5/MHA/MasterMonitor.pm line 417.
An error occurred while checking configurations.
Tue Apr 5 12:32:09 2016 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
An error occurred while monitoring the servers.
Tue Apr 5 12:32:09 2016 - [info] Got exit code 1 (Not master dead).
The script returns exit code 1 (the master is not dead).
(PS: 1 means it exited with an error.)
Notice that once one replication problem appears, the remaining slaves are not checked.
MySQL Replication Health is NOT OK!
Replication is not healthy.
The check is complete.

We can see that the repl check reports its errors in two stages:

Server.pm, ln485: Slave IO thread is not running on 10.20.64.203(10.20.64.203:3306)
ServerManager.pm, ln1526: failed!

In the post on the SSH check we mentioned that it works by forking processes; now let's look at how the repl check is implemented.

MasterMonitor.pm, ln424: Error happened on checking configurations.
MasterMonitor.pm line 417
MasterMonitor.pm, ln523: Error happened on monitoring servers.

These are summary-level errors and can be ignored.

Let's look at what Server.pm ln485 and ServerManager.pm ln1526 are doing.

Server.pm
1 #!/usr/bin/env perl
...
30 use MHA::DBHelper;
...
66 sub check_slave_status($) {
67 my $self = shift;
68 my $dbhelper = $self->{dbhelper};
69 return $dbhelper->check_slave_status();
70 }
...
471 #Check whether slave is running and not delayed
472 sub has_replication_problem {
473 my $self = shift;
474 my $allow_delay_seconds = shift;
475 $allow_delay_seconds = 1 unless ($allow_delay_seconds);
476 my $log = $self->{logger};
477 my $dbhelper = $self->{dbhelper};
478 my %status = $dbhelper->check_slave_status();
479 if ( $status{Status} ne '0' ) {
480 $log->error(
481 sprintf( "Getting slave status failed on %s", $self->get_hostinfo() ) );
482 return 1;
483 }
484 elsif ( $status{Slave_IO_Running} ne "Yes" ) {
485 $log->error(
486 sprintf( "Slave IO thread is not running on %s", $self->get_hostinfo() )
487 );
488 return 2;
489 }
490 elsif ( $status{Slave_SQL_Running} ne "Yes" ) {
491 $log->error(
492 sprintf( "Slave SQL thread is not running on %s", $self->get_hostinfo() )
493 );
494 return 3;
495 }
496 elsif ( $status{Seconds_Behind_Master}
497 && $status{Seconds_Behind_Master} > $allow_delay_seconds )
498 {
499 $log->error(
500 sprintf(
501 "Slave is currently behind %d seconds on %s",
502 $status{Seconds_Behind_Master},
503 $self->get_hostinfo()
504 )
505 );
506 return 4;
507 }
508 elsif ( !defined( $status{Seconds_Behind_Master} ) ) {
509 $log->error(
510 sprintf( "Failed to get Seconds_Behind_Master on %s",
511 $self->get_hostinfo() )
512 );
513 return 5;
514 }
515 return 0;
516 }

MHA::DBHelper — MHA uses this module to fetch the slave's status information and print the result.
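The return-code ladder in has_replication_problem can be sketched in Python. This is an illustrative analogue of the Perl above, not MHA's code; the keys mimic what check_slave_status() returns (largely SHOW SLAVE STATUS fields):

```python
# Illustrative analogue of has_replication_problem (Server.pm ln472-516).
def has_replication_problem(status, allow_delay_seconds=1):
    if status.get("Status") != "0":
        return 1  # getting slave status failed
    if status.get("Slave_IO_Running") != "Yes":
        return 2  # IO thread not running -> the ln485 error in the log
    if status.get("Slave_SQL_Running") != "Yes":
        return 3  # SQL thread not running
    sbm = status.get("Seconds_Behind_Master")
    if sbm and sbm > allow_delay_seconds:
        return 4  # slave lagging beyond the allowed delay
    if sbm is None:
        return 5  # failed to read Seconds_Behind_Master
    return 0      # healthy
```

Any non-zero return makes check_replication_health log " failed!" at ServerManager.pm ln1526 and croak, which is the two-stage error reporting seen in the log above.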

ServerManager.pm
1 #!/usr/bin/env perl
...
31 use Parallel::ForkManager;
...
90 sub get_alive_slaves($) {
91 my $self = shift;
92 return @{ $self->{alive_slaves} };
93 }
...
260 my $connection_checker = new Parallel::ForkManager( $#servers + 1 );
...
1513 sub check_replication_health {
1514 my $self = shift;
1515 my $allow_delay_seconds = shift;
1516 $allow_delay_seconds = 1 unless ($allow_delay_seconds);
1517 my $log = $self->{logger};
1518 my @alive_slaves = $self->get_alive_slaves();
1519 foreach my $target (@alive_slaves) {
1520 $log->info("Checking replication health on $target->{hostname}..");
1521 if ( !$target->current_slave_position() ) {
1522 $log->error("Getting slave status failed!");
1523 croak;
1524 }
1525 if ( $target->has_replication_problem($allow_delay_seconds) ) {
1526 $log->error(" failed!");
1527 croak;
1528 }
1529 else {
1530 $log->info(" ok.");
1531 }
1532 }
1533 }

Parallel::ForkManager — this module achieves parallelism by forking processes rather than creating threads.

As in the SSH check, processes are forked here to check replication health.

Next, let's trace the intermediate processes while the script runs.

[mha@n-op-209 root]$ ps -ef|grep check
mha 25696 25324 15 15:32 pts/0 00:00:00 perl /usr/local/mha-manager/bin/masterha_check_repl --conf=/usr/local/mha-manager/etc/test.conf
mha 25706 25696 0 15:32 pts/0 00:00:00 perl /usr/local/mha-manager/bin/masterha_check_repl --conf=/usr/local/mha-manager/etc/test.conf
mha 25709 25706 0 15:32 pts/0 00:00:00 sh -c ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5 -p 22 mha@10.20.64.202 "ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5 -p 22 mha@10.20.64.204 exit 0" >> /usr/local/mha-manager/apps/test/10.20.64.202_22_ssh_check.log 2>&1
[mha@n-op-209 root]$ ps -ef|grep check
mha 25696 25324 15 15:32 pts/0 00:00:00 perl /usr/local/mha-manager/bin/masterha_check_repl --conf=/usr/local/mha-manager/etc/test.conf
mha 25715 25696 0 15:32 pts/0 00:00:00 perl /usr/local/mha-manager/bin/masterha_check_repl --conf=/usr/local/mha-manager/etc/test.conf
mha 25718 25715 0 15:32 pts/0 00:00:00 sh -c ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5 -p 22 mha@10.20.64.203 "ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5 -p 22 mha@10.20.64.204 exit 0" >> /usr/local/mha-manager/apps/test/10.20.64.203_22_ssh_check.log 2>&1

25696 is masterha_check_repl.
25706 and 25715 are processes forked from 25696.
(This differs from check ssh: check ssh keeps a single forked process managing its children throughout, while the repl check starts a new fork for each slave check and closes it when done.)
25709 and 25718 are forked in turn by 25706 and 25715 to do the actual work.

Er... that's right, these are all SSH checks.

For the parameter details and sample logs, see the previous check_ssh post.

So what is the repl check itself doing?

mysql> select * from PROCESSLIST where USER='mha'\G
*************************** 1. row ***************************
ID: 141
USER: mha
HOST: 10.20.64.209:36199
DB: NULL
COMMAND: Sleep
TIME: 5
STATE:
INFO: NULL
TIME_MS: 5161
ROWS_SENT: 0
ROWS_EXAMINED: 0
TID: 15498
1 row in set (0.00 sec)

MHA's session sleeps inside MySQL for 5 seconds, with TID: 15498.
Looking up that TID:

[root@n-op-210 ~]# ps -efL |grep 15498
mysql 12773 11617 15498 0 26 Mar25 ? 00:00:00 /data/percona-mysql5627/bin/mysqld --defaults-file=/data/mysqldata/3306/etc/my.cnf --basedir=/usr/local/mysql --datadir=/data/mysqldata/3306/data --plugin-dir=/usr/local/mysql/lib/mysql/plugin --user=mysql --log-error=/data/mysqldata/3306/mysql-error.log --open-files-limit=65535 --pid-file=/data/mysqldata/3306/mysql.pid --socket=/data/mysqldata/3306/mysql.sock --port=3306

It turns out this TID belongs to a thread of node4's own mysqld, and it stays the same every time masterha_check_repl runs.
(A bit dizzying? My test machine has the thread pool enabled, which is why these threads appear.)

Let's enable the general_log and take a look.

Time Id Command Argument
11:13:28 140 Connect mha@10.20.64.209 on
140 Query set autocommit=1
140 Query SELECT CONNECTION_ID() AS Value
11:13:29 141 Connect mha@10.20.64.209 on
141 Query set autocommit=1
141 Query SELECT CONNECTION_ID() AS Value
141 Query SET wait_timeout=86400
141 Query SELECT @@global.server_id As Value
141 Query SELECT VERSION() AS Value
141 Query SELECT @@global.gtid_mode As Value
141 Query SHOW GLOBAL VARIABLES LIKE 'log_bin'
141 Query SHOW MASTER STATUS
141 Query SELECT @@global.datadir AS Value
141 Query SELECT @@global.slave_parallel_workers AS Value
141 Query SHOW SLAVE STATUS
141 Query SELECT @@global.read_only As Value
141 Query SELECT @@global.relay_log_purge As Value
141 Query SELECT @@global.relay_log_info_repository AS Value
141 Query SELECT Relay_log_name FROM mysql.slave_relay_log_info
141 Query SELECT @@global.datadir AS Value
141 Query SHOW SLAVE STATUS
Everything above is the MHA manager fetching status information; each query can be matched to lines in the check_repl log.
11:13:34 136 Query select * from PROCESSLIST where USER='mha'
What follows is performed by the node script apply_diff_relay_logs.
142 Connect mha@10.20.64.210 on
142 Query select @@version_comment limit 1
142 Query set sql_log_bin=0
Disables binlog for this session; nothing it does below is written to the binlog.
142 Query create table if not exists mysql.apply_diff_relay_logs_test(id int)
142 Query insert into mysql.apply_diff_relay_logs_test values(1)
142 Query update mysql.apply_diff_relay_logs_test set id=id+1 where id=1
142 Query delete from mysql.apply_diff_relay_logs_test
142 Query drop table mysql.apply_diff_relay_logs_test
142 Quit
It creates the test table apply_diff_relay_logs_test in the mysql schema and runs insert, update, delete, and drop against it.
This tests that the instance is writable and usable; then it quits.
141 Query SHOW SLAVE STATUS
141 Query SHOW SLAVE STATUS
141 Quit
Then it fetches the replication status locally once more and quits.

From this we can see that the forked processes are all SSH sessions: the manager hops to each node server and runs the node-local scripts there.

The repl check does considerably more than the ssh check.

What remains is to understand how the two node scripts, save_binary_logs and apply_diff_relay_logs, do their work; that goes in the next post.

Judging from the check_repl script, it performs two checks by default: SSH communication and master-slave replication.
Per the official documentation, the SSH check can be skipped (--skip_check_ssh),
and a replication delay threshold can be set to control when it errors out (--seconds_behind_master=(seconds)).
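As a sketch, the two options could be wired into the invocation like this. This is a hypothetical helper; only the flag names come from the MHA documentation, and the command is only built, not executed:

```python
# Hypothetical helper: builds the masterha_check_repl command line with
# the two options mentioned above.
def build_check_repl_argv(conf, skip_check_ssh=False, seconds_behind_master=None):
    argv = ["masterha_check_repl", f"--conf={conf}"]
    if skip_check_ssh:
        argv.append("--skip_check_ssh")
    if seconds_behind_master is not None:
        argv.append(f"--seconds_behind_master={seconds_behind_master}")
    return argv
```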

We also re-tested by breaking the SSH trust the same way as in the check_ssh post; the error log matches check_ssh's, so it is not repeated here.

PS: when any check fails, both check_ssh and check_repl exit 1 with an error and do not continue checking the remaining nodes.

Tip: after deploying MHA, the official docs recommend manually running check_ssh and check_repl first. Based on the findings above, you can skip check_ssh and run check_repl directly, since by default it does check_ssh's work as well :)