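MHA ships two important check tools: one verifies host-to-host communication between the nodes, the other verifies the health of the master-slave replication.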

masterha_check_ssh
Checks whether SSH communication works. It is also invoked when the masterha_manager and masterha_check_repl scripts start.

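The Manager reaches the node instances over SSH
Slave nodes copy relay logs between each other over SSH
(which is why we had to set up SSH trust between the hosts during installation)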

Environment

manager 10.20.64.209
node1 10.20.64.202
node2 10.20.64.203
node3 10.20.64.204
node4 10.20.64.210

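Here we use 1 master and 3 slaves, 4 nodes in total. Many write-ups online use 3 nodes, but in real work we may well build n slaves to spread the read load, so let's be a bit more realistic and see how MHA behaves with more than 3 nodes.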

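PS: we also put the Manager on its own host, unlike the official docs and most online write-ups, which let one of the nodes double as the Manager. Call it part of the planning: a single Manager host can manage n replication groups (and of course you can make the Manager itself highly available).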

SSH check

[mha@n-op-209 etc]$ masterha_check_ssh --conf=/usr/local/mha-manager/etc/test.conf
Wed Mar 23 15:06:32 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
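MHA feels less smart than mysql here: although we passed the absolute path of the conf file with --conf, it still looks for the global configuration file under /etc first. The warning can be ignored.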
Wed Mar 23 15:06:32 2016 - [info] Reading application default configuration from /usr/local/mha-manager/etc/test.conf..
Here it reads the [server default] section of the configuration file
Wed Mar 23 15:06:32 2016 - [info] Reading server configuration from /usr/local/mha-manager/etc/test.conf..
Here it reads the [serverN] sections of the configuration file
Wed Mar 23 15:06:32 2016 - [info] Starting SSH connection tests..
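The pairwise SSH checks start here. Every ordered pair of nodes is tested, so with n nodes there are n*(n-1) checks; with our 4 nodes that is 12.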
Wed Mar 23 15:06:33 2016 - [debug]
Wed Mar 23 15:06:32 2016 - [debug] Connecting via SSH from mha@10.20.64.202(10.20.64.202:22) to mha@10.20.64.203(10.20.64.203:22)..
Wed Mar 23 15:06:32 2016 - [debug] ok.
(node1 to node2)
Wed Mar 23 15:06:32 2016 - [debug] Connecting via SSH from mha@10.20.64.202(10.20.64.202:22) to mha@10.20.64.204(10.20.64.204:22)..
Wed Mar 23 15:06:33 2016 - [debug] ok.
(node1 to node3)
Wed Mar 23 15:06:33 2016 - [debug] Connecting via SSH from mha@10.20.64.202(10.20.64.202:22) to mha@10.20.64.210(10.20.64.210:22)..
Wed Mar 23 15:06:33 2016 - [debug] ok.
(node1 to node4)
---
Wed Mar 23 15:06:34 2016 - [debug]
Wed Mar 23 15:06:33 2016 - [debug] Connecting via SSH from mha@10.20.64.203(10.20.64.203:22) to mha@10.20.64.202(10.20.64.202:22)..
Wed Mar 23 15:06:33 2016 - [debug] ok.
(node2 to node1)
Wed Mar 23 15:06:33 2016 - [debug] Connecting via SSH from mha@10.20.64.203(10.20.64.203:22) to mha@10.20.64.204(10.20.64.204:22)..
Wed Mar 23 15:06:33 2016 - [debug] ok.
(node2 to node3)
Wed Mar 23 15:06:33 2016 - [debug] Connecting via SSH from mha@10.20.64.203(10.20.64.203:22) to mha@10.20.64.210(10.20.64.210:22)..
Wed Mar 23 15:06:33 2016 - [debug] ok.
(node2 to node4)
---
Wed Mar 23 15:06:34 2016 - [debug]
Wed Mar 23 15:06:33 2016 - [debug] Connecting via SSH from mha@10.20.64.204(10.20.64.204:22) to mha@10.20.64.202(10.20.64.202:22)..
Wed Mar 23 15:06:33 2016 - [debug] ok.
(node3 to node1)
Wed Mar 23 15:06:33 2016 - [debug] Connecting via SSH from mha@10.20.64.204(10.20.64.204:22) to mha@10.20.64.203(10.20.64.203:22)..
Wed Mar 23 15:06:34 2016 - [debug] ok.
(node3 to node2)
Wed Mar 23 15:06:34 2016 - [debug] Connecting via SSH from mha@10.20.64.204(10.20.64.204:22) to mha@10.20.64.210(10.20.64.210:22)..
Wed Mar 23 15:06:34 2016 - [debug] ok.
(node3 to node4)
---
Wed Mar 23 15:06:35 2016 - [debug]
Wed Mar 23 15:06:34 2016 - [debug] Connecting via SSH from mha@10.20.64.210(10.20.64.210:22) to mha@10.20.64.202(10.20.64.202:22)..
Wed Mar 23 15:06:34 2016 - [debug] ok.
(node4 to node1)
Wed Mar 23 15:06:34 2016 - [debug] Connecting via SSH from mha@10.20.64.210(10.20.64.210:22) to mha@10.20.64.203(10.20.64.203:22)..
Wed Mar 23 15:06:34 2016 - [debug] ok.
(node4 to node2)
Wed Mar 23 15:06:34 2016 - [debug] Connecting via SSH from mha@10.20.64.210(10.20.64.210:22) to mha@10.20.64.204(10.20.64.204:22)..
Wed Mar 23 15:06:34 2016 - [debug] ok.
(node4 to node3)
Wed Mar 23 15:06:35 2016 - [info] All SSH connection tests passed successfully.
The SSH connection check passed
The check ends here
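
To make the n*(n-1) count concrete, here is a minimal sketch that lists every ordered (source, destination) pair for the four nodes above. The function name list_pairs is hypothetical and is not part of MHA; the sketch only reproduces the enumeration, it does not open any SSH connection.

```shell
# Hypothetical helper (not part of MHA): print every ordered pair of
# distinct nodes, i.e. the n*(n-1) checks that masterha_check_ssh performs.
list_pairs() {
    for src in "$@"; do
        for dst in "$@"; do
            # A node is never checked against itself.
            if [ "$src" = "$dst" ]; then
                continue
            fi
            echo "$src -> $dst"
        done
    done
}

# The four nodes from the environment section.
NODES="10.20.64.202 10.20.64.203 10.20.64.204 10.20.64.210"

list_pairs $NODES
echo "total checks: $(list_pairs $NODES | wc -l | tr -d ' ')"
```

With 4 nodes the last line reports 12, matching the 12 "Connecting via SSH" lines in the log.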

Next we break the SSH trust on node2 and see how masterha_check_ssh behaves

[root@n-op-203 .ssh]# pwd
/home/mha/.ssh
[root@n-op-203 .ssh]# ll
total 20
-rw------- 1 mha mysql 1648 Mar 23 14:30 authorized_keys
-rw-r--r-- 1 mha mysql 152 Mar 14 15:06 environment
-rw------- 1 mha mysql 1675 Dec 25 18:08 id_rsa
-rw-r--r-- 1 mha mysql 412 Dec 25 18:08 id_rsa.pub
-rw-r--r-- 1 mha mysql 1182 Dec 29 16:06 known_hosts
This is the existing trust setup
[root@n-op-203 .ssh]# cat authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzMXm1ZwC/T9eFZJSFhMHzv0RcGX+KNC+kHmOCRlW6hpOGUs4scBR4ulz7G4TgZjl0HYkydnnHlF4DHmfq7STc4Kuf+xXX1qAiZNdRTyNVe/4atBtjPtL3SYsbmbZwqU6M5q7ssg1ZNMZ26Vnh7UVA0JsYxJIGF73lWVs0lqjUxbOGZ6SBlMzFi4NH82MMxjC/JidmTpDoVQhmzZ7TM+Lc7Gs3eTsv/9/cIgaU2sPTt6u4GgM/1FNFT67VsoQQ+yyZ2tTXLSfrIR2UYNtAGafoWvDWwrMBPWpzPjqTp7fAX3qtr3dEk53uXgsi87jjpjXsWM9bK092HUwistFyL2nUw== mha@n-op-202.corp.com
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAp6IGTW+geUV8l2DyThO3WkhA7oKtOtCfaSpoqLHbOCj2BHntX4opDC4zMkGjnKgUH4lE6K+VkkdkTJJvIqJXVFn6xFI0Vd6wLPKiCddvp1KnJb3Eqg/KdwxWzV+q5ReHU6N9J/B5eeRNnyJwhTqG0TuMUhoVVkS80lV4qiS2OpRPv2bohothMnufjlteeJGHqzgt84DTD2QUUYlfXwEv8V8o02pPR2s+vLOvXEv+LCzjOexFknkHmo2/DNhzHqG1QYHgmR/ku2WpJAnFZ3yRVzdahjq/pRKJ87XhN6ZZvNOuLHLuStYvlbq4pGuSUCIJmzn2slbRaVSQKMuuNgHy/Q== mha@n-op-203.corp.com
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEApW6fLJU0HecKi6mz5dYDg2xvGM5QjiNlFEL3PjgOwS9EShfWIDRuDnk1hybCpaWKdfp66DX6iVNDmdfDoQ7Oz+j+pxu9itetKn6eoKNJBsxKtw+LOkxqw+LE/o4iRM77BWDyLtrXGrKpAebK3DZn//d7RiKd8Can+/LJRsTAnYCHj+NPLsd3ELiDnsn/fJfJU81nYk5oPwMVpKepRXu02tLzOyy1Vw7K4mINnSDEMh+vG3iPna/xIS33oea3WIOwC8SFkZ38hzNCHpa85bvSoc5fU0vu0xyOV0irEO+jqGjBEqqcQ+Lg5zRDpkz+Ma1YtVIuR2rwu0kbLTRLrBHGjw== mha@n-op-204.corp.com
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAz+F+j1oehUpbDV9pNogjcHmZZQl0mJsp4pR+pixxnT4/dZGsqxzPzb1CJlxe1OFtPOk6+oD3yQ1kQfTtTmnyqhcZ+06F0stQtaOEc1aYaMGuFVNhvZK0S4JUtJ6liAJb9tQxPdtp1eNHKUBBIDreEM036U1lD/qbDlqRuSLM7DGzlRILu7dKQE3xiAb1VuQg9t/gejjiziivSQ2KVqF33T5wmFE+FSp5rHyArTS8NUCA4LIZmA0OSWykOLh4t5QxFne7kvzOM9zCCSNe+p1YmtuQBIDom5mH9hKouZGciZFWdeBAynAATYjCPPbDBSY30jlWfIkmsZ7wX8UPIyiAmQ== mha@n-op-209.corp.com
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAoWNfSuCiRcR4TUYi1m9Pqpd6v76/jvs+jX8MTgTRIjgz3JeKJ/NpQKJ7GeBSaEmKAg0b8iTNR61fOlV7WMLKo6YRWsuYwD5dIk/wFAZs1gYYsLcU9Dslbqnbz94EMz324sbNZm+05aVfPG0VEEGxjsSKgsl7JogQfk6rq+19cxPizsdBsdo4ZEI0ahbPkzRl4FvvN5NXZZ6sLhH1XmbcJxtGE9cNJiaxTEZe2Nexhfv+MBatDZSSosbHt+MvjwsNfngilSIe6C/fCcVdsdEoLt4B4ZOEbmyLhudnvMTMWWJrbMZzE2AHqIcjl3ctr4rJgPP7f7FEUxvlmVO694/HWQ== mha@n-op-210.corp.com
Rename it
[root@n-op-203 .ssh]# mv authorized_keys authorized_keys.bak

Run masterha_check_ssh again

[mha@n-op-209 etc]$ masterha_check_ssh --conf=/usr/local/mha-manager/etc/test.conf
Wed Mar 23 15:09:56 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed Mar 23 15:09:56 2016 - [info] Reading application default configuration from /usr/local/mha-manager/etc/test.conf..
Wed Mar 23 15:09:56 2016 - [info] Reading server configuration from /usr/local/mha-manager/etc/test.conf..
Wed Mar 23 15:09:56 2016 - [info] Starting SSH connection tests..
The pairwise SSH checks start here. With our 4 nodes that would again be n*(n-1) = 12 checks.
Wed Mar 23 15:09:56 2016 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln63]
Wed Mar 23 15:09:56 2016 - [debug] Connecting via SSH from mha@10.20.64.202(10.20.64.202:22) to mha@10.20.64.203(10.20.64.203:22)..
Permission denied (publickey,password).
(node1 to node2: permission denied)
Wed Mar 23 15:09:56 2016 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln111] SSH connection from mha@10.20.64.202(10.20.64.202:22) to mha@10.20.64.203(10.20.64.203:22) failed!
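(node1 to node2: connection failed!)
Note: node1 to node3 and node1 to node4 are never checked; the log goes straight on to the checks that start from node2.
In other words, as soon as one check fails, the child process for that source node exits ($pm->finish(1)) and skips all of its remaining destinations; the other source nodes are handled by their own child processes.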
---
Wed Mar 23 15:09:57 2016 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln63]
Wed Mar 23 15:09:56 2016 - [debug] Connecting via SSH from mha@10.20.64.203(10.20.64.203:22) to mha@10.20.64.202(10.20.64.202:22)..
Permission denied (publickey,password).
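(node2 to node1: permission denied. With authorized_keys gone on node2, the Manager itself can no longer log in to node2, so the first hop of the check already fails)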
Wed Mar 23 15:09:56 2016 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln111] SSH connection from mha@10.20.64.203(10.20.64.203:22) to mha@10.20.64.202(10.20.64.202:22) failed!
(node2 to node1: connection failed!)
---
Wed Mar 23 15:09:57 2016 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln63]
Why is an error reported when the check right below succeeds?
Wed Mar 23 15:09:57 2016 - [debug] Connecting via SSH from mha@10.20.64.204(10.20.64.204:22) to mha@10.20.64.202(10.20.64.202:22)..
Wed Mar 23 15:09:57 2016 - [debug] ok.
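(this is the first successful check: node3 to node1)
So the error line above logically belongs here (my insertion: Wed Mar 23 15:09:57 2016 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln63]). The parent prints the whole per-node log at error level as soon as any check from that node fails.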
Wed Mar 23 15:09:57 2016 - [debug] Connecting via SSH from mha@10.20.64.204(10.20.64.204:22) to mha@10.20.64.203(10.20.64.203:22)..
Permission denied (publickey,password).
(node3 to node2: permission denied)
Wed Mar 23 15:09:57 2016 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln111] SSH connection from mha@10.20.64.204(10.20.64.204:22) to mha@10.20.64.203(10.20.64.203:22) failed!
(node3 to node2: connection failed!)
---
Wed Mar 23 15:09:58 2016 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln63]
Why is an error reported when the check right below succeeds?
Wed Mar 23 15:09:57 2016 - [debug] Connecting via SSH from mha@10.20.64.210(10.20.64.210:22) to mha@10.20.64.202(10.20.64.202:22)..
Wed Mar 23 15:09:57 2016 - [debug] ok.
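(this is the second successful check: node4 to node1)
So the error line above logically belongs here (my insertion: Wed Mar 23 15:09:58 2016 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln63]). Again, the parent prints the whole per-node log at error level because a later check from node4 fails.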
Wed Mar 23 15:09:57 2016 - [debug] Connecting via SSH from mha@10.20.64.210(10.20.64.210:22) to mha@10.20.64.203(10.20.64.203:22)..
Permission denied (publickey,password).
(node4 to node2: permission denied)
Wed Mar 23 15:09:58 2016 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln111] SSH connection from mha@10.20.64.210(10.20.64.210:22) to mha@10.20.64.203(10.20.64.203:22) failed!
(node4 to node2: connection failed!)
semi-panic: attempt to dup freed string at /usr/local/share/perl5/Carp.pm line 229.
SSH Configuration Check Failed!
at /usr/local/mha-manager/bin/masterha_check_ssh line 44.
The SSH connection check failed!
The check ends here
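
The abort-on-first-failure behaviour seen in this log can be reproduced with a small simulation. This is only a sketch under an assumption of mine: a check is treated as failed whenever either end is the node whose authorized_keys was renamed (node2), because the Manager cannot log in to it either. check_from and BROKEN are hypothetical names; nothing here calls MHA or ssh.

```shell
# Hypothetical model of one source node's check loop.
# Assumption: a hop fails if the source or the destination is the broken node.
BROKEN="node2"
NODES="node1 node2 node3 node4"

check_from() {
    src=$1
    shift
    for dst in "$@"; do
        if [ "$dst" = "$src" ]; then
            continue
        fi
        if [ "$src" = "$BROKEN" ] || [ "$dst" = "$BROKEN" ]; then
            echo "$src -> $dst failed"
            # Stop here, like the child exiting on its first failure:
            # the remaining destinations are never tried.
            return 1
        fi
        echo "$src -> $dst ok"
    done
    return 0
}

for node in $NODES; do
    check_from "$node" $NODES || true
done
```

The simulation prints six lines (two successes, four failures), the same sequence of results as the log above; node1 to node3, node1 to node4 and every check towards node3 or node4 from the other sources are simply never attempted.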

As you can see, a failed SSH check is reported in two parts

SSHCheck.pm, ln63: Permission denied (publickey,password).
SSHCheck.pm, ln111: SSH connection from X to Y failed!

Let's see what SSHCheck.pm is doing around ln63 and ln111

  1 #!/usr/bin/env perl
...
 31 use Parallel::ForkManager;
...
 47   my $pm = new Parallel::ForkManager( $#servers + 1 );
...
 55   $pm->run_on_finish(
 56     sub {
 57       my ( $pid, $exit_code, $target ) = @_;
 58       return if ( $target->{skip_init_ssh_check} );
 59       my $local_file =
 60         "$workdir/$target->{ssh_host}_$target->{ssh_port}_ssh_check.log";
 61       if ($exit_code) {
 62         $failed = 1;
 63         if ( -f $local_file ) {
 64           $log->error( "\n" . `cat $local_file` );
 65         }
 66       }
 67       else {
 68         if ( -f $local_file ) {
 69           $log->debug( "\n" . `cat $local_file` );
 70         }
 71       }
 72       unlink $local_file;
 73     }
 74   );
...
101       foreach my $dst (@servers) {
102         next if ( $dst->{skip_init_ssh_check} );
103         next if ( $src->{id} eq $dst->{id} );
104         $pplog->debug(
105           " Connecting via SSH from $src->{ssh_user}\@$src->{ssh_host}($src->{ssh_ip}:$src->{ssh_port}) to $dst->{ssh_user}\@$dst->{ssh_host}($dst->{ssh_ip}:$dst->{ssh_port}).."
106         );
107         my $command =
108           "ssh $MHA::ManagerConst::SSH_OPT_CHECK -p $src->{ssh_port} $src->{ssh_user}\@$src->{ssh_ip} \"ssh $MHA::ManagerConst::SSH_OPT_CHECK -p $dst->{ssh_port} $dst->{ssh_user}\@$dst->{ssh_ip} exit 0\"";
109         my ( $high, $low ) = MHA::ManagerUtil::exec_system( $command, $file );
110         if ( $high != 0 || $low != 0 ) {
111           $pplog->error(
112             "SSH connection from $src->{ssh_user}\@$src->{ssh_host}($src->{ssh_ip}:$src->{ssh_port}) to $dst->{ssh_user}\@$dst->{ssh_host}($dst->{ssh_ip}:$dst->{ssh_port}) failed!"
113           );
114           $pm->finish(1);
115         }
116         $pplog->debug(" ok.");
117       }
118       $pm->finish(0);
119     };

The Parallel::ForkManager module achieves parallelism by forking processes rather than by creating threads
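
The same fork-and-collect pattern can be sketched in shell. This is an analogy, not MHA's implementation: each background subshell plays one forked child and writes to its own log file, and the parent waits for each child, then prints that log at error or debug level depending on the exit code before deleting it, much like the run_on_finish callback in the listing. worker and run_parallel are hypothetical names, and the "odd ids fail" rule exists only to produce both outcomes.

```shell
# Hypothetical stand-in for one child's work: even ids succeed, odd ids fail.
worker() {
    id=$1
    echo "checks from source $id"
    [ $(( $id % 2 )) -eq 0 ]
}

run_parallel() {
    workdir=$(mktemp -d)
    pids=""
    # Start one background child per argument, each with its own log file.
    for id in "$@"; do
        ( worker "$id" > "$workdir/$id.log" 2>&1 ) &
        pids="$pids $!:$id"
    done
    failed=0
    # Parent side: collect each child's exit code and replay its log.
    for entry in $pids; do
        pid=${entry%%:*}
        id=${entry##*:}
        if wait "$pid"; then
            echo "[debug] $(cat "$workdir/$id.log")"
        else
            failed=1
            echo "[error] $(cat "$workdir/$id.log")"
        fi
        rm -f "$workdir/$id.log"
    done
    rmdir "$workdir"
    return "$failed"
}
```

Calling run_parallel 1 2 prints the log of child 1 under [error] and the log of child 2 under [debug], and returns 1 because at least one child failed. That is the same reason a whole per-node block in the MHA output is tagged [error] even when some of its checks succeeded.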

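Now it is a matter of quick fingers, since the whole check finishes within seconds; you can also write a script to monitor the processes and the intermediate output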

Since these are forked processes, we can catch them with ps

[root@n-op-209 MHA]# ps -ef|grep check
mha 11149 10757 19 16:15 pts/0 00:00:00 perl /usr/local/mha-manager/bin/masterha_check_ssh --conf=/usr/local/mha-manager/etc/test.conf
mha 11170 11149 0 16:15 pts/0 00:00:00 perl /usr/local/mha-manager/bin/masterha_check_ssh --conf=/usr/local/mha-manager/etc/test.conf
mha 11171 11170 0 16:15 pts/0 00:00:00 sh -c ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5 -p 22 mha@10.20.64.210 "ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5 -p 22 mha@10.20.64.202 exit 0" >> /tmp/10.20.64.210_22_ssh_check.log 2>&1
[root@n-op-209 MHA]# ps -ef|grep check
mha 11149 10757 9 16:15 pts/0 00:00:00 perl /usr/local/mha-manager/bin/masterha_check_ssh --conf=/usr/local/mha-manager/etc/test.conf
mha 11170 11149 0 16:15 pts/0 00:00:00 perl /usr/local/mha-manager/bin/masterha_check_ssh --conf=/usr/local/mha-manager/etc/test.conf
mha 11175 11170 0 16:15 pts/0 00:00:00 sh -c ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5 -p 22 mha@10.20.64.210 "ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5 -p 22 mha@10.20.64.203 exit 0" >> /tmp/10.20.64.210_22_ssh_check.log 2>&1

11149 is masterha_check_ssh
11170 is a child process forked from 11149
11171 and 11175 are both forked from 11170 to do the actual work
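
For readers who would rather not rely on quick fingers, a monitoring script along the lines suggested earlier might look like the sketch below. snapshot and watch_check are hypothetical names; the /tmp location and the _ssh_check.log suffix are taken from the ps output above, while the polling interval and the use of a fractional sleep (supported by GNU and BSD sleep, not guaranteed by POSIX) are my assumptions.

```shell
# Hypothetical helper: print matching processes plus any per-source check logs.
snapshot() {
    pattern=$1
    logdir=$2
    # The process list may legitimately be empty, hence "|| true".
    ps -ef | grep -- "$pattern" | grep -v grep || true
    for f in "$logdir"/*_ssh_check.log; do
        # Skip the unexpanded glob when no log file exists yet.
        [ -f "$f" ] || continue
        echo "== $f"
        cat "$f"
    done
}

# Take n snapshots, roughly five per second (the interval is an assumption).
watch_check() {
    n=$1
    while [ "$n" -gt 0 ]; do
        snapshot masterha_check_ssh /tmp
        sleep 0.2
        n=$((n - 1))
    done
}
```

One way to use it: start watch_check 50 in a second terminal on the Manager, then run masterha_check_ssh in the first one.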

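Option                                         Description
StrictHostKeyChecking=no                       How strictly ssh checks the remote host's public key; with no, unknown host keys are accepted automatically, so no confirmation prompt appears
PasswordAuthentication=no                      Disables password authentication, so ssh never falls back to asking for a password
BatchMode=yes                                  No password or passphrase prompts; if key-based login does not work, the connection simply fails
ConnectTimeout=5                               Connection timeout of 5 seconds
>> /tmp/10.20.64.210_22_ssh_check.log 2>&1     Appends both standard output and standard error to the temporary log file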
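
To see how these options end up in the two-hop command, the sketch below rebuilds the command line that appears in the ps output. It only prints the command, it does not run it. build_check_cmd is a hypothetical helper, and the value of SSH_OPT_CHECK here is copied from that ps output rather than from MHA's source.

```shell
# Option string as observed in the ps output (copied, not read from MHA).
SSH_OPT_CHECK="-o StrictHostKeyChecking=no -o PasswordAuthentication=no -o BatchMode=yes -o ConnectTimeout=5"

# Hypothetical helper: print the nested command for one check.
# The outer ssh logs in to the source node; the inner ssh, run on the
# source node, logs in to the destination node and exits immediately.
build_check_cmd() {
    user=$1
    src=$2
    dst=$3
    port=${4:-22}
    printf 'ssh %s -p %s %s@%s "ssh %s -p %s %s@%s exit 0"\n' \
        "$SSH_OPT_CHECK" "$port" "$user" "$src" \
        "$SSH_OPT_CHECK" "$port" "$user" "$dst"
}

build_check_cmd mha 10.20.64.210 10.20.64.203
```

Because the check is two hops deep, it fails both when the Manager cannot reach the source node and when the source node cannot reach the destination, which is why every check involving node2 failed once its authorized_keys was renamed.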

Log sample

[root@n-op-209 tmp]# cat /tmp/10.20.64.210_22_ssh_check.log
Wed Mar 23 16:15:56 2016 - [debug] Connecting via SSH from mha@10.20.64.210(10.20.64.210:22) to mha@10.20.64.202(10.20.64.202:22)..
[root@n-op-209 tmp]# cat /tmp/10.20.64.210_22_ssh_check.log
Wed Mar 23 16:15:56 2016 - [debug] Connecting via SSH from mha@10.20.64.210(10.20.64.210:22) to mha@10.20.64.202(10.20.64.202:22)..
Wed Mar 23 16:15:57 2016 - [debug] ok.
Wed Mar 23 16:15:57 2016 - [debug] Connecting via SSH from mha@10.20.64.210(10.20.64.210:22) to mha@10.20.64.203(10.20.64.203:22)..
Permission denied (publickey,password).
Wed Mar 23 16:15:57 2016 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln111] SSH connection from mha@10.20.64.210(10.20.64.210:22) to mha@10.20.64.203(10.20.64.203:22) failed!

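The log content looks familiar (SSHCheck.pm, ln111): that line is MHA's verdict on the check. The temporary log is also passed back up to the top-level parent process. In other words, once we start masterha_manager, this content gets written to /usr/local/mha-manager/logs/test.log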

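So "Permission denied (publickey,password)" is the raw error from ssh itself and states why the connection failed, while "SSH connection from X to Y failed!" is the line MHA logs at ln111 to record its verdict for that pair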

There is also a final summary stating that the check as a whole failed

semi-panic: attempt to dup freed string at /usr/local/share/perl5/Carp.pm line 229.
Error reporting: Carp prints the file and line of the caller where the error was raised, which makes the error message clearer
SSH Configuration Check Failed!
at /usr/local/mha-manager/bin/masterha_check_ssh line 44.

masterha_check_ssh line 44

44 exit MHA::SSHCheck::main(@ARGV);