MHA故障转移之masterha_master_switch | 喀秋莎的技术博客

MHA有两个故障转移工具，分别支持手动和自动切换

两个工具各有用武之地，如果你想高可用快速failover，那么自动切换是一个不错的选择；
如果需要一些维护性切换，那么手动切换可以满足，并且做成平台自动化的话，可以调用手动切换脚本

masterha_master_switch
既然是手动切换，那么不管master是好是坏，我们想切就切：）

这个时候，我们可以告诉master_switch，master到底是好的还是坏的，并且可以指定新主

环境

manager 10.20.64.209
node1	10.20.64.202 master
node2	10.20.64.203 candidate
node3	10.20.64.204 candidate
node4	10.20.64.210 no master
mysql> show slave hosts;
+-----------+------+------+-----------+--------------------------------------+
| Server_id | Host | Port | Master_id | Slave_UUID                           |
+-----------+------+------+-----------+--------------------------------------+
|       203 |      | 3306 |       202 | 1ecea136-e9c5-11e5-81e7-005056ac51d6 |
|       204 |      | 3306 |       202 | 1f21f462-e9c5-11e5-81e7-005056ac02c4 |
|       210 |      | 3306 |       202 | c34b6df4-f0c3-11e5-af82-005056ac361f |
+-----------+------+------+-----------+--------------------------------------+
3 rows in set (0.00 sec)

switch切换
本次测试为主从静态切换，所以不存在丢数据的情况，我们的目的是了解下switch的切换机制（后续我们会在进行动态切换后数据一致性的测试）

手动切换会有一些人机交互的内容，如果你懒得回答那些问题，只希望控制是否切换，而剩下的工作全权交给mha去做的话，可以加上--interactive=0这个参数

这里有几个参数值得说一下

对于维护性切换，比如升级个硬件啥的，这几个参数可以这样结合的使用，省时省力

参数	说明
interactive=0	无需交互，按照给定的选项自动执行
new_master_host=10.20.64.204	指定新主
orig_master_is_new_slave	它可以在alive模式切换后，将老主库与新主建立同步，华丽变身为从库（我们可以随意下线维护了：））
remove_orig_master_conf	从配置文件清理老主库信息

[mha@n-op-209 etc]$ masterha_master_switch --conf=/usr/local/mha-manager/etc/test.conf --master_state=dead --dead_master_host=10.20.64.202
Wed Apr  6 17:03:12 2016 - [info] Dead Servers:
Wed Apr  6 17:03:12 2016 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln187] None of server is dead. Stop failover.
Wed Apr  6 17:03:12 2016 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/mha-manager/bin/masterha_master_switch line 53.

如果你的master并没有坏，会停止切换
所以，我们改用alive的方式

[mha@n-op-209 etc]$ masterha_master_switch --conf=/usr/local/mha-manager/etc/test.conf --master_state=alive
Thu Apr  7 14:04:23 2016 - [info] MHA::MasterRotate version 0.56.
Thu Apr  7 14:04:23 2016 - [info] Starting online master switch..
Thu Apr  7 14:04:23 2016 - [info]
Thu Apr  7 14:04:23 2016 - [info] * Phase 1: Configuration Check Phase..
Thu Apr  7 14:04:23 2016 - [info]
Thu Apr  7 14:04:23 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Apr  7 14:04:23 2016 - [info] Reading application default configuration from /usr/local/mha-manager/etc/test.conf..
Thu Apr  7 14:04:23 2016 - [info] Reading server configuration from /usr/local/mha-manager/etc/test.conf..
读取配置信息后，开始检测
Thu Apr  7 14:04:24 2016 - [info] GTID failover mode = 0
Thu Apr  7 14:04:24 2016 - [info] Current Alive Master: 10.20.64.202(10.20.64.202:3306)
Thu Apr  7 14:04:24 2016 - [info] Alive Slaves:
Thu Apr  7 14:04:24 2016 - [info]   10.20.64.203(10.20.64.203:3306)  Version=5.6.27-76.0-log (oldest major version between slaves) log-bin:enabled
Thu Apr  7 14:04:24 2016 - [info]     Replicating from 10.20.64.202(10.20.64.202:3306)
Thu Apr  7 14:04:24 2016 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Apr  7 14:04:24 2016 - [info]   10.20.64.204(10.20.64.204:3306)  Version=5.6.27-76.0-log (oldest major version between slaves) log-bin:enabled
Thu Apr  7 14:04:24 2016 - [info]     Replicating from 10.20.64.202(10.20.64.202:3306)
Thu Apr  7 14:04:24 2016 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Apr  7 14:04:24 2016 - [info]   10.20.64.210(10.20.64.210:3306)  Version=5.6.27-76.0-log (oldest major version between slaves) log-bin:enabled
Thu Apr  7 14:04:24 2016 - [info]     Replicating from 10.20.64.202(10.20.64.202:3306)
Thu Apr  7 14:04:24 2016 - [info]     Not candidate for the new Master (no_master is set)
这段是由masterha_check_repl输出的
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 10.20.64.202(10.20.64.202:3306)? (YES/no): yes
Thu Apr  7 14:04:28 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Thu Apr  7 14:04:28 2016 - [info]  ok.
是否允许主库本地执行Flush Tables，以下执行语句不会写入binlog，所以不会被同步到从库
回答：yes
Thu Apr  7 14:04:28 2016 - [info] Checking MHA is not monitoring or doing failover..
Thu Apr  7 14:04:28 2016 - [info] Checking replication health on 10.20.64.203..
Thu Apr  7 14:04:28 2016 - [info]  ok.
Thu Apr  7 14:04:28 2016 - [info] Checking replication health on 10.20.64.204..
Thu Apr  7 14:04:28 2016 - [info]  ok.
Thu Apr  7 14:04:28 2016 - [info] Checking replication health on 10.20.64.210..
Thu Apr  7 14:04:28 2016 - [info]  ok.
逐个检查从库同步健康情况
Thu Apr  7 14:04:28 2016 - [info] Searching new master from slaves..
Thu Apr  7 14:04:28 2016 - [info]  Candidate masters from the configuration file:
从配置文件读取候选人信息
Thu Apr  7 14:04:28 2016 - [info]   10.20.64.202(10.20.64.202:3306)  Version=5.6.27-76.0-log log-bin:enabled
Thu Apr  7 14:04:28 2016 - [info]   10.20.64.203(10.20.64.203:3306)  Version=5.6.27-76.0-log (oldest major version between slaves) log-bin:enabled
Thu Apr  7 14:04:28 2016 - [info]     Replicating from 10.20.64.202(10.20.64.202:3306)
Thu Apr  7 14:04:28 2016 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Apr  7 14:04:28 2016 - [info]   10.20.64.204(10.20.64.204:3306)  Version=5.6.27-76.0-log (oldest major version between slaves) log-bin:enabled
Thu Apr  7 14:04:28 2016 - [info]     Replicating from 10.20.64.202(10.20.64.202:3306)
Thu Apr  7 14:04:28 2016 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Apr  7 14:04:28 2016 - [info]  Non-candidate masters:
Thu Apr  7 14:04:28 2016 - [info]   10.20.64.210(10.20.64.210:3306)  Version=5.6.27-76.0-log (oldest major version between slaves) log-bin:enabled
Thu Apr  7 14:04:28 2016 - [info]     Replicating from 10.20.64.202(10.20.64.202:3306)
Thu Apr  7 14:04:28 2016 - [info]     Not candidate for the new Master (no_master is set)
这段是由masterha_check_repl输出的
Thu Apr  7 14:04:28 2016 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
从候选人中寻找relay log中记录的事件最新的从库
Thu Apr  7 14:04:28 2016 - [info]
From:
10.20.64.202(10.20.64.202:3306) (current master) (test master)
 +--10.20.64.203(10.20.64.203:3306) (test standby)
 +--10.20.64.204(10.20.64.204:3306) (test standby)
 +--10.20.64.210(10.20.64.210:3306) (test slave)
描述下当前架构
To:
10.20.64.203(10.20.64.203:3306) (new master) (test standby)
 +--10.20.64.204(10.20.64.204:3306) (test standby)
 +--10.20.64.210(10.20.64.210:3306) (test slave)
mha甄别后推荐的新架构
Starting master switch from 10.20.64.202(10.20.64.202:3306) to 10.20.64.203(10.20.64.203:3306)? (yes/NO): no
mha：是否使用它推荐的203作为新主
答：no
（这里主要是想了解下，它还会问啥：）） 
Continue? (yes/NO): yes
mha：是否还要继续
答：yes
Enter new master host name: 10.20.64.204
mha：输入我希望的新主
Master switch to 10.20.64.204(10.20.64.204:3306). OK? (yes/NO): yes
mha：再次问我确定选择这个新主么？
答：yes
Thu Apr  7 14:04:37 2016 - [info] Checking whether 10.20.64.204(10.20.64.204:3306) is ok for the new master..
Thu Apr  7 14:04:37 2016 - [info]  ok.
开始检测新主
Thu Apr  7 14:04:37 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Apr  7 14:04:37 2016 - [info]
第一阶段：检测配置
Thu Apr  7 14:04:37 2016 - [info] * Phase 2: Rejecting updates Phase..
Thu Apr  7 14:04:37 2016 - [info]
第二阶段：拒接更新
master_ip_online_change_script is not defined. If you do not disable writes on the current master manually, applications keep writing on the current master. Is it ok to proceed? (yes/NO): yes
首先交代下我们没有自定义切换脚本
其次，从这里我们可以知道，如果我们不手动控制老主库的写操作，mha是不会做任何处理的。也就是说，在切换的过程中，在应用写指向没有改到新主的时候（或者高可用，路由指向还没有被更新到新主的时候），写操作会持续不断的写入老主库
（这个环节有很多值得我们思考的地方，所以建议DBA编写自定义failover脚本，处理老库写操作问题）
Thu Apr  7 14:04:39 2016 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
将老主库锁全表（包括mysql root用户更新操作）
Thu Apr  7 14:04:39 2016 - [info] Executing FLUSH TABLES WITH READ LOCK..
Thu Apr  7 14:04:39 2016 - [info]  ok.
Thu Apr  7 14:04:39 2016 - [info] Orig master binlog:pos is mysql-bin.000001:120.
Thu Apr  7 14:04:39 2016 - [info]  Waiting to execute all relay logs on 10.20.64.204(10.20.64.204:3306)..
这里加了一个读锁，只允许读，不允许写，并获取老主库binlog和pos点
Thu Apr  7 14:04:39 2016 - [info]  master_pos_wait(mysql-bin.000001:120) completed on 10.20.64.204(10.20.64.204:3306). Executed 0 events.
Thu Apr  7 14:04:39 2016 - [info]   done.
等待新主执行完所有的relay log
Thu Apr  7 14:04:39 2016 - [info] Getting new master's binlog name and position..
Thu Apr  7 14:04:39 2016 - [info]  mysql-bin.000004:120
获取新主的binlog和pos点
Thu Apr  7 14:04:39 2016 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.20.64.204', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000004', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='xxx';
所有的其它从库，重新与新主建立同步
Thu Apr  7 14:04:39 2016 - [info] Setting read_only=0 on 10.20.64.204(10.20.64.204:3306)..
以前我们的候选人是读库，所以read_only=1
现在它被切换为新主，需要支持写操作，mha在这里设置read_only=0
Thu Apr  7 14:04:39 2016 - [info]  ok.
Thu Apr  7 14:04:39 2016 - [info]
Thu Apr  7 14:04:39 2016 - [info] * Switching slaves in parallel..
Thu Apr  7 14:04:39 2016 - [info]
Thu Apr  7 14:04:39 2016 - [info] -- Slave switch on host 10.20.64.203(10.20.64.203:3306) started, pid: 30094
Thu Apr  7 14:04:39 2016 - [info]
(node2)
Thu Apr  7 14:04:39 2016 - [info] -- Slave switch on host 10.20.64.210(10.20.64.210:3306) started, pid: 30095
Thu Apr  7 14:04:39 2016 - [info]
(node4)
Thu Apr  7 14:04:40 2016 - [info] Log messages from 10.20.64.203 ...
Thu Apr  7 14:04:40 2016 - [info]
Thu Apr  7 14:04:39 2016 - [info]  Waiting to execute all relay logs on 10.20.64.203(10.20.64.203:3306)..
Thu Apr  7 14:04:39 2016 - [info]  master_pos_wait(mysql-bin.000001:120) completed on 10.20.64.203(10.20.64.203:3306). Executed 0 events.
Thu Apr  7 14:04:39 2016 - [info]   done.
首先执行完积压的relay log
Thu Apr  7 14:04:39 2016 - [info]  Resetting slave 10.20.64.203(10.20.64.203:3306) and starting replication from the new master 10.20.64.204(10.20.64.204:3306)..
Thu Apr  7 14:04:39 2016 - [info]  Executed CHANGE MASTER.
Thu Apr  7 14:04:39 2016 - [info]  Slave started.
Thu Apr  7 14:04:40 2016 - [info] End of log messages from 10.20.64.203 ...
Thu Apr  7 14:04:40 2016 - [info]
Thu Apr  7 14:04:40 2016 - [info] -- Slave switch on host 10.20.64.203(10.20.64.203:3306) succeeded.
然后重置同步，建立与新主的同步，切同步成功
这段是由node本地脚本apply_diff_relay_logs输出的
Thu Apr  7 14:04:40 2016 - [info] Log messages from 10.20.64.210 ...
Thu Apr  7 14:04:40 2016 - [info]
Thu Apr  7 14:04:39 2016 - [info]  Waiting to execute all relay logs on 10.20.64.210(10.20.64.210:3306)..
Thu Apr  7 14:04:39 2016 - [info]  master_pos_wait(mysql-bin.000001:120) completed on 10.20.64.210(10.20.64.210:3306). Executed 0 events.
Thu Apr  7 14:04:39 2016 - [info]   done.
Thu Apr  7 14:04:39 2016 - [info]  Resetting slave 10.20.64.210(10.20.64.210:3306) and starting replication from the new master 10.20.64.204(10.20.64.204:3306)..
Thu Apr  7 14:04:39 2016 - [info]  Executed CHANGE MASTER.
Thu Apr  7 14:04:39 2016 - [info]  Slave started.
Thu Apr  7 14:04:40 2016 - [info] End of log messages from 10.20.64.210 ...
Thu Apr  7 14:04:40 2016 - [info]
Thu Apr  7 14:04:40 2016 - [info] -- Slave switch on host 10.20.64.210(10.20.64.210:3306) succeeded.
到这里，所有从库与新主同步已建立完成，切同步成功
Thu Apr  7 14:04:40 2016 - [info] Unlocking all tables on the orig master:
Thu Apr  7 14:04:40 2016 - [info] Executing UNLOCK TABLES..
Thu Apr  7 14:04:40 2016 - [info]  ok.
解锁老库读锁
（再次提起，应用指向未改的情况下，老库的写操作还会继续，且不会被同步到从库，数据不一致就是在这个环节产生的，有可能丢失1秒钟的数据。所以，最好在自定义脚本中先处理下这个问题）
Thu Apr  7 14:04:40 2016 - [info] All new slave servers switched successfully.
Thu Apr  7 14:04:40 2016 - [info]
所有从库已经切换成功
Thu Apr  7 14:04:40 2016 - [info] * Phase 5: New master cleanup phase..
Thu Apr  7 14:04:40 2016 - [info]
第五阶段：清理新主（我们show slave hosts内容为空就对了，不过根据我的观察，这个结果有延迟）
Thu Apr  7 14:04:40 2016 - [info]  10.20.64.204: Resetting slave info succeeded.
重置新主的slave状态（我们show slave status\G内容为空就对了）
Thu Apr  7 14:04:40 2016 - [info] Switching master to 10.20.64.204(10.20.64.204:3306) completed successfully.
至此切换完毕

我们来看mysql
旧主: 10.20.64.202

1 2	mysql> show slave hosts; Empty set (0.00 sec)

新主: 10.20.64.204

mmysql> show slave hosts;
+-----------+------+------+-----------+--------------------------------------+
| Server_id | Host | Port | Master_id | Slave_UUID                           |
+-----------+------+------+-----------+--------------------------------------+
|       203 |      | 3306 |       204 | 1ecea136-e9c5-11e5-81e7-005056ac51d6 |
|       210 |      | 3306 |       204 | c34b6df4-f0c3-11e5-af82-005056ac361f |
+-----------+------+------+-----------+--------------------------------------+
2 rows in set (0.00 sec)

新主binlog

mysql> show master status\G
*************************** 1. row ***************************
             File: mysql-bin.000004
         Position: 120
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set:
1 row in set (0.00 sec)

新的主从同步抽样

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.20.64.204
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 120
               Relay_Log_File: relay-bin.000002
                Relay_Log_Pos: 283
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 120
              Relay_Log_Space: 450
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 204
                  Master_UUID: 1f21f462-e9c5-11e5-81e7-005056ac02c4
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0.00 sec)

我们来看看切换的过程中，mha对mysql做了哪些操作

老主库

    Time Id Command Argument
14:04:23 37 Connect mha@10.20.64.209 on
         37 Query   set autocommit=1
         37 Query   SELECT CONNECTION_ID() AS Value
14:04:24 38 Connect mha@10.20.64.209 on
         38 Query   set autocommit=1
         38 Query   SELECT CONNECTION_ID() AS Value
         38 Query   SET wait_timeout=86400
         38 Query   SELECT @@global.server_id As Value
         38 Query   SELECT VERSION() AS Value
         38 Query   SELECT @@global.gtid_mode As Value
         38 Query   SHOW GLOBAL VARIABLES LIKE 'log_bin'
         38 Query   SHOW MASTER STATUS
         38 Query   SELECT @@global.datadir AS Value
         38 Query   SELECT @@global.slave_parallel_workers AS Value
         38 Query   SHOW SLAVE STATUS
         38 Query   SELECT @@global.read_only As Value
         38 Query   SELECT @@global.relay_log_purge As Value
         
14:04:28 38 Query   FLUSH NO_WRITE_TO_BINLOG TABLES
与日志对应，刷不写binlog
         38 Query   SELECT GET_LOCK('MHA_Master_High_Availability_Monitor', '0') AS Value
获取MHA Monitor的应用锁         
         
         38 Query   SHOW PROCESSLIST
14:04:37 38 Query   SELECT RELEASE_LOCK('MHA_Master_High_Availability_Monitor') As Value
释放MHA Monitor的应用锁 
         38 Quit
         
14:04:39 39 Connect mha@10.20.64.209 on
         39 Query   set autocommit=1
         39 Query   SELECT CONNECTION_ID() AS Value
         39 Query   SET wait_timeout=86400
         
         39 Query   FLUSH TABLES WITH READ LOCK
与日志对应，刷读锁
         39 Query   SHOW MASTER STATUS
与日志对应，获取主库binlog和pos
14:04:40 39 Query   UNLOCK TABLES
与日志对应，释放读锁
         39 Query   SELECT RELEASE_LOCK('MHA_Master_High_Availability_Failover') As Value
释放MHA Failover锁
（好奇怪，前面并没有加锁）
         39 Quit

新主库

    Time Id Command Argument
14:04:23 35 Connect mha@10.20.64.209 on
         35 Query   set autocommit=1
         35 Query   SELECT CONNECTION_ID() AS Value
14:04:24 36 Connect mha@10.20.64.209 on
         36 Query   set autocommit=1
         36 Query   SELECT CONNECTION_ID() AS Value
         36 Query   SET wait_timeout=86400
         36 Query   SELECT @@global.server_id As Value
         36 Query   SELECT VERSION() AS Value
         36 Query   SELECT @@global.gtid_mode As Value
         36 Query   SHOW GLOBAL VARIABLES LIKE 'log_bin'
         36 Query   SHOW MASTER STATUS
         36 Query   SELECT @@global.datadir AS Value
         36 Query   SELECT @@global.slave_parallel_workers AS Value
         36 Query   SHOW SLAVE STATUS
         36 Query   SELECT @@global.read_only As Value
         36 Query   SELECT @@global.relay_log_purge As Value
         36 Query   SELECT @@global.relay_log_info_repository AS Value
         
         36 Query   SELECT Relay_log_name FROM mysql.slave_relay_log_info
         36 Query   SELECT @@global.datadir AS Value
         36 Query   SHOW SLAVE STATUS
         36 Query   SELECT Repl_slave_priv AS Value FROM mysql.user WHERE user = 'repl'
14:04:28 36 Query   SELECT GET_LOCK('MHA_Master_High_Availability_Failover', '0') AS Value
获取MHA Failover锁
         36 Query   SHOW SLAVE STATUS
         36 Query   SHOW SLAVE STATUS
14:04:37 36 Query   SHOW PROCESSLIST
14:04:39 36 Query   SHOW SLAVE STATUS
上面获取了很多状态信息
         36 Query   SELECT MASTER_POS_WAIT('mysql-bin.000001','120',0) AS Result
这里使用MASTER_POS_WAIT确保该实例读取并执行完成到指定binlog位置
         36 Query   STOP SLAVE SQL_THREAD
停止同步的SQL线程，也就是说，即使还有relay log堆积，也放弃执行后面的语句
         36 Query   SHOW SLAVE STATUS
         36 Query   SHOW MASTER STATUS
         36 Query   SELECT @@global.read_only As Value
         36 Query   SELECT @@global.read_only As Value
         36 Query   SET GLOBAL read_only=0
与日志影响，这是新主read_only=0，支持写操作
         37 Connect repl@10.20.64.210 on         
         37 Query   SELECT UNIX_TIMESTAMP()
         37 Query   SHOW VARIABLES LIKE 'SERVER_ID'
         37 Query   SET @master_heartbeat_period= 30000001024
         37 Query   SET @master_binlog_checksum= @@global.binlog_checksum
         37 Query   SELECT @master_binlog_checksum
         37 Query   SELECT @@GLOBAL.GTID_MODE
         37 Query   SHOW VARIABLES LIKE 'SERVER_UUID'
         37 Query   SET @slave_uuid= 'c34b6df4-f0c3-11e5-af82-005056ac361f'
         37 Binlog Dump       Log: 'mysql-bin.000004'  Pos: 120
node4从库已完成change master并slave started了         
         38 Connect repl@10.20.64.203 on         
         38 Query   SELECT UNIX_TIMESTAMP()
         38 Query   SHOW VARIABLES LIKE 'SERVER_ID'
         38 Query   SET @master_heartbeat_period= 30000001024
         38 Query   SET @master_binlog_checksum= @@global.binlog_checksum
         38 Query   SELECT @master_binlog_checksum
         38 Query   SELECT @@GLOBAL.GTID_MODE
         38 Query   SHOW VARIABLES LIKE 'SERVER_UUID'
         38 Query   SET @slave_uuid= '1ecea136-e9c5-11e5-81e7-005056ac51d6'
         38 Binlog Dump       Log: 'mysql-bin.000004'  Pos: 120
node2从库已完成change master并slave started了         
14:04:40 36 Query   STOP SLAVE
         36 Query   SHOW SLAVE STATUS
         36 Query   RESET SLAVE /*!50516 ALL */
         36 Query   SHOW SLAVE STATUS
与日志对应，停止与老主库同步，重置同步信息
         36 Query   SELECT RELEASE_LOCK('MHA_Master_High_Availability_Failover') As Value
释放MHA Failover锁
         36 Quit

其余2个从库

    Time  Id Command Argument
14:04:23 220 Connect mha@10.20.64.209 on
         220 Query   set autocommit=1
         220 Query   SELECT CONNECTION_ID() AS Value
14:04:24 221 Connect mha@10.20.64.209 on
         221 Query   set autocommit=1
         221 Query   SELECT CONNECTION_ID() AS Value
         221 Query   SET wait_timeout=86400
         221 Query   SELECT @@global.server_id As Value
         221 Query   SELECT VERSION() AS Value
         221 Query   SELECT @@global.gtid_mode As Value
         221 Query   SHOW GLOBAL VARIABLES LIKE 'log_bin'
         221 Query   SHOW MASTER STATUS
         221 Query   SELECT @@global.datadir AS Value
         221 Query   SELECT @@global.slave_parallel_workers AS Value
         221 Query   SHOW SLAVE STATUS
         221 Query   SELECT @@global.read_only As Value
         221 Query   SELECT @@global.relay_log_purge As Value
         221 Query   SELECT @@global.relay_log_info_repository AS Value
         221 Query   SELECT Relay_log_name FROM mysql.slave_relay_log_info
         221 Query   SELECT @@global.datadir AS Value
         221 Query   SHOW SLAVE STATUS
14:04:28 221 Query   SELECT GET_LOCK('MHA_Master_High_Availability_Failover', '0') AS Value
获取MHA Failover锁
         221 Query   SHOW SLAVE STATUS
         221 Query   SHOW SLAVE STATUS
14:04:39 221 Query   SHOW SLAVE STATUS
         221 Query   SHOW SLAVE STATUS
         221 Query   SELECT MASTER_POS_WAIT('mysql-bin.000001','120',0) AS Result
         221 Query   STOP SLAVE SQL_THREAD
停止与老主库同步
         221 Query   SHOW SLAVE STATUS
         221 Query   STOP SLAVE
         221 Query   SHOW SLAVE STATUS
         221 Query   RESET SLAVE
重置同步信息
         221 Query   CHANGE MASTER TO MASTER_HOST = '10.20.64.204' MASTER_USER = 'repl' MASTER_PASSWORD = <secret> MASTER_PORT = 3306 MASTER_LOG_FILE = 'mysql-bin.000004' MASTER_LOG_POS = 120
         221 Query   START SLAVE
         222 Connect Out       repl@10.20.64.204:3306
         221 Query   SHOW SLAVE STATUS
与日志对应，与新主库建立同步
14:04:40 221 Query   SELECT RELEASE_LOCK('MHA_Master_High_Availability_Failover') As Value
释放MHA Failover锁
         221 Quit

上面模拟了下维护切换，我们更加关心的是故障切换

我们停止了主库mysqld
下面是切换屏显

[mha@n-op-209 etc]$ masterha_master_switch --conf=/usr/local/mha-manager/etc/test2.conf --master_state=dead --interactive=0 --dead_master_host=10.20.64.204 --new_master_host=10.20.64.202 --remove_orig_master_conf
Thu Apr  7 16:57:29 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Apr  7 16:57:29 2016 - [info] Reading application default configuration from /usr/local/mha-manager/etc/test2.conf..
Thu Apr  7 16:57:29 2016 - [info] Reading server configuration from /usr/local/mha-manager/etc/test2.conf..

与alive的日志区别，上面看到alive出现了Phase 1、2、5

Phase 3、4 出现在dead切换中，而且分别还有子阶段，可以更详细的帮助我们了解mha的切换机制

实际内容会被打到日志中，且只有故障切换会达到日志中

/usr/local/mha-manager/logs/test.log
Thu Apr  7 16:57:30 2016 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Apr  7 16:57:30 2016 - [info]
第二阶段：主库宕机
（这里就开始和alive不同了）
Thu Apr  7 16:57:30 2016 - [info] HealthCheck: SSH to 10.20.64.204 is reachable.
检测新主ssh通信
Thu Apr  7 16:57:31 2016 - [info] Forcing shutdown so that applications never connect to the current master..
强制停掉，应用不再连接当前主库（老主库）
Thu Apr  7 16:57:31 2016 - [warning] master_ip_failover_script is not set. Skipping invalidating dead master IP address.
Thu Apr  7 16:57:31 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Apr  7 16:57:31 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Apr  7 16:57:31 2016 - [info]
第二阶段完成：确定老主库已停掉
Thu Apr  7 16:57:31 2016 - [info] * Phase 3: Master Recovery Phase..
Thu Apr  7 16:57:31 2016 - [info]
第三阶段：主库恢复阶段
Thu Apr  7 16:57:31 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Apr  7 16:57:31 2016 - [info]
第三阶段第一步：获取执行到最接近最新的binlog和pos点的从库
Thu Apr  7 16:57:31 2016 - [info] The latest binary log file/position on all slaves is mysql-bin.000004:120
最新的是 mysql-bin.000004:120
Thu Apr  7 16:57:31 2016 - [info] Latest slaves (Slaves that received relay log files to the latest):
并且已接收到relay log中
Thu Apr  7 16:57:31 2016 - [info]   10.20.64.203(10.20.64.203:3306)  Version=5.6.27-76.0-log (oldest major version between
 slaves) log-bin:enabled
Thu Apr  7 16:57:31 2016 - [info]     Replicating from 10.20.64.204(10.20.64.204:3306)
Thu Apr  7 16:57:31 2016 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Apr  7 16:57:31 2016 - [info]   10.20.64.202(10.20.64.202:3306)  Version=5.6.27-76.0-log (oldest major version between
 slaves) log-bin:enabled
Thu Apr  7 16:57:31 2016 - [info]     Replicating from 10.20.64.204(10.20.64.204:3306)
Thu Apr  7 16:57:31 2016 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Apr  7 16:57:31 2016 - [info]   10.20.64.210(10.20.64.210:3306)  Version=5.6.27-76.0-log (oldest major version between
 slaves) log-bin:enabled
Thu Apr  7 16:57:31 2016 - [info]     Replicating from 10.20.64.204(10.20.64.204:3306)
Thu Apr  7 16:57:31 2016 - [info]     Not candidate for the new Master (no_master is set)
Thu Apr  7 16:57:31 2016 - [info] The oldest binary log file/position on all slaves is mysql-bin.000004:120
Thu Apr  7 16:57:31 2016 - [info] Oldest slaves:
Thu Apr  7 16:57:31 2016 - [info]   10.20.64.203(10.20.64.203:3306)  Version=5.6.27-76.0-log (oldest major version between
 slaves) log-bin:enabled
Thu Apr  7 16:57:31 2016 - [info]     Replicating from 10.20.64.204(10.20.64.204:3306)
Thu Apr  7 16:57:31 2016 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Apr  7 16:57:31 2016 - [info]   10.20.64.202(10.20.64.202:3306)  Version=5.6.27-76.0-log (oldest major version between
 slaves) log-bin:enabled
Thu Apr  7 16:57:31 2016 - [info]     Replicating from 10.20.64.204(10.20.64.204:3306)
Thu Apr  7 16:57:31 2016 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Apr  7 16:57:31 2016 - [info]   10.20.64.210(10.20.64.210:3306)  Version=5.6.27-76.0-log (oldest major version between
 slaves) log-bin:enabled
Thu Apr  7 16:57:31 2016 - [info]     Replicating from 10.20.64.204(10.20.64.204:3306)
Thu Apr  7 16:57:31 2016 - [info]     Not candidate for the new Master (no_master is set)
Thu Apr  7 16:57:31 2016 - [info]
这段常规从库检测
Thu Apr  7 16:57:31 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Apr  7 16:57:31 2016 - [info]
第三阶段第二步：保存老主库binlog阶段
Thu Apr  7 16:57:31 2016 - [info] Fetching dead master's binary logs..
开始获取主库binlog
Thu Apr  7 16:57:31 2016 - [info] Executing command on the dead master 10.20.64.204(10.20.64.204:3306): save_binary_logs -
-command=save --start_file=mysql-bin.000004  --start_pos=120 --binlog_dir=/data/mysqldata/3306/binlog --output_file=/usr/l
ocal/mha-node/apps/test/saved_master_binlog_from_10.20.64.204_3306_20160407165729.binlog --handle_raw_binlog=1 --disable_l
og_bin=0 --manager_version=0.56 --client_bindir=/usr/local/mysql/bin --client_libdir=/usr/local/mysql/lib
老主库本地脚本save_binary_logs读取最新的binlog和pos，将数据保存到node工作目录
  Creating /usr/local/mha-node/apps/test if not exists..    ok.
  
 Concat binary/relay logs from mysql-bin.000004 pos 120 to mysql-bin.000004 EOF into /usr/local/mha-node/apps/test/saved_m
aster_binlog_from_10.20.64.204_3306_20160407165729.binlog ..
 Binlog Checksum enabled
  Dumping binlog format description event, from position 0 to 120.. ok.
合并pos点120点到结尾的数据到save_master_binlog文件  
  
  Dumping effective binlog data from /data/mysqldata/3306/binlog/mysql-bin.000004 position 120 to tail(143).. ok.
 Binlog Checksum enabled
 Concat succeeded.
最终合并了pos点从120到143的数据
Thu Apr  7 16:57:31 2016 - [info] scp from mha@10.20.64.204:/usr/local/mha-node/apps/test/saved_master_binlog_from_10.20.6
4.204_3306_20160407165729.binlog to local:/usr/local/mha-manager/apps/test/saved_master_binlog_from_10.20.64.204_3306_2016
0407165729.binlog succeeded.
将老主库合并后的slave_master_binlog文件通过scp的方式，拿到manager的工作目录
Thu Apr  7 16:57:31 2016 - [info] HealthCheck: SSH to 10.20.64.203 is reachable.
Thu Apr  7 16:57:32 2016 - [info] HealthCheck: SSH to 10.20.64.202 is reachable.
Thu Apr  7 16:57:32 2016 - [info] HealthCheck: SSH to 10.20.64.210 is reachable.
Thu Apr  7 16:57:33 2016 - [info]
检测从库ssh通讯
Thu Apr  7 16:57:33 2016 - [info] * Phase 3.3: Determining New Master Phase..
Thu Apr  7 16:57:33 2016 - [info]
第三阶段第三步：确定新主阶段
Thu Apr  7 16:57:33 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
寻找所有从库节点中，relay log被恢复到最新pos的从库节点
Thu Apr  7 16:57:33 2016 - [info] All slaves received relay logs to the same position. No need to resync each other.
所有的从库全部恢复到同一个pos点。且不需要mha在从节点间做额外的binlog同步操作
Thu Apr  7 16:57:33 2016 - [info] 10.20.64.202 can be new master.
Thu Apr  7 16:57:33 2016 - [info] New master is 10.20.64.202(10.20.64.202:3306)
按照指定，node1为新主
Thu Apr  7 16:57:33 2016 - [info] Starting master failover..
Thu Apr  7 16:57:33 2016 - [info]
下面开始故障转移工作
From:
10.20.64.204(10.20.64.204:3306) (current master) (test master)
 +--10.20.64.203(10.20.64.203:3306) (test standby)
 +--10.20.64.202(10.20.64.202:3306) (test standby)
 +--10.20.64.210(10.20.64.210:3306) (test slave)
阐述当前架构
To:
10.20.64.202(10.20.64.202:3306) (new master) (test standby)
 +--10.20.64.203(10.20.64.203:3306) (test standby)
 +--10.20.64.210(10.20.64.210:3306) (test slave)
阐述目标架构
Thu Apr  7 16:57:33 2016 - [info]
Thu Apr  7 16:57:33 2016 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Thu Apr  7 16:57:33 2016 - [info]
第三阶段第三步：新主库生成diff log
Thu Apr  7 16:57:33 2016 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
manager服务器拥有全部的relay log，不需要从最近同步的从库生成diff relay log
(如果这里没有拿到主库的binlog差异，那么需要从最新的slave scp relay log到其他从库，并在其他从库上执行apply_diff_relay_log讲数据补齐）
Thu Apr  7 16:57:33 2016 - [info] Sending binlog..
这时，将save_binary_logs生成的老主库binlog推送到新主库的工作目录下
Thu Apr  7 16:57:33 2016 - [info] scp from local:/usr/local/mha-manager/apps/test/saved_master_binlog_from_10.20.64.204_33
06_20160407165729.binlog to mha@10.20.64.202:/usr/local/mha-node/apps/test/saved_master_binlog_from_10.20.64.204_3306_2016
0407165729.binlog succeeded.
通过scp方式推送到新主的工作目录下
Thu Apr  7 16:57:33 2016 - [info]
Thu Apr  7 16:57:33 2016 - [info] * Phase 3.4: Master Log Apply Phase..
Thu Apr  7 16:57:33 2016 - [info]
第三阶段第四步：主库binlog应用阶段
Thu Apr  7 16:57:33 2016 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
提示：如果这个阶段发生报错，那么需要手动去执行apply log操作
Thu Apr  7 16:57:33 2016 - [info] Starting recovery on 10.20.64.202(10.20.64.202:3306)..
在新主库上开始执行save的老主库binlog内容
Thu Apr  7 16:57:33 2016 - [info]  Generating diffs succeeded.
生成差异
Thu Apr  7 16:57:33 2016 - [info] Waiting until all relay logs are applied.
Thu Apr  7 16:57:33 2016 - [info]  done.
这里需要等待新主库上，所有的relay log都已执行完
Thu Apr  7 16:57:33 2016 - [info] Getting slave status..
在新主库上，获取从库状态，来确定它执行和读取的老主库的binlog和pos点信息
Thu Apr  7 16:57:33 2016 - [info] This slave(10.20.64.202)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.0
00004:120). No need to recover from Exec_Master_Log_Pos.
这里表示执行和读取的位置相同，也就是说，没有延迟
Thu Apr  7 16:57:33 2016 - [info] Connecting to the target slave host 10.20.64.202, running recover script..
连接从库（也就是我们指定的新主库）去执行apply_diff_relay_logs脚本
Thu Apr  7 16:57:33 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='mha' --slave_host
=10.20.64.202 --slave_ip=10.20.64.202  --slave_port=3306 --apply_files=/usr/local/mha-node/apps/test/saved_master_binlog_f
rom_10.20.64.204_3306_20160407165729.binlog --workdir=/usr/local/mha-node/apps/test --target_version=5.6.27-76.0-log --tim
estamp=20160407165729 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --client_bindir=/usr/local/mysql/bi
n --client_libdir=/usr/local/mysql/lib --slave_pass=xxx
这些是ssh到node1节点，本地执行apply_diff_relay_logs脚本
将老主库的pos点120-143的binlog内容应用到新主库
Thu Apr  7 16:57:33 2016 - [info]
MySQL client version is 5.6.27. Using --binary-mode.
Applying differential binary/relay log files /usr/local/mha-node/apps/test/saved_master_binlog_from_10.20.64.204_3306_2016
0407165729.binlog on 10.20.64.202:3306. This may take long time...
Applying log files succeeded.
根据binlog内容的不同，这个部分可能会执行很长时间
Thu Apr  7 16:57:33 2016 - [info]  All relay logs were successfully applied.
（我这里为静态测试，所以很快执行完了，且无报错）
Thu Apr  7 16:57:33 2016 - [info] Getting new master's binlog name and position..
Thu Apr  7 16:57:33 2016 - [info]  mysql-bin.000001:120
获取新主库的binlog和pos点信息（用来为后面自动重构主从做准备）
Thu Apr  7 16:57:33 2016 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER
 TO MASTER_HOST='10.20.64.202', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=120, MASTER_USER='rep
l', MASTER_PASSWORD='xxx';
所有的从库重新change master到新主
Thu Apr  7 16:57:33 2016 - [warning] master_ip_failover_script is not set. Skipping taking over new master IP address.
（我们没有配置master_ip_failover_script脚本，这里也跳过了接管新主ip地址的操作）
Thu Apr  7 16:57:33 2016 - [info] ** Finished master recovery successfully.
成功完成新主库恢复操作（主要是将老主库应用到新主库，保证数据最新）
Thu Apr  7 16:57:33 2016 - [info] * Phase 3: Master Recovery Phase completed.
Thu Apr  7 16:57:33 2016 - [info]
到此新主库接管完毕
Thu Apr  7 16:57:33 2016 - [info] * Phase 4: Slaves Recovery Phase..
Thu Apr  7 16:57:33 2016 - [info]
第四阶段：从库恢复阶段
Thu Apr  7 16:57:33 2016 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Thu Apr  7 16:57:33 2016 - [info]
第四阶段第一步：从库并行生成diff log
Thu Apr  7 16:57:33 2016 - [info] -- Slave diff file generation on host 10.20.64.203(10.20.64.203:3306) started, pid: 3026
在node2上生成diff log
2. Check tmp log /usr/local/mha-manager/apps/test/10.20.64.203_3306_20160407165729.log if it takes time..
3. Thu Apr  7 16:57:33 2016 - [info] -- Slave diff file generation on host 10.20.64.210(10.20.64.210:3306) started, pid: 3026
在node4上生成diff log
3. Check tmp log /usr/local/mha-manager/apps/test/10.20.64.210_3306_20160407165729.log if it takes time..
Thu Apr  7 16:57:34 2016 - [info]
从库本地生成diff log完成
Thu Apr  7 16:57:34 2016 - [info] Log messages from 10.20.64.210 ...
Thu Apr  7 16:57:34 2016 - [info]
Thu Apr  7 16:57:33 2016 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Thu Apr  7 16:57:34 2016 - [info] End of log messages from 10.20.64.210.
Thu Apr  7 16:57:34 2016 - [info] -- 10.20.64.210(10.20.64.210:3306) has the latest relay log events.
Thu Apr  7 16:57:34 2016 - [info]
node4节点返回的日志信息，relay log为最新的
Thu Apr  7 16:57:34 2016 - [info] Log messages from 10.20.64.203 ...
Thu Apr  7 16:57:34 2016 - [info]
Thu Apr  7 16:57:33 2016 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Thu Apr  7 16:57:34 2016 - [info] End of log messages from 10.20.64.203.
Thu Apr  7 16:57:34 2016 - [info] -- 10.20.64.203(10.20.64.203:3306) has the latest relay log events.
node2节点返回的日志信息，relay log为最新的
Thu Apr  7 16:57:34 2016 - [info] Generating relay diff files from the latest slave succeeded.
Thu Apr  7 16:57:34 2016 - [info]
从最新的从库生成diff log完成
Thu Apr  7 16:57:34 2016 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Thu Apr  7 16:57:34 2016 - [info]
第四阶段第二步：从库并行应用从库日志
Thu Apr  7 16:57:34 2016 - [info] -- Slave recovery on host 10.20.64.203(10.20.64.203:3306) started, pid: 30266. Check tmp
 log /usr/local/mha-manager/apps/test/10.20.64.203_3306_20160407165729.log if it takes time..
 
Thu Apr  7 16:57:34 2016 - [info] -- Slave recovery on host 10.20.64.210(10.20.64.210:3306) started, pid: 30267. Check tmp
 log /usr/local/mha-manager/apps/test/10.20.64.210_3306_20160407165729.log if it takes time..
Thu Apr  7 16:57:35 2016 - [info]
分别在2个节点进行恢复操作
Thu Apr  7 16:57:35 2016 - [info] Log messages from 10.20.64.203 ...
Thu Apr  7 16:57:35 2016 - [info]
Thu Apr  7 16:57:34 2016 - [info] Sending binlog..
Thu Apr  7 16:57:34 2016 - [info] scp from local:/usr/local/mha-manager/apps/test/saved_master_binlog_from_10.20.64.204_33
06_20160407165729.binlog to mha@10.20.64.203:/usr/local/mha-node/apps/test/saved_master_binlog_from_10.20.64.204_3306_2016
0407165729.binlog succeeded.
将老主库最后的binlog拷贝到node2节点
Thu Apr  7 16:57:34 2016 - [info] Starting recovery on 10.20.64.203(10.20.64.203:3306)..
Thu Apr  7 16:57:34 2016 - [info]  Generating diffs succeeded.
Thu Apr  7 16:57:34 2016 - [info] Waiting until all relay logs are applied.
Thu Apr  7 16:57:34 2016 - [info]  done.
在node2节点完成老主库binlog的恢复操作，如果内容比较多，恢复时间会比较长
Thu Apr  7 16:57:34 2016 - [info] Getting slave status..
Thu Apr  7 16:57:34 2016 - [info] This slave(10.20.64.203)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.0
00004:120). No need to recover from Exec_Master_Log_Pos.
获取从库同步信息，查看执行和读取的老主库binlog和pos点信息
Thu Apr  7 16:57:34 2016 - [info] Connecting to the target slave host 10.20.64.203, running recover script..
ssh连接到node2节点，并执行apply_diff_relay_logs脚本
Thu Apr  7 16:57:34 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='mha' --slave_host
=10.20.64.203 --slave_ip=10.20.64.203  --slave_port=3306 --apply_files=/usr/local/mha-node/apps/test/saved_master_binlog_f
rom_10.20.64.204_3306_20160407165729.binlog --workdir=/usr/local/mha-node/apps/test --target_version=5.6.27-76.0-log --tim
estamp=20160407165729 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --client_bindir=/usr/local/mysql/bi
n --client_libdir=/usr/local/mysql/lib --slave_pass=xxx
Thu Apr  7 16:57:35 2016 - [info]
向从库执行老主库的binlog
MySQL client version is 5.6.27. Using --binary-mode.
Applying differential binary/relay log files /usr/local/mha-node/apps/test/saved_master_binlog_from_10.20.64.204_3306_2016
0407165729.binlog on 10.20.64.203:3306. This may take long time...
Applying log files succeeded.
Thu Apr  7 16:57:35 2016 - [info]  All relay logs were successfully applied.
执行binlog完成，如果binlog内容多，执行时间会比较长
（这时node2节点和新主的数据一致，都恢复到老主库最新的binlog点上）
Thu Apr  7 16:57:35 2016 - [info]  Resetting slave 10.20.64.203(10.20.64.203:3306) and starting replication from the new m
aster 10.20.64.202(10.20.64.202:3306)..
那么，这时可以进行reset slave操作，并建立新的主从关系
Thu Apr  7 16:57:35 2016 - [info]  Executed CHANGE MASTER.
Thu Apr  7 16:57:35 2016 - [info]  Slave started.
Thu Apr  7 16:57:35 2016 - [info] End of log messages from 10.20.64.203.
Thu Apr  7 16:57:35 2016 - [info] -- Slave recovery on host 10.20.64.203(10.20.64.203:3306) succeeded.
Thu Apr  7 16:57:35 2016 - [info]
到此，node2节点已和新主完成主从同步
（到这里，我们可以认为故障转移已完成，且非单点）
下面剩余从节点操作与node2节点操作相同
Thu Apr  7 16:57:35 2016 - [info] Log messages from 10.20.64.210 ...
Thu Apr  7 16:57:35 2016 - [info]
Thu Apr  7 16:57:34 2016 - [info] Sending binlog..
Thu Apr  7 16:57:34 2016 - [info] scp from local:/usr/local/mha-manager/apps/test/saved_master_binlog_from_10.20.64.204_33
06_20160407165729.binlog to mha@10.20.64.210:/usr/local/mha-node/apps/test/saved_master_binlog_from_10.20.64.204_3306_2016
0407165729.binlog succeeded.
Thu Apr  7 16:57:34 2016 - [info] Starting recovery on 10.20.64.210(10.20.64.210:3306)..
Thu Apr  7 16:57:34 2016 - [info]  Generating diffs succeeded.
Thu Apr  7 16:57:34 2016 - [info] Waiting until all relay logs are applied.
Thu Apr  7 16:57:34 2016 - [info]  done.
Thu Apr  7 16:57:34 2016 - [info] Getting slave status..
Thu Apr  7 16:57:34 2016 - [info] This slave(10.20.64.210)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.0
00004:120). No need to recover from Exec_Master_Log_Pos.
Thu Apr  7 16:57:34 2016 - [info] Connecting to the target slave host 10.20.64.210, running recover script..
Thu Apr  7 16:57:34 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='mha' --slave_host
=10.20.64.210 --slave_ip=10.20.64.210  --slave_port=3306 --apply_files=/usr/local/mha-node/apps/test/saved_master_binlog_f
rom_10.20.64.204_3306_20160407165729.binlog --workdir=/usr/local/mha-node/apps/test --target_version=5.6.27-76.0-log --tim
estamp=20160407165729 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --client_bindir=/usr/local/mysql/bi
n --client_libdir=/usr/local/mysql/lib --slave_pass=xxx
Thu Apr  7 16:57:35 2016 - [info]
MySQL client version is 5.6.27. Using --binary-mode.
Applying differential binary/relay log files /usr/local/mha-node/apps/test/saved_master_binlog_from_10.20.64.204_3306_2016
0407165729.binlog on 10.20.64.210:3306. This may take long time...
Applying log files succeeded.
Thu Apr  7 16:57:35 2016 - [info]  All relay logs were successfully applied.
Thu Apr  7 16:57:35 2016 - [info]  Resetting slave 10.20.64.210(10.20.64.210:3306) and starting replication from the new m
aster 10.20.64.202(10.20.64.202:3306)..
Thu Apr  7 16:57:35 2016 - [info]  Executed CHANGE MASTER.
Thu Apr  7 16:57:35 2016 - [info]  Slave started.
Thu Apr  7 16:57:35 2016 - [info] End of log messages from 10.20.64.210.
Thu Apr  7 16:57:35 2016 - [info] -- Slave recovery on host 10.20.64.210(10.20.64.210:3306) succeeded.
Thu Apr  7 16:57:35 2016 - [info] All new slave servers recovered successfully.
Thu Apr  7 16:57:35 2016 - [info]
到此，新架构完成，所有从节点已与新主建立主从同步
Thu Apr  7 16:57:35 2016 - [info] * Phase 5: New master cleanup phase..
Thu Apr  7 16:57:35 2016 - [info]
第四阶段：新主库清理工作
Thu Apr  7 16:57:35 2016 - [info] Resetting slave info on the new master..
Thu Apr  7 16:57:35 2016 - [info]  10.20.64.202: Resetting slave info succeeded.
node1已经是主库身份，那么它不再隶属于谁，所以进行reset slave操作，清理同步信息
Thu Apr  7 16:57:35 2016 - [info] Master failover to 10.20.64.202(10.20.64.202:3306) completed successfully.
到此，故障转移成功完成
Thu Apr  7 16:57:35 2016 - [info] Deleted server1 entry from /usr/local/mha-manager/etc/test2.conf .
Thu Apr  7 16:57:35 2016 - [info]
清理配置文件中老主库信息，我们可以使用这个配置文件，再次启动masterha_manager脚本
----- Failover Report -----
这里会总结一个故障转移报告
test2: MySQL Master failover 10.20.64.204(10.20.64.204:3306) to 10.20.64.202(10.20.64.202:3306) succeeded
test2应用成功将204故障转移到202
Master 10.20.64.204(10.20.64.204:3306) is down!
老主库204:3306实例宕掉
Check MHA Manager logs at n-op-209.corp.ncfgroup.com:/usr/local/mha-manager/logs/test.log for details.
切换日志路径
Started automated(non-interactive) failover.
The latest slave 10.20.64.203(10.20.64.203:3306) has all relay logs for recovery.
203是最近的实例
Selected 10.20.64.202(10.20.64.202:3306) as a new master.
选择202作为新主（这个是我们用选项指定的）
10.20.64.202(10.20.64.202:3306): OK: Applying all logs succeeded.
老主库的binlog已成功应用于所有实例
10.20.64.210(10.20.64.210:3306): This host has the latest relay log events.
10.20.64.203(10.20.64.203:3306): This host has the latest relay log events.
210和203的relay log都是最新的事件
Generating relay diff files from the latest slave succeeded.
从最近的从节点获取relay log的diff成功
10.20.64.203(10.20.64.203:3306): OK: Applying all logs succeeded. Slave started, replicating from 10.20.64.202(10.20.64.20
2:3306)
node2将与新主node4的relay log的差异执行完成
10.20.64.210(10.20.64.210:3306): OK: Applying all logs succeeded. Slave started, replicating from 10.20.64.202(10.20.64.20
2:3306)
node5将与新主node4的relay log的差异执行完成
分别在从库执行diff relay log的内容，将数据补齐，并与新主建立主从同步
10.20.64.202(10.20.64.202:3306): Resetting slave info succeeded.
Master failover to 10.20.64.202(10.20.64.202:3306) completed successfully.
新主清理与老主同步关系
到此，故障转移成功完成

我们分别到mha-node和mha-manager的工作目录看看

1 2	/usr/local/mha-node/apps/test /usr/local/mha-manager/apps/test

mha-node

老主库
-rw-r--r-- 1 mha mysql  143 4月   7 16:57 saved_master_binlog_from_10.20.64.204_3306_20160407165729.binlog
新主库
-rw-r--r-- 1 mha mysql  143 4月   7 16:57 saved_master_binlog_from_10.20.64.204_3306_20160407165729.binlog
剩余2从库抽样
-rw-r--r-- 1 mha mysql  143 4月   7 16:57 saved_master_binlog_from_10.20.64.204_3306_20160407165729.binlog
-rw-r--r-- 1 mha mysql 1096 4月   7 16:57 relay_log_apply_for_10.20.64.210_3306_20160407165729_err.log

mha-manager

1 2	-rw-r--r-- 1 mha mysql 143 4月 7 16:57 saved_master_binlog_from_10.20.64.204_3306_20160407165729.binlog -rw-r--r-- 1 mha mysql 0 4月 7 16:57 test2.failover.complete

配置文件

[mha@n-op-209 etc]$ cat test2.conf
[server default]
...
这里，老主库的[server1]块被清除了
[server2]
candidate_master=1
hostname=10.20.64.203
node_label=test standby
[server3]
candidate_master=1
hostname=10.20.64.202
node_label=test standby
[server4]
hostname=10.20.64.210
no_master=1
node_label=test slave

结论：MHA会产生脑裂，目前都是HA，所以DBA处理好写指向问题

另外，从MySQL本身来说，可以结合Semisynchronous Replication半同步，应该是一个不错的选择