【 MGR 】02 - Summary of configuring MGR cluster

文章目录

1. Introduction
2. 第一个问题: 强制使用root创建mgr相关账户
3. 第二个问题: MGR集群成员host的默认值
4. Summary

Introduction

This is second post in the series. In this post, i will share something tricky.

本文讲述MGR两个比较诡异的地方:

加载 MGR plugin的时候,用写死root用户来进行plugin associated user account 的创建
配置组成员的时候, 默认member_host会采用系统的 hostname,导致无法加入新成员

第一个问题: 强制使用root创建mgr相关账户

详细描述

当系统中没有root账户的时候, 无法安装 MGR plugin, 报错. 经过分析代码, 发现是加载 MGR plugin 的时候,
需要创建 _gr_user 账户, 创建账户的代码写死了,必须使用 root .

代码分析

代码文件路径:

rapid/plugin/group_replication/src/sql_service/sql_service_gr_user.cc

具体代码:

int create_group_replication_user(bool threaded,
                                  Sql_service_interface *sql_interface)
{
  DBUG_ENTER("create_group_replication_user");
  int error = 0;
  Sql_service_interface *server_interface= NULL;
  if (sql_interface == NULL) {
    server_interface= new Sql_service_interface();
    if (!threaded)
      error = server_interface->open_session();
    else
      error = server_interface->open_thread_session(get_plugin_pointer());

    if (error) {
      /* purecov: begin inspected */
      log_message(MY_ERROR_LEVEL,
                  "Can't establish a internal server connection to execute"
                  " plugin operations");
      delete server_interface;
      DBUG_RETURN(error);
      /* purecov: end */
    }
  }
  else
  {
    server_interface= sql_interface;
  }

  error= server_interface->set_session_user("root");
    
  if (error)
  {
    /* purecov: begin inspected */
    log_message(MY_ERROR_LEVEL,
                "Can't use the root account to create the plugin associated user"
                " account to access the server.");
    if (sql_interface == NULL)
      delete server_interface;
    DBUG_RETURN(error);
    /* purecov: end */
  }

  long srv_err= 0;
  std::string query;
  query.assign("SET @GR_OLD_LOG_BIN=@@SQL_LOG_BIN;");
  if ((srv_err= execute_user_query(server_interface,query)))
    goto err; /* purecov: inspected */

  query.assign("SET SESSION SQL_LOG_BIN=0;");
  if ((srv_err= execute_user_query(server_interface,query)))
    goto err; /* purecov: inspected */

  query.assign("CREATE USER IF NOT EXISTS " GROUPREPL_ACCOUNT " IDENTIFIED"
               " WITH mysql_native_password AS"
               " '*7CF5CA9067EC647187EB99FCC27548FBE4839AE3' ACCOUNT LOCK;");
  if ((srv_err= execute_user_query(server_interface,query)))
    goto err; /* purecov: inspected */

  query.assign("GRANT SELECT ON performance_schema.replication_connection_status"
               " TO " GROUPREPL_ACCOUNT);
  if ((srv_err= execute_user_query(server_interface,query)))
    goto err; /* purecov: inspected */

  query.assign("GRANT SUPER ON *.* TO " GROUPREPL_ACCOUNT);
  if ((srv_err= execute_user_query(server_interface,query)))
    goto err; /* purecov: inspected */

  query.assign("FLUSH PRIVILEGES;");
  if ((srv_err= execute_user_query(server_interface,query)))
    goto err; /* purecov: inspected */

err:
  query.assign("SET SESSION SQL_LOG_BIN=@GR_OLD_LOG_BIN;");
  srv_err+= execute_user_query(server_interface,query);

  if (sql_interface == NULL)
  {
    delete server_interface;
  }

  DBUG_RETURN(srv_err);
}

第29行将root用户写死在里面了, 本来安装我上一篇文章进行部署是不会有问题的, 我们组小伙伴部署的时候,将root账户删除了, 提示无法加载plugin, 报如下错误.

2016-12-26T13:20:54.037658+08:00 8 [ERROR] Plugin group_replication reported: 'Can't use the root account to create the plugin associated user account to access the server.'
2016-12-26T13:20:54.037666+08:00 8 [ERROR] Plugin group_replication reported: 'Could not evaluate if the group replication user is present in the server'
2016-12-26T13:20:54.037692+08:00 8 [Note] Plugin group_replication reported: 'Requesting to leave the group despite of not being a member'
2016-12-26T13:20:54.037699+08:00 8 [ERROR] Plugin group_replication reported: 'Error calling group communication interfaces while trying to leave the group'

问题处理

既然找到原因, 把root账户恢复, 就可以正常加载了.

这个问题,已经向官方提交了bug
链接如下: http://bugs.mysql.com/bug.php?id=84337
不过后来发现八月份的时候就有一个在lab版本的bug http://bugs.mysql.com/bug.php?id=82687 , 而发布正式版的时候没有修正.

建议: 不写死root账户

可以设置参数控制, 可以动态修改值, 默认值为 root
创建用户的时候,获取有创建新账户的用户,然后创建MGR 相关内部账户

第二个问题: MGR集群成员host的默认值

问题描述

容器hostname:

[root@b9cc57bc41f3 /]# hostname
b9cc57bc41f3
[root@b9cc57bc41f3 /]# hostname -i
172.17.0.2

配置完成第一个节点:
查看组member如下:

root@(none) 02:14:47>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 1894ecbf-cd6a-11e6-b76f-0242ac110002 | b9cc57bc41f3 |        3306 | ONLINE       |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
1 row in set (0.00 sec)

MEMBER_HOST 一列显示是容器的 hostname, 这样一看貌似没有问题, 但是会导致加入新节点成功,但是做全局恢复的时候失败.(节点加入逻辑后续会详细讲解).
下面我们加入第二个节点:
加入第二个节点之后:

root@(none) 02:18:20>root@(none) 02:18:20>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 1894ecbf-cd6a-11e6-b76f-0242ac110002 | b9cc57bc41f3 |        3306 | ONLINE       |
| group_replication_applier | 1eff4c26-cd6a-11e6-b90d-0242ac110003 | 906ac9ba2dd9 |        3306 | RECOVERING   |
+---------------------------+--------------------------------------+--------------+-------------+--------------+

在新加入的新节点上面看日志信息:

'Starting group replication recovery with view_id 14829776327826555:2'

2016-12-29T02:17:56.390895-00:00 13 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='b9cc57bc41f3', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2016-12-29T02:17:56.699417-00:00 13 [Note] Plugin group_replication reported: 'Establishing connection to a group replication recovery donor 1894ecbf-cd6a-11e6-b76f-0242ac110002 at b9cc57bc41f3 port: 3306.'
2016-12-29T02:17:56.700203-00:00 14 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2016-12-29T02:17:56.730434-00:00 15 [Note] Slave SQL thread for channel 'group_replication_recovery' initialized, starting replication in log 'FIRST' at position 0, relay log '/var/lib/mysql/relaylog-group_replication_recovery.000001' position: 4
2016-12-29T02:18:29.761682-00:00 14 [ERROR] Slave I/O for channel 'group_replication_recovery': error connecting to master 'repl@b9cc57bc41f3:3306' - retry-time: 60  retries: 1, Error_code: 2005
2016-12-29T02:18:29.761741-00:00 14 [Note] Slave I/O thread for channel 'group_replication_recovery' killed while connecting to master ......

其中可以看到: 选择donor的时候, 是通过hostname 做change master的, 然而 hostname b9cc57bc41f3 是不可以解析的. 导致了

1
2
3

CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. 
Previous state master_host='b9cc57bc41f3', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind='' .
New state master_host='b9cc57bc41f3', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''

从而导致了无法正常连接到 donor, 最终导致恢复失败, replication_group_members 信息中 MEMBER_STATE 一直是 RECOVERING状态.

There was an error when connecting to the donor server.
Check group replication recovery's connection credentials.

当recovery恢复尝试一定次数之后, 会退出MGR Cluster:

1 2	2016-12-29T02:32:29.568996-00:00 13 [ERROR] Plugin group_replication reported: 'Fatal error during the Recovery process of Group Replication. The server will leave the group.'

代码分析

代码文件路径:

rapid/plugin/group_replication/src/plugin.cc

调试到具体代码发现是在 configure_group_member_manager 函数中声明的hostname, 调用 get_server_parameters函数进行初始化的

具体configure_group_member_manager函数代码:

int configure_group_member_manager()
{
  DBUG_ENTER("configure_group_member_manager");

  /*
    Ensure that group communication interfaces are initialized
    and ready to use, since plugin can leave the group on errors
    but continue to be active.
  */
  std::string gcs_local_member_identifier;
  if (gcs_module->get_local_member_identifier(gcs_local_member_identifier))
  {
    /* purecov: begin inspected */
    log_message(MY_ERROR_LEVEL, "Error calling group communication interfaces");
    DBUG_RETURN(GROUP_REPLICATION_COMMUNICATION_LAYER_SESSION_ERROR);
    /* purecov: end */
  }

  //Configure Group Member Manager
  char *hostname, *uuid;
  uint port;
  unsigned int server_version;
  get_server_parameters(&hostname, &port, &uuid, &server_version);
  plugin_version= server_version;

  uint32 local_version= plugin_version;
  DBUG_EXECUTE_IF("group_replication_compatibility_higher_patch_version",
                  {
                    local_version= plugin_version + (0x000001);
                  };);
  DBUG_EXECUTE_IF("group_replication_compatibility_higher_minor_version",
                  {
                    local_version= plugin_version + (0x000100);
                  };);
  DBUG_EXECUTE_IF("group_replication_compatibility_higher_major_version",
                  {
                    local_version= plugin_version + (0x010000);
                  };);
  Member_version local_member_plugin_version(local_version);
  delete local_member_info;
  local_member_info= new Group_member_info(hostname,
                                           port,
                                           uuid,
                                           write_set_extraction_algorithm,
                                           gcs_local_member_identifier,
                                           Group_member_info::MEMBER_OFFLINE,
                                           local_member_plugin_version,
                                           gtid_assignment_block_size_var,
                                           Group_member_info::MEMBER_ROLE_SECONDARY,
                                           single_primary_mode_var,
                                           enforce_update_everywhere_checks_var);

  //Create the membership info visible for the group
  delete group_member_mgr;
  group_member_mgr= new Group_member_info_manager(local_member_info);

  DBUG_RETURN(0);
}

关于get_server_parameters:

路径:

sql/rpl_group_replication.cc -- row_num 326  行

代码:

void get_server_parameters(char **hostname, uint *port, char** uuid,
                           unsigned int *out_server_version)
{
  /*
    use startup option report-host and report-port when provided,
    as value provided by glob_hostname, which used gethostname() function
    internally to determine hostname, will not always provide correct
    network interface, especially in case of multiple network interfaces.
  */
  if (report_host)
    *hostname= report_host;
  else
    *hostname= glob_hostname;

  if (report_port)
    *port= report_port;
  else
    *port= mysqld_port;

  *uuid= server_uuid;

  //Convert server version to hex

  ulong major= 0, minor= 0, patch= 0;
  char *pos= server_version, *end_pos;
  //extract each server decimal number, e.g., for 5.9.30 -> 5, 9 and 30
  major= strtoul(pos, &end_pos, 10);  pos=end_pos+1;
  minor= strtoul(pos, &end_pos, 10);  pos=end_pos+1;
  patch= strtoul(pos, &end_pos, 10);

  /*
    Convert to a equivalent hex representation.
    5.9.30 -> 0x050930
    version= 0 x 16^5 + 5 x 16^4 + 0 x 16^3 + 9 x 16^2 + 3 x 16^1 + 0 x 16^0
  */
  int v1= patch / 10;
  int v0= patch - v1 * 10;
  int v3= minor / 10;
  int v2= minor - v3 * 10;
  int v5= major / 10;
  int v4= major - v5 * 10;

  *out_server_version= v0 + v1 * 16 + v2 * 256 + v3 * 4096 + v4 * 65536 + v5 * 1048576;

  return;
}

从10-20行可以看出 hostname 是可以通过 report_host来进行配置的.

问题处理

问题原因分析出来了,解决办法也就很明显了:

解决办法 1

每个实例部署的时候,配置好 report_host

设置完 report_host 之后:

root@(none) 09:24:01>show variables like '%report%';
+-----------------+------------+
| Variable_name   | Value      |
+-----------------+------------+
| report_host     | 172.17.0.2 |
| report_port     | 3306       | 
+-----------------+------------+

配置完成之后: (member_host列展示的是 docker的 ip)

root@(none) 09:23:59>select * from  performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 764aaaf4-cda6-11e6-84d0-0242ac110002 | 172.17.0.2  |        3306 | ONLINE       |
| group_replication_applier | c8379254-cda6-11e6-8738-0242ac110003 | 172.17.0.3  |        3306 | ONLINE       |
| group_replication_applier | cea19648-cda6-11e6-8817-0242ac110004 | 172.17.0.4  |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

解决办法 2

把组里的成员列表, ip 和host 的对应关系写入到 /etc/hosts 文件.

Summary

MGR的配置过程已经算是很简单了,内部自动化程度相对比较高了, 遇到的问题,记录分享一下, 然大家也少走一些弯路. 后续…

ruiayLin's Blog

蹈之而弗悔 , 慎思而笃行 ...

【 MGR 】02 - Summary of configuring MGR cluster

Introduction

第一个问题: 强制使用root创建mgr相关账户

详细描述

代码分析

问题处理

第二个问题: MGR集群成员host的默认值

问题描述

代码分析

问题处理

解决办法 1

解决办法 2

Summary