KingbaseES: Automatic Database Failover Failed

Date: 2023-02-21 19:14:31
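Applies to:

KingbaseES V8R6.

repmgr configuration:

First check the repmgr.conf configuration file and confirm that the primary node and the standby nodes use the same values for these parameters: failover='automatic' and recovery='standby'.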
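
The check described above could be scripted along these lines. This is a minimal sketch, not a KingbaseES tool: the function names conf_value and check_failover_conf are hypothetical, the location of repmgr.conf depends on the installation and is passed in as an argument, and the parser assumes simple one-line key = value entries.

```shell
#!/bin/sh
# Print the value of one key from a repmgr.conf-style file.
# Assumptions: one "key = value" entry per line, commented-out lines start
# with '#', and if a key appears more than once the last entry is reported.
# Values that themselves contain '#' are not handled.
conf_value() {
    # $1 = path to the config file, $2 = parameter name
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" \
        | tail -n 1 \
        | sed -e 's/[[:space:]]*#.*$//' \
              -e 's/[[:space:]]*$//' \
              -e "s/^['\"]//" \
              -e "s/['\"]\$//"
}

# Return 0 only if failover is 'automatic' and recovery is 'standby'.
check_failover_conf() {
    # $1 = path to the config file
    [ "$(conf_value "$1" failover)" = "automatic" ] &&
        [ "$(conf_value "$1" recovery)" = "standby" ]
}
```

Running check_failover_conf against the repmgr.conf of every node and comparing the results shows quickly whether one node deviates from the rest.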

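I. Symptoms:

  1. Automatic database failover fails; that is, the failover switchover does not complete.
  2. None of the other healthy standby nodes is selected (promoted) as the new primary node.
  3. The hamgr.log on an available KingbaseES standby node contains entries similar to the following.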
[2023-02-10 11:39:46] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-02-10 11:39:46] [INFO] sleeping up to 6 seconds until next reconnection attempt
[2023-02-10 11:39:52] [INFO] checking state of node "node20" (ID: 1), 10 of 10 attempts
[2023-02-10 11:39:52] [WARNING] unable to reconnect to node "node20" (ID: 1) after 10 attempts
[2023-02-10 11:39:52] [NOTICE] repmgrd on this node is paused
[2023-02-10 11:39:52] [DETAIL] no failover will be carried out
[2023-02-10 11:39:52] [HINT] execute "repmgr service unpause" to resume normal failover mode
  4. The kbha.log on an available KingbaseES standby node contains entries similar to the following.
[2023-02-10 10:31:34] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID
[2023-02-10 10:31:34] [DEBUG] repmgrd is running, can not start another one.
[2023-02-10 10:31:37] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.

--- 10.10.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1029ms
rtt min/avg/max/mdev = 0.729/0.745/0.761/0.016 ms
...
[2023-02-10 11:32:37] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID
[2023-02-10 11:32:37] [DEBUG] repmgrd is running, can not start another one.
[2023-02-10 11:32:37] [DEBUG] the thread 428402432 is still running
ping: socket: 不允许的操作 or  Operation not permitted
[2023-02-10 11:32:38] [NOTICE] 
[2023-02-10 11:32:38] [WARNING] ping host"10.10.10.1" failed
[2023-02-10 11:32:38] [DETAIL] average RTT value is not greater than zero
[2023-02-10 11:32:38] [DEBUG] ping process end early. usleep(1978998).
ping: socket: 不允许的操作 or  Operation not permitted

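Note: the log records above are only examples. Dates, times, and environment-specific values may differ from one environment to another.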

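II. Troubleshooting:

Following the KingbaseES service continuity operations guide (https://help.kingbase.com.cn/v8/admin/general/maintenance/maintenance-1.html), when the database cluster fails or goes down unexpectedly, use hamgr.log and kbha.log to locate the cause.

1. hamgr.log:

# Node 1: hamgr log on the primary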
[2023-02-10 11:37:58] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2023-02-10 11:37:58] [INFO] connecting to database "host=10.10.10.20 user=esrep dbname=esrep port=5432 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
INFO:  set_repmgrd_pid(): provided pidfile is /home/kingbase/cluster/kingbase/etc/hamgrd.pid
[2023-02-10 11:37:58] [NOTICE] starting monitoring of node "node20" (ID: 1)
[2023-02-10 11:37:58] [INFO] "connection_check_type" set to "mix"
[2023-02-10 11:37:58] [NOTICE] monitoring cluster primary "node20" (ID: 1)
[2023-02-10 11:37:59] [INFO] child node "node21" (ID: 2) is attached
[2023-02-10 11:38:22] [NOTICE] TERM signal received
[2023-02-10 11:38:22] [ERROR] unable to determine if server is in recovery
[2023-02-10 11:38:22] [DETAIL] 
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

[2023-02-10 11:38:22] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2023-02-10 11:38:22] [INFO] repmgrd terminating...

# Node 2: hamgr log on the standby
[2023-02-10 11:38:00] [NOTICE] starting monitoring of node "node21" (ID: 2)
[2023-02-10 11:38:00] [INFO] "connection_check_type" set to "mix"
[2023-02-10 11:38:00] [INFO] monitoring connection to upstream node "node20" (ID: 1)
[2023-02-10 11:38:00] [NOTICE] try to change wal catched_up state to 1
[2023-02-10 11:38:00] [INFO] primary flush lsn is AE/DF000590, local flush lsn is AE/DF000590
[2023-02-10 11:38:00] [NOTICE] try to change streaming_sync state to TRUE
[2023-02-10 11:38:23] [WARNING] unable to ping "host=10.10.10.20 user=esrep dbname=esrep port=5432 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2023-02-10 11:38:23] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-02-10 11:38:23] [WARNING] unable to connect to upstream node "node20" (ID: 1)
[2023-02-10 11:38:23] [INFO] checking state of node "node20" (ID: 1), 1 of 10 attempts
[2023-02-10 11:38:23] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=10.10.10.20 port=5432 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
...
[2023-02-10 11:39:46] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-02-10 11:39:46] [INFO] sleeping up to 6 seconds until next reconnection attempt
[2023-02-10 11:39:52] [INFO] checking state of node "node20" (ID: 1), 10 of 10 attempts
[2023-02-10 11:39:52] [WARNING] unable to reconnect to node "node20" (ID: 1) after 10 attempts
[2023-02-10 11:39:52] [NOTICE] repmgrd on this node is paused
[2023-02-10 11:39:52] [DETAIL] no failover will be carried out
[2023-02-10 11:39:52] [HINT] execute "repmgr service unpause" to resume normal failover mode
[2023-02-10 11:43:02] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-02-10 11:43:02] [INFO] node "node21" (ID: 2) monitoring upstream node "node20" (ID: 1) in degraded state
[2023-02-10 11:43:02] [DETAIL] repmgrd paused by administrator
[2023-02-10 11:43:02] [HINT] execute "repmgr service unpause" to resume normal failover mode

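2. What hamgr.log shows:

The primary node went down at [2023-02-10 11:38:22]. The available standby noticed at [2023-02-10 11:38:23] that the primary was unreachable. Once the number of reconnection attempts reached the reconnect_attempts threshold (10 in this log), automatic failover should have started. Instead, the standby's hamgr.log at [2023-02-10 11:39:52] shows that repmgrd on that node was paused, so no failover was carried out.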
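
The pause messages quoted above can be pulled out of a log with a small filter. This is a sketch only: the function name find_paused_failover is hypothetical, and the message strings are copied from the sample hamgr.log in this article, so other versions may word them differently.

```shell
#!/bin/sh
# Print every log line indicating that repmgrd skipped failover
# because it was in the paused state.
find_paused_failover() {
    # $1 = path to hamgr.log (its location depends on the installation)
    grep -E 'repmgrd on this node is paused|no failover will be carried out|repmgrd paused by administrator' "$1"
}
```

If the function prints any lines for a standby, that node did not take part in failover; the HINT entry in the same log names "repmgr service unpause" as the way to resume normal failover mode.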

3. kbha.log:

# kbha.log on the standby node
--- 10.10.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1019ms
rtt min/avg/max/mdev = 0.729/0.737/0.745/0.008 ms

[2023-02-10 10:31:34] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID
[2023-02-10 10:31:34] [DEBUG] repmgrd is running, can not start another one.
[2023-02-10 10:31:37] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.

--- 10.10.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1029ms
rtt min/avg/max/mdev = 0.729/0.745/0.761/0.016 ms
...
[2023-02-10 11:32:37] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID
[2023-02-10 11:32:37] [DEBUG] repmgrd is running, can not start another one.
[2023-02-10 11:32:37] [DEBUG] the thread 428402432 is still running
ping: socket: 不允许的操作 or  Operation not permitted
[2023-02-10 11:32:38] [NOTICE] 
[2023-02-10 11:32:38] [WARNING] ping host"10.10.10.1" failed
[2023-02-10 11:32:38] [DETAIL] average RTT value is not greater than zero
[2023-02-10 11:32:38] [DEBUG] ping process end early. usleep(1978998).
ping: socket: 不允许的操作 or  Operation not permitted

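4. What the standby's kbha.log shows:

At [2023-02-10 10:31:37] the standby could still use ping to check that the cluster gateway was reachable. From [2023-02-10 11:32:37] onward, ping failed with "ping: socket: Operation not permitted" (不允许的操作 on a Chinese locale). The cause is that the permissions on the ping binary had been changed. ping uses ICMP and has to send ICMP packets, and opening the raw socket this requires needs root privileges (or the CAP_NET_RAW capability). The correct mode for ping here is therefore -rwsr-xr-x, that is, a setuid file owned by root. Once the setuid bit is removed, ordinary users can no longer use the command. (On CentOS Linux, ordinary users can use ping even with mode -rwxr-xr-x, presumably because the distribution grants the privilege another way, such as file capabilities. The failed environment ran Kylin Linux, where ping with mode -rwxr-xr-x does not work for ordinary users.)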

$ ls -l /bin/ping 
-rwxr-xr-x. 1 root root 67680 Feb 23  2021 /bin/ping
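
The same inspection can be done without reading the mode string by eye. The following is a minimal sketch; the function names has_setuid and report_ping_mode are hypothetical, and /bin/ping is simply the path shown in the listing above, which may differ on other systems.

```shell
#!/bin/sh
# Return 0 if the file exists and has the setuid bit set.
has_setuid() {
    [ -u "$1" ]
}

# Print a one-line verdict for a ping binary.
report_ping_mode() {
    # $1 = path to the ping binary, e.g. /bin/ping
    if has_setuid "$1"; then
        echo "OK: $1 has the setuid bit"
    else
        echo "WARN: $1 has no setuid bit; non-root users may be unable to ping"
    fi
}
```

A warning is not proof of a problem on every distribution, so it is worth following up by running ping against the gateway as the cluster's operating-system user.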

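III. Root cause:

1. The KingbaseES primary/standby switchover process:

  1. The repmgrd process on each standby monitors its upstream node. When it detects a connection error to the upstream node, it retries reconnect_attempts times, waiting reconnect_interval seconds between attempts.
  2. The cluster can only conclude that the primary has failed from network connection timeouts. After reconnect_attempts failed retries the upstream node is considered down, and repmgrd then determines whether that upstream node is a primary or a standby.
  3. Once the primary is confirmed as failed, the standby kills its wal_receiver process (every standby does this) and promotion begins. The cluster elects one of the available standbys, which then executes the promote statement. When promotion succeeds, the failover is complete.
  4. Switchover details: assuming the local node wins the election, it first checks the trusted gateway (if a VIP is configured, the VIP is removed from the old primary and added on the new primary; this step uses the ping command and pings the VIP twice), and only then executes the actual promote statement. When promotion succeeds, the failover is complete.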
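
The retry behaviour in step 1 can be pictured with a short loop. This is an illustration of the idea, not repmgrd's implementation: retry_check is a hypothetical helper, its two numeric arguments stand in for reconnect_attempts and reconnect_interval, and the check command is whatever the caller supplies.

```shell
#!/bin/sh
# Run a check command up to $1 times, sleeping $2 seconds between failures.
# Returns 0 as soon as the check succeeds, 1 once all attempts are used up.
retry_check() {
    attempts=$1
    interval=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        echo "checking state, $i of $attempts attempts"
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
    echo "unable to reconnect after $attempts attempts"
    return 1
}
```

With the values visible in the sample log (10 attempts, sleeps of up to 6 seconds), a call such as retry_check 10 6 followed by some connectivity check would give up only after all ten attempts fail, which is the point at which the standby decides the upstream node is down.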

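2. Specific cause:

Taken together, the switchover process above and the kbha.log entries show that the problem was caused by changed permissions on the ping binary. ping uses ICMP and has to send ICMP packets, and opening the raw socket this requires needs root privileges (or the CAP_NET_RAW capability).

Normally the mode of the ping binary should be -rwsr-xr-x, that is, a setuid file owned by root. Once the setuid bit is removed, ordinary users can no longer use the command, so the gateway and VIP checks during switchover fail.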

$ ls -l /bin/ping 
-rwxr-xr-x. 1 root root 67680 Feb 23  2021 /bin/ping

IV. Solution:

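Because Linux distributions differ in how they ship ping, we recommend checking the ping binary when deploying a KingbaseES cluster and setting its mode to -rwsr-xr-x if necessary, so that automatic failover cannot fail because of ping permissions.

Run the following as root: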

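# Run either one of the following commands
# Add the setuid bit to the existing mode
chmod u+s /bin/ping
# or set the full mode, including setuid, explicitly
chmod 4755 /bin/ping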

# ping with the correct permissions
$ ls -l /bin/ping 
-rwsr-xr-x. 1 root root 67680 Feb 23  2021 /bin/ping

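Verification: after the fix, manually shutting down the primary node triggers automatic database failover, and it completes normally.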

From: https://www.cnblogs.com/nwwhile/p/17142071.html
