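KingbaseES V8R6: Automatic Database Failover Failed

Applies to:

KingbaseES V8R6.

repmgr configuration:

First check the repmgr.conf configuration file and confirm that the parameters failover='automatic' and recovery='standby' are set identically on the primary node and the standby nodes.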
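
One way to compare these two settings across nodes is to print them from each node's repmgr.conf. The following is a minimal sketch; the function name show_failover_settings is hypothetical, and the file path must be replaced with the actual location of repmgr.conf in your deployment.

```shell
# Print the failover and recovery settings from a repmgr.conf-style file.
# Usage: show_failover_settings <path-to-repmgr.conf>   (the path is deployment-specific)
show_failover_settings() {
    conf_file="$1"
    # Match uncommented "failover=..." or "recovery=..." lines,
    # allowing optional whitespace around the "=" sign.
    grep -E '^[[:space:]]*(failover|recovery)[[:space:]]*=' "$conf_file"
}
```

Run the function on the primary and on each standby, then compare the printed lines; they should be identical.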

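I. Symptoms:

Automatic database failover fails: the failover switchover does not complete.
None of the other healthy standby nodes is selected (promoted) as the new primary node.
The hamgr.log on an available KingbaseES standby node contains entries similar to the following.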

[2023-02-10 11:39:46] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-02-10 11:39:46] [INFO] sleeping up to 6 seconds until next reconnection attempt
[2023-02-10 11:39:52] [INFO] checking state of node "node20" (ID: 1), 10 of 10 attempts
[2023-02-10 11:39:52] [WARNING] unable to reconnect to node "node20" (ID: 1) after 10 attempts
[2023-02-10 11:39:52] [NOTICE] repmgrd on this node is paused
[2023-02-10 11:39:52] [DETAIL] no failover will be carried out
[2023-02-10 11:39:52] [HINT] execute "repmgr service unpause" to resume normal failover mode

The kbha.log on an available KingbaseES standby node contains entries similar to the following.

[2023-02-10 10:31:34] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID
[2023-02-10 10:31:34] [DEBUG] repmgrd is running, can not start another one.
[2023-02-10 10:31:37] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.

--- 10.10.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1029ms
rtt min/avg/max/mdev = 0.729/0.745/0.761/0.016 ms
...
[2023-02-10 11:32:37] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID
[2023-02-10 11:32:37] [DEBUG] repmgrd is running, can not start another one.
[2023-02-10 11:32:37] [DEBUG] the thread 428402432 is still running
ping: socket: Operation not permitted (localized output: 不允许的操作)
[2023-02-10 11:32:38] [NOTICE] 
[2023-02-10 11:32:38] [WARNING] ping host"10.10.10.1" failed
[2023-02-10 11:32:38] [DETAIL] average RTT value is not greater than zero
[2023-02-10 11:32:38] [DEBUG] ping process end early. usleep(1978998).
ping: socket: Operation not permitted (localized output: 不允许的操作)

Note: the log records above are only examples. Dates, times, and environment-specific values will differ between environments.

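II. Troubleshooting:

Reference, the KingbaseES service continuity maintenance guide: https://help.kingbase.com.cn/v8/admin/general/maintenance/maintenance-1.html

When the database cluster fails or suffers unplanned downtime, use the hamgr.log and kbha.log logs to locate the cause.

1. hamgr.log:

# Node 1: primary hamgr log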
[2023-02-10 11:37:58] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2023-02-10 11:37:58] [INFO] connecting to database "host=10.10.10.20 user=esrep dbname=esrep port=5432 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
INFO:  set_repmgrd_pid(): provided pidfile is /home/kingbase/cluster/kingbase/etc/hamgrd.pid
[2023-02-10 11:37:58] [NOTICE] starting monitoring of node "node20" (ID: 1)
[2023-02-10 11:37:58] [INFO] "connection_check_type" set to "mix"
[2023-02-10 11:37:58] [NOTICE] monitoring cluster primary "node20" (ID: 1)
[2023-02-10 11:37:59] [INFO] child node "node21" (ID: 2) is attached
[2023-02-10 11:38:22] [NOTICE] TERM signal received
[2023-02-10 11:38:22] [ERROR] unable to determine if server is in recovery
[2023-02-10 11:38:22] [DETAIL] 
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

[2023-02-10 11:38:22] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2023-02-10 11:38:22] [INFO] repmgrd terminating...

# Node 2: standby hamgr log
[2023-02-10 11:38:00] [NOTICE] starting monitoring of node "node21" (ID: 2)
[2023-02-10 11:38:00] [INFO] "connection_check_type" set to "mix"
[2023-02-10 11:38:00] [INFO] monitoring connection to upstream node "node20" (ID: 1)
[2023-02-10 11:38:00] [NOTICE] try to change wal catched_up state to 1
[2023-02-10 11:38:00] [INFO] primary flush lsn is AE/DF000590, local flush lsn is AE/DF000590
[2023-02-10 11:38:00] [NOTICE] try to change streaming_sync state to TRUE
[2023-02-10 11:38:23] [WARNING] unable to ping "host=10.10.10.20 user=esrep dbname=esrep port=5432 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2023-02-10 11:38:23] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-02-10 11:38:23] [WARNING] unable to connect to upstream node "node20" (ID: 1)
[2023-02-10 11:38:23] [INFO] checking state of node "node20" (ID: 1), 1 of 10 attempts
[2023-02-10 11:38:23] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=10.10.10.20 port=5432 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
...
[2023-02-10 11:39:46] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-02-10 11:39:46] [INFO] sleeping up to 6 seconds until next reconnection attempt
[2023-02-10 11:39:52] [INFO] checking state of node "node20" (ID: 1), 10 of 10 attempts
[2023-02-10 11:39:52] [WARNING] unable to reconnect to node "node20" (ID: 1) after 10 attempts
[2023-02-10 11:39:52] [NOTICE] repmgrd on this node is paused
[2023-02-10 11:39:52] [DETAIL] no failover will be carried out
[2023-02-10 11:39:52] [HINT] execute "repmgr service unpause" to resume normal failover mode
[2023-02-10 11:43:02] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-02-10 11:43:02] [INFO] node "node21" (ID: 2) monitoring upstream node "node20" (ID: 1) in degraded state
[2023-02-10 11:43:02] [DETAIL] repmgrd paused by administrator
[2023-02-10 11:43:02] [HINT] execute "repmgr service unpause" to resume normal failover mode

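2. What hamgr.log shows:

The primary node went down at [2023-02-10 11:38:22]. The available standby detected at [2023-02-10 11:38:23] that the primary was unreachable. Once the reconnect_attempts retry threshold was exhausted, automatic failover should have started, but the standby's hamgr.log entry at [2023-02-10 11:39:52] reports that repmgrd on the node is paused and that no failover will be carried out.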
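
The two marker lines from the excerpt above can be searched for directly. The following is a minimal sketch; the function name failover_blocked_by_pause is hypothetical, and the log path is whatever your deployment uses for hamgr.log.

```shell
# Report whether a hamgr.log shows repmgrd declining to fail over because it is paused.
# Returns 0 if both marker messages are present in the file, 1 otherwise.
failover_blocked_by_pause() {
    log_file="$1"
    # -F treats the patterns as fixed strings; -q suppresses output.
    grep -F -q 'repmgrd on this node is paused' "$log_file" &&
        grep -F -q 'no failover will be carried out' "$log_file"
}
```

For example, `failover_blocked_by_pause <path-to-hamgr.log> && echo "failover was skipped: repmgrd paused"` prints the message only when both entries appear in the log.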

3. kbha.log:

# Standby node: kbha.log
--- 10.10.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1019ms
rtt min/avg/max/mdev = 0.729/0.737/0.745/0.008 ms

[2023-02-10 10:31:34] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID
[2023-02-10 10:31:34] [DEBUG] repmgrd is running, can not start another one.
[2023-02-10 10:31:37] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.

--- 10.10.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1029ms
rtt min/avg/max/mdev = 0.729/0.745/0.761/0.016 ms
...
[2023-02-10 11:32:37] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID
[2023-02-10 11:32:37] [DEBUG] repmgrd is running, can not start another one.
[2023-02-10 11:32:37] [DEBUG] the thread 428402432 is still running
ping: socket: Operation not permitted (localized output: 不允许的操作)
[2023-02-10 11:32:38] [NOTICE] 
[2023-02-10 11:32:38] [WARNING] ping host"10.10.10.1" failed
[2023-02-10 11:32:38] [DETAIL] average RTT value is not greater than zero
[2023-02-10 11:32:38] [DEBUG] ping process end early. usleep(1978998).
ping: socket: Operation not permitted (localized output: 不允许的操作)

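4. What the kbha.log on the available standby shows:

At [2023-02-10 10:31:37] the standby node could still use the ping command to check the availability of the cluster gateway. From [2023-02-10 11:32:37] onward, ping fails with "ping: socket: Operation not permitted" (不允许的操作 in the localized output). The cause is that the permissions of the ping command had been changed. ping uses the ICMP protocol and must send ICMP packets, and creating the socket for those packets requires root privileges. The correct mode for ping is therefore -rwsr-xr-x, that is, a setuid file; once that bit is removed, ordinary users can no longer run the command. (On CentOS Linux, ping with mode -rwxr-xr-x still works for ordinary users, typically because the needed privilege is granted another way, such as file capabilities. The affected system, however, runs Kylin Linux, where ping with mode -rwxr-xr-x does not work for ordinary users.)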

$ ls -l /bin/ping 
-rwxr-xr-x. 1 root root 67680 Feb 23  2021 /bin/ping
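
The same check can be scripted so that it does not depend on reading the mode string by eye. The following is a minimal sketch using the standard `-u` file test; the function name has_setuid is hypothetical.

```shell
# Return 0 if the given file has the setuid bit set, 1 otherwise.
has_setuid() {
    [ -u "$1" ]
}

# Example usage (path taken from the listing above):
#   has_setuid /bin/ping || echo "warning: /bin/ping has no setuid bit"
```

A missing setuid bit is not an error on every distribution, as the CentOS comparison shows, so it is worth also running ping once as the database operating-system user to confirm whether it actually works.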

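III. Root cause:

1. KingbaseES primary/standby failover process:

When the primary node of a KingbaseES cluster fails, the repmgrd process on the standby node handles monitoring and detection. When it detects a connection error to its upstream node, it retries reconnect_attempts times, waiting reconnect_interval seconds between retries.
The cluster can only conclude that the primary has failed after the network connection attempts time out. After reconnect_attempts retries the upstream node is declared failed, and the standby then determines whether that upstream node is the primary or another standby.
Once KingbaseES confirms that the primary has failed, the standby kills its wal_receiver process (every standby does this) and promotion begins. The cluster elects one of the available standbys as the node to promote, which then runs the promotion statement; when that succeeds, the failover is complete.
Switchover sequence: assuming the local node wins the election, it checks the trusted gateway (if a VIP is configured, the VIP is removed from the old primary and added on the new primary; this step uses the ping command and pings the VIP twice), and then runs the actual promotion statement. When that succeeds, the failover is complete.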
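
The retry behavior governed by reconnect_attempts and reconnect_interval can be illustrated with a small loop. The following is a sketch of the general pattern only, not repmgrd's actual implementation; the function name retry_check is hypothetical.

```shell
# Run a check command up to $1 times, sleeping $2 seconds between failed attempts.
# Returns 0 as soon as the check succeeds, 1 if every attempt fails.
retry_check() {
    attempts="$1"
    interval="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "check failed, $i of $attempts attempts"
        # Sleep only between attempts, not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `retry_check 10 6 ping -c 1 10.10.10.20` mirrors the 10 attempts and 6-second waits seen in the log excerpt. Note that if the check command itself cannot run, as with the ping permission problem here, every attempt fails regardless of the real state of the target.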

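2. Specific cause of the failure:

From the failover process described above and the kbha.log entries, the problem was caused by a change to the permissions of the ping command. ping uses the ICMP protocol and must send ICMP packets, and creating the socket for those packets requires root privileges. Because the gateway and VIP checks in the switchover sequence rely on ping, the failover could not proceed once ping stopped working for the database user.

Normally the mode of the ping command should be -rwsr-xr-x, that is, a setuid file. Once that permission is changed, ordinary users can no longer run the command.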

$ ls -l /bin/ping 
-rwxr-xr-x. 1 root root 67680 Feb 23  2021 /bin/ping

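IV. Solution:

Because Linux distributions differ, it is recommended to check the ping command when deploying a KingbaseES cluster and set its mode to -rwsr-xr-x, so that automatic failover cannot fail because of ping permissions.

Run the following as the root user: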

# Run either one of the following commands
chmod u+s /bin/ping
# or
chmod 4755 /bin/ping

# ping with the correct permissions
$ ls -l /bin/ping 
-rwsr-xr-x. 1 root root 67680 Feb 23  2021 /bin/ping

Verify again: after the primary node is shut down manually, automatic database failover now completes normally.

Source: https://www.cnblogs.com/nwwhile/p/17142071.html
