首页 > 数据库 >KingbaseES V8R3集群运维案例之---主库数据库服务down后failover切换详解

KingbaseES V8R3集群运维案例之---主库数据库服务down后failover切换详解

时间:2023-09-18 16:27:09浏览次数:40  
标签:主库 02 V8R3 07 05 failover 192.168 down 2023

案例说明:
对KingbaseES V8R3集群,主库数据库服务down后,failover切换进行分析,详解其执行切换的过程,本案例可用于对KingbaseES V8R3集群failover故障的分析参考。

适用版本:
KingbaseES V8R3

集群架构:

 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.333333  | standby | 0          | true              | 0
 1       | 192.168.1.102 | 54321 | up     | 0.333333  | primary | 0          | false             | 0
 2       | 192.168.1.103 | 54321 | up     | 0.333333  | standby | 0          | false             | 0

一、主库数据库服务down
[kingbase@node102 bin]$ ./sys_ctl stop -D ../data

二、查看master节点cluster.log

1、此节点为kingbasecluster的master节点(一般和流复制的primary节点在同一节点)

2023-05-07 02:18:21: pid 11666: WARNING:  checking setuid bit of arping command
2023-05-07 02:18:21: pid 11666: DETAIL:  arping[/home/kingbase/cluster/HAR3/db/bin//arping] doesn't have setuid bit
2023-05-07 02:18:21: pid 11666: LOG:  Backend status file /home/kingbase/cluster/HAR3/run/kingbasecluster/kingbasecluster_status does not exist
......
2023-05-07 02:18:22: pid 11706: LOG:  watchdog node state changed from [INITIALIZING] to [STANDING FOR MASTER]

2、检测到和主库数据库的health checking失败次数达到阈值(HEALTH_CHECK_MAX_RETRIES=6)。

2023-05-07 02:23:24: pid 11666: LOG:  health checking retry count 1
2023-05-07 02:23:24: pid 11666: LOG:  failed to connect to kingbase server on "192.168.1.102:54321", getsockopt() detected error "Connection refused"
.......
2023-05-07 02:24:04: pid 11666: LOG:  health checking retry count 5
2023-05-07 02:24:04: pid 11666: LOG:  failed to connect to kingbase server on "192.168.1.102:54321", getsockopt() detected error "Connection refused"
2023-05-07 02:24:04: pid 11666: ERROR:  failed to make persistent db connection
2023-05-07 02:24:04: pid 11666: DETAIL:  connection to host:"192.168.1.102:54321" failed

3、执行failover切换前,master节点需要只有failover lock;从kingbasecluster的standby节点接收到failover lock request,默认只有master节点可以持有failover lock 。

2023-05-07 02:24:14: pid 11706: LOG:  received the failover command lock request from remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-05-07 02:24:14: pid 11706: LOG:  remote kingbasecluster node "192.168.1.102:9999 Linux node102" is requesting to become a lock holder for failover ID: 0
2023-05-07 02:24:14: pid 11706: LOG:  request to become a lock holder is denied to remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-05-07 02:24:14: pid 11706: DETAIL:  only master/coordinator can become a lock holder
2023-05-07 02:24:14: pid 11666: LOG:  Kingbasecluster-II parent process has received failover request

4、master节点执行failover_stream.sh脚本触发failover切换。(在此时间点可以在failover.log看到详细的failover切换过程)

2023-05-07 02:24:14: pid 11666: LOG:  execute command: /home/kingbase/cluster/HAR3/kingbasecluster/bin/failover_stream.sh 192.168.1.101 1 1 192.168.1.102 192.168.1.102 0 0 /home/kingbase/cluster/HAR3/db/data
2023-05-07 02:24:14: pid 11706: LOG:  received the failover command lock request from remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-05-07 02:24:14: pid 11706: LOG:  remote kingbasecluster node "192.168.1.102:9999 Linux node102" is checking the status of [FAILOVER] lock for failover ID 0
2023-05-07 02:24:14: pid 11706: LOG:  FAILOVER lock is currently LOCKED

5、切换完成后,将其他备库recovery,连接到新的主库节点。

2023-05-07 02:25:28: pid 11706: LOG:  received the failover command lock request from remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-05-07 02:25:28: pid 11706: LOG:  remote kingbasecluster node "192.168.1.102:9999 Linux node102" is checking the status of [FAILOVER] lock for failover ID 55
2023-05-07 02:25:28: pid 11706: LOG:  FAILOVER lock is currently LOCKED
.......
2023-05-07 02:25:45: pid 11666: LOG:  starting fail back. reconnect host 192.168.1.103(54321)

三、master节点的failover.log

Tips:
日志信息记录的时间点和在cluster.log日志中记录的执行failover_stream.sh的时间点相同,两个日志可以配合查询。

1、执行failover切换

-----------------2023-05-07 02:24:14 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [192.168.1.101], %P = old primary node id [1], %d = node id[1], %h = host name [192.168.1.102], %O = old primary host[192.168.1.102] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/HAR3/db/data].
----ping trust ip
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
----determine whether the faulty db is master or standby
master down, let 192.168.1.101 become new primary.....
 2023-05-07 02:24:16 del old primary VIP on 192.168.1.102
es_client connect host:192.168.1.102 success, will stop old primary db and del the vip
stop the old primary db
sys_ctl: PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid" does not exist
Is server running?
DEL VIP NOW AT 2023-05-07 02:24:02 ON enp0s3
execute: [/sbin/ip addr del 192.168.1.204/24 dev enp0s3]
Oprate del ip cmd end.
2023-05-07 02:24:16 add VIP on 192.168.1.101
ADD VIP NOW AT 2023-05-07 02:24:17 ON enp0s3
execute: [/sbin/ip addr add 192.168.1.204/24 dev enp0s3 label enp0s3:2]
execute: /home/kingbase/cluster/HAR3/db/bin//arping -U 192.168.1.204 -I enp0s3 -w 1
Success to send 1 packets
2023-05-07 02:24:17 promote begin...let 192.168.1.101 become master
check db if is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-05-07 02:24:17 kingbase is ok , to prepare execute promote
execute promote
server promoting
check db if is alive after promote
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"   -c "select 33333;"
2023-05-07 02:24:17 after execute promote , kingbase status is ok.
after execute promote, kingbase is ok.
2023-05-07 02:24:17 sync to async
ALTER SYSTEM
 SYS_RELOAD_CONF
-----------------
 t
(1 row)

2023-05-07 02:24:17 make checkpoint
check the db to see if it is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"  -c "select 33333;"
2023-05-07 02:24:17 kingbase is ok , to prepare execute checkpoint
execute checkpoint
CHECKPOINT
check the db to see if it is alive after execute checkpoint
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"   -c "select 33333;"
2023-05-07 02:24:17 after execute checkpoint, kingbase is ok.
after execute checkpoint, kingbase is ok.
-----------------2023-05-07 02:24:17 failover end---------------------------------------

2、执行其他备库节点的recovery

-----------------2023-05-07 02:25:28 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [192.168.1.101], %P = old primary node id [0], %d = node id[2], %h = host name [192.168.1.103], %O = old primary host[192.168.1.101] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/HAR3/db/data].
----ping trust ip
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
----determine whether the faulty db is master or standby
standby down, master still 192.168.1.101
The sys_stat_replication view result is : []
2023-05-07 02:25:30 sync to async
ALTER SYSTEM
 SYS_RELOAD_CONF
-----------------
 t
(1 row)

-----------------2023-05-07 02:25:30 failover end---------------------------------------

四、standby 节点的cluster.log

1、此节点为kingbasecluster的standby节点(一般和流复制的standby节点在同一节点)

---- Sun May 7 02:18:06 CST 2023 monitor up ----
2023-05-07 02:18:06: pid 31862: WARNING:  checking setuid bit of arping command
.......
2023-05-07 02:18:07: pid 31886: LOG:  setting the remote node "192.168.1.101:9999 Linux node101" as watchdog cluster master
2023-05-07 02:18:08: pid 31886: LOG:  watchdog node state changed from [INITIALIZING] to [STANDBY]
2023-05-07 02:18:08: pid 31886: LOG:  successfully joined the watchdog cluster as standby node
.......

2、检测到和主库数据库的health checking达到阈值(HEALTH_CHECK_MAX_RETRIES=6)

2023-05-07 02:23:09: pid 31862: LOG:  health checking retry count 1
2023-05-07 02:23:09: pid 31862: LOG:  failed to connect to kingbase server on "192.168.1.102:54321", getsockopt() detected error "Connection refused"
2023-05-07 02:23:09: pid 31862: ERROR:  failed to make persistent db connection
2023-05-07 02:23:09: pid 31862: DETAIL:  connection to host:"192.168.1.102:54321" failed
.......
2023-05-07 02:23:59: pid 31862: LOG:  health checking retry count 6
2023-05-07 02:23:59: pid 31862: LOG:  failed to connect to kingbase server on "192.168.1.102:54321", getsockopt() detected error "Connection refused"
........

3、kingbasecluster的standby节点,向master节点发出持有failover lock的request,等待master节点的响应。

2023-05-07 02:23:59: pid 31886: LOG:  failover request from local kingbasecluster node received on IPC interface is forwarded to master watchdog node "192.168.1.101:9999 Linux node101"
2023-05-07 02:23:59: pid 31886: DETAIL:  waiting for the reply...
.......
2023-05-07 02:25:16: pid 31886: LOG:  failover command lock request from local kingbasecluster node received on IPC interface is forwarded to master watchdog node "192.168.1.101:9999 Linux node101"
2023-05-07 02:25:16: pid 31886: DETAIL:  waiting for the reply...

3、failover切换完成后,将其他备库节点recovery连接到新的主库节点。

failover done. shutdown host 192.168.1.103(54321)2023-05-07 02:25:16: pid 31862: LOG:  failover done. shutdown host 192.168.1.103(54321)
.......
2023-05-07 02:25:30: pid 31862: LOG:  failback done. reconnect host 192.168.1.103(54321)

五、总结

KingbaseES V8R3集群failover切换流程:
1、集群启动后,kingbasecluster服务选举master节点和standby节点,master和standby节点之间通过watchdog传递心跳。
2、当master和standby节点,检测到主库的数据库服务(healthy check)次数超过阈值后,触发failover切换。
3、failover切换前,master节点需要持有failover lock。如果是主库主机down或重启,kingbasecluster的standby节点将切换为master,并获取failover lock。
4、master节点持有failover lock后,执行failover_stream.sh触发failover切换,如果master节点主机hang住,有可能导致无法执行failover_stream.sh,导致切换失败。
5、failover切换完成后,一个流复制备库节点切换为primary(默认管理备库节点),还将recovery其他备库节点到新主库。
6、failover切换过程可以从cluster.log和failover.log获取到详细的信息。
7、也只有获得锁的master KingbaseCluster可以进行选主,切换等操作。为standby的KingbaseCluster,当且仅当重新选举为新的master后,才会生效。

标签:主库,02,V8R3,07,05,failover,192.168,down,2023
From: https://www.cnblogs.com/kingbase/p/17561166.html

相关文章

  • KingbaseES V8R3集群运维案例---failover切换故障分析
    案例说明:KingbaseESV8R3集群主库数据库服务重启后,failover切换失败,分析failover失败的具体原因。适用版本:KingbaseESV8R3一、集群架构node13----->主库(primary)node25----->管理备库(standby)node58----->备库(standby)二、故障现象1主2备集群,172.31.*......
  • KingbaseES V8R3集群运维案例之---流复制异步同步及全同步模式配置
    案例说明:通过案例描述KingbaseESV8R3集群异步、同步及全同步强一致性配置,本案例为一主二备的架构。适用版本:KingbaseESV8R3集群架构:集群复制配置参数说明:1)sync_flag[kingbase@node101bin]$cat../etc/HAmodule.conf|grep-isync_#1->synchronouscluster,0->async......
  • kingbaseES V8R3集群运维案例之---集群部署前后ssh端口修改
    kingbaseESV8R3集群运维案例之---集群部署前后ssh端口修改案例说明:kingbaseESV8R3集群部署读写分离的集群是使用ssh的默认端口(22)部署,当改为非默认端口时,在部署中或部署后会因kingbasecluster脚本ssh的连接而失败,现提出以下解决方案。适用版本:KingbaseESV8R31......
  • day01-Markdown
    #Markdown##二级标题###三级标题##字体**hello,world!***hello,world!****hello,world!***~~hello,world!~~##引用>选择狂神说jiava,走向人生巅峰##分割线---***##图片![截图1](E:\java_kuangshen\My_Typora\1.png)###链接图![鬼刀](https://img1.baidu.......
  • day01_Markdown学习
    Markdown二级标题三级标题字体hello,world!hello,world!hello,world!hello,world!引用选择狂神说jiava,走向人生巅峰分割线图片链接图![鬼刀](https://img1.baidu.com/it/u=2371695736,2796537361&fm=253&fmt=auto&app=138&f=JPEG?w=885&h=500"style="width:......
  • K8S你学废了么5——Secret与downloadAPI
    一、背景介绍前面介绍的k8s中的pv存储卷与cm存储卷,k8s中还有2中特殊的存储卷:secret和downloadAPI。其作用分别是用来存放敏感信息和将pod中的信息暴漏给pod中运行的代码,这也是k8s中经常会用到的两个存储卷,下面就这两个存储卷展开详细说明。二、Secret存储卷尽管configMap资源也可以......
  • windows11右键集成新建markdown文件
    右键集成新建markdown文档在桌面新建文本文档输入以下内容WindowsRegistryEditorVersion5.00[HKEY_CLASSES_ROOT.md]@="Typora.exe"[HKEY_CLASSES_ROOT.md\ShellNew]"NullFile"=""[HKEY_CLASSES_ROOT\Typora.exe]@="Markdown"把后缀名改为.reg点击执行......
  • Markdown语法解释
    Markdown学习标题: 二级标题三级标题四级标题五级标题六级标题 字体Hello,World! Hello,World! Hello,World! Hello,World! 引用选择java 分割线 图片  超链接点击进入百度列表ABcABC表格名字性别生日......
  • markdown使用vscode生成目录
    要想用vscode生成目录,按照如下步骤执行即可:需要先安装插件MarkdownAllinone在想要插入目录的地方快捷键Ctrl+Shift+P然后在弹出的框里面输入TOC3.选择命令createtableofcontents就完成了目录生成如果想要导出为Html页面的话输入Ctrl+Shift+P然后在弹出的......
  • markdown画图组件-mermaid
    风险名单业务系统,查询,名单,核实数据数据来源导入-外部系统sequenceDiagram名单系统->>数据仓库:获取文件数据仓库-->>名单系统:返回文件名单系统->>名单系统:更新存量名单维护-本系统管理管理申诉抗辩sequenceDiagramparticipant信贷系统as调用......