
KingbaseES V8R3 Cluster Operations Series -- Automatic Cluster Recovery After Failover


Case description:
By default, after a failover is triggered in a KingbaseES V8R3 cluster, the former primary must be recovered manually and rejoined to the cluster as a new standby in order to guarantee data safety. At unattended sites, however, the former primary should recover automatically as a new standby and rejoin the cluster after the failover, which improves the availability of the architecture.

Applicable version:
KingbaseES V8R3

Cluster architecture:

 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | standby | 0          | true              | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | primary | 0          | false             | 0
(2 rows)

I. Configure the AUTO_PRIMARY_RECOVERY parameter

Tips:
The AUTO_PRIMARY_RECOVERY parameter lives in the HAmodule.conf file; the corresponding configuration files under both the db and kingbasecluster directories must be modified.

[kingbase@node102 bin]$ cat ../etc/HAmodule.conf |grep -i auto
#automatic recovery log path.example:RECOVERY_LOG_DIR="./log/recovery.log"
#whether to turn on automatic recovery,0->off,1->on.example:AUTO_PRIMARY_RECOVERY="1"
AUTO_PRIMARY_RECOVERY=0

---As shown above, with the default AUTO_PRIMARY_RECOVERY=0 the former primary is not automatically demoted to a standby and rejoined to the cluster after a failover.

Configure automatic primary recovery by setting AUTO_PRIMARY_RECOVERY=1 in both files, as sketched below.
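A minimal sketch of enabling the parameter in both copies of HAmodule.conf. The db path matches the environment visible in this article's logs; the kingbasecluster path is an assumption and both must be adapted to your installation. Apply the change on every node that may become primary.

# Enable automatic primary recovery in both HAmodule.conf copies.
# The first path follows the layout seen in this article's logs;
# the kingbasecluster path is assumed -- adjust both to your installation.
for conf in /home/kingbase/cluster/HAR3/db/etc/HAmodule.conf \
            /home/kingbase/cluster/HAR3/kingbasecluster/etc/HAmodule.conf
do
    sed -i 's/^AUTO_PRIMARY_RECOVERY=.*/AUTO_PRIMARY_RECOVERY=1/' "$conf"
    grep -i '^AUTO_PRIMARY_RECOVERY' "$conf"
done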

II. Failover test

1. Simulate a failure of the primary database service

[kingbase@node102 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped

2. Cluster node status after the failover

TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | primary | 0          | true              | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | standby | 0          | false             | 0
(2 rows)

---As shown above, after the failover the cluster returned to normal and the former primary (102) joined the cluster as a standby.
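The same checks can also be run non-interactively from the shell. A minimal sketch, assuming the cluster VIP 192.168.1.205 and the default kingbasecluster listening port 9999 (neither value is confirmed in this article; substitute your own connection settings):

# Query node status through the kingbasecluster layer (VIP and port are assumptions).
./ksql "host=192.168.1.205 port=9999 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" \
       -c "show pool_nodes;"

# Check streaming replication directly on the new primary.
./ksql "host=192.168.1.101 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" \
       -c "select * from sys_stat_replication;"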

3. Primary/standby streaming replication status

TEST=# select * from sys_stat_replication;
  PID  | USESYSID | USENAME | APPLICATION_NAME |  CLIENT_ADDR  | CLIENT_HOSTNAME | CLIENT_PORT |         BACKEND_START         | BACKEND_XMIN |   STATE   | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 16942 |       10 | SYSTEM  | node2            | 192.168.1.102 |                 |       16773 | 2023-02-22 14:29:08.870998+08 |              | streaming | 0/D001FDF0    | 0/D001FDF0     | 0/D001FDF0     | 0/D001FDF0      |             2 | sync
(1 row)

III. Review the failover logs

As shown below, failover_stream.sh was executed to perform the failover.

1. failover.log on the new primary

-----------------2023-02-22 14:28:13 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [192.168.1.101], %P = old primary node id [1], %d = node id[1], %h = host name [192.168.1.102], %O = old primary host[192.168.1.102] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/HAR3/db/data].
----ping trust ip
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
----determine whether the faulty db is master or standby
master down, let 192.168.1.101 become new primary.....
 2023-02-22 14:28:15 del old primary VIP on 192.168.1.102        
es_client connect host:192.168.1.102 success, will stop old primary db and del the vip
stop the old primary db
DEL VIP NOW AT 2023-02-22 14:28:15 ON enp0s3
sys_ctl: PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid" does not exist
Is server running?
execute: [/sbin/ip addr del 192.168.1.204/24 dev enp0s3]
Oprate del ip cmd end.
2023-02-22 14:28:15 add VIP on 192.168.1.101
ADD VIP NOW AT 2023-02-22 14:28:15 ON enp0s3
execute: [/sbin/ip addr add 192.168.1.204/24 dev enp0s3 label enp0s3:2]
execute: /home/kingbase/cluster/HAR3/db/bin//arping -U 192.168.1.204 -I enp0s3 -w 1
Success to send 1 packets
2023-02-22 14:28:15 promote begin...let 192.168.1.101 become master
check db if is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 kingbase is ok , to prepare execute promote
execute promote
server promoting
check db if is alive after promote
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"   -c "select 33333;"
2023-02-22 14:28:16 after execute promote , kingbase status is ok.
after execute promote, kingbase is ok.
2023-02-22 14:28:16 sync to async
ALTER SYSTEM
 SYS_RELOAD_CONF
-----------------
 t
(1 row)

2023-02-22 14:28:16 make checkpoint
check the db to see if it is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"  -c "select 33333;"
2023-02-22 14:28:16 kingbase is ok , to prepare execute checkpoint
execute checkpoint
CHECKPOINT
check the db to see if it is alive after execute checkpoint
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"   -c "select 33333;"
2023-02-22 14:28:16 after execute checkpoint, kingbase is ok.
after execute checkpoint, kingbase is ok.
-----------------2023-02-22 14:28:16 failover end---------------------------------------

2. recovery.log on the former primary
As shown below, after the failover sys_rewind restores the former primary as a standby, which then rejoins the cluster.

---------------------------------------------------------------------
2023-02-22 14:29:01 recover beging...
my pid is 21729,officially began to perform recovery
2023-02-22 14:29:01 check read/write on mount point
2023-02-22 14:29:01 check read/write on mount point (1 / 6).
2023-02-22 14:29:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
2023-02-22 14:29:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ... OK
2023-02-22 14:29:01 create/write the file "/home/kingbase/cluster/HAR3/db/data/rw_status_file_625758242" ...
........
2023-02-22 14:29:01 success to check read/write on mount point (1 / 6).
2023-02-22 14:29:01 check read/write on mount point ... ok
2023-02-22 14:29:01 check if the network is ok
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
determine if i am master or standby
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | primary | 0          | true              | 0
 1       | 192.168.1.102 | 54321 | down   | 0.500000  | standby | 0          | false             | 0
(2 rows)

i am standby in cluster,determine if recovery is needed
2023-02-22 14:29:03 now will del vip [192.168.1.204/24]
now, there is no 192.168.1.204/24 on my DEV
sys_ctl: PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid" does not exist
Is server running?
primary node/Im node status is changed, primary ip[192.168.1.101], recovery.conf NEED_CHANGE [1] (0 is need ), I,m status is [2] (1 is down), I will be in recovery.
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | primary | 0          | true              | 0
 1       | 192.168.1.102 | 54321 | down   | 0.500000  | standby | 0          | false             | 0
(2 rows)

if recover node up, let it down , for rewind
2023-02-22 14:29:03 sys_rewind...
sys_rewind  --target-data=/home/kingbase/cluster/HAR3/db/data --source-server="host=192.168.1.101 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST"
datadir_source = /home/kingbase/cluster/HAR3/db/data
rewinding from last common checkpoint at 0/CF000028 on timeline 4
find last common checkpoint start time from 2023-02-22 14:29:03.926782 CST to 2023-02-22 14:29:03.985859 CST, in "0.059077" seconds.
reading source file list
reading target file list
reading WAL in target
Rewind datadir file from source
Get archive xlog list from source
Rewind archive log from source
update the control file: minRecoveryPoint is '0/D001F0B0', minRecoveryPointTLI is '5', and database state is 'in archive recovery'
rewind start wal location 0/CF000028 (file 0000000400000000000000CF), end wal location 0/D001F0B0 (file 0000000500000000000000D0). time from 2023-02-22 14:29:05.926782 CST to 2023-02-22 14:29:06.184927 CST, in "2.258145" seconds.
Done!
 sed conf change #synchronous_standby_names
2023-02-22 14:29:08 file operate
cp recovery.conf...
 change recovery.conf ip -> primary.ip
2023-02-22 14:29:08 no need change recovery.conf, primary node is 192.168.1.101
delete pid file if exist
del the replication_slots if exist
drop the slot [slot_node1].
drop the slot [slot_node2].
2023-02-22 14:29:08 start up the kingbase...
waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "/home/kingbase/cluster/HAR3/db/data/sys_log".
 done
server started
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10"  -c "select 33333;"
 SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
 (slot_node1,)
(1 row)

2023-02-22 14:29:10 create the slot [slot_node1] success.
 SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
 (slot_node2,)
(1 row)

2023-02-22 14:29:10 create the slot [slot_node2] success.
2023-02-22 14:29:10 start up standby successful!
cluster is sync cluster.
SYNC RECOVER MODE ...
2023-02-22 14:29:10 remote primary node change sync
ALTER SYSTEM
 SYS_RELOAD_CONF
-----------------
 t
(1 row)

SYNC RECOVER MODE DONE
2023-02-22 14:29:13 attach pool...
IM Node is 1, will try [pcp_attach_node -U kingbase -W MTIzNDU2 -h 192.168.1.205 -n 1]
pcp_attach_node -- Command Successful
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | primary | 0          | true              | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | standby | 0          | false             | 0
(2 rows)

2023-02-22 14:29:14 attach end..
recovery success,exit script with success
---------------------------------------------------------------------

---As shown above, after the failover the former primary triggered auto-recovery and was restored as a new standby that rejoined the cluster.
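For comparison, when AUTO_PRIMARY_RECOVERY=0 the former primary has to be recovered by hand. A minimal sketch of the equivalent manual steps, reconstructed from the commands visible in recovery.log above (run from the bin directory on the former primary 192.168.1.102; paths, node id and pcp credentials are those of this example environment, and the automatic script additionally manages the VIP and replication slots):

# 1. Make sure the failed instance is stopped before rewinding.
./sys_ctl stop -D /home/kingbase/cluster/HAR3/db/data

# 2. Rewind the former primary against the new primary (same command as in recovery.log).
./sys_rewind --target-data=/home/kingbase/cluster/HAR3/db/data \
             --source-server="host=192.168.1.101 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST"

# 3. Restore recovery.conf so that it points at the new primary 192.168.1.101
#    (the template location varies by installation), then start the node as a standby.
./sys_ctl start -D /home/kingbase/cluster/HAR3/db/data

# 4. Re-attach the recovered node to the cluster (node id 1 in this example).
./pcp_attach_node -U kingbase -W MTIzNDU2 -h 192.168.1.205 -n 1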

From: https://www.cnblogs.com/kingbase/p/17273585.html

    场景:由于Hbase版本升级以及集群切换,现需要将Hbase从A集群(源)迁移至B集群(目的)迁移过程:将源A集群的Hbase需要迁移的表(注意namespace)通过snapshot方式打成快照,然后再通过ExportSnapshot方式迁移至目的B集群,此时目的集群的HDFS目录下的hbase目录会生成.hbase_snapshot和archive目录......