首页 > 其他分享 >KingbaseES V8R6集群运维案例之---级联备库upstream节点故障

KingbaseES V8R6集群运维案例之---级联备库upstream节点故障

时间:2024-03-29 18:44:05浏览次数:28  
标签:11 10 备库 V8R6 运维 08 2023 24 keepalives

      KingbaseES V8R6集群运维案例之---级联备库upstream节点故障

案例说明:
在KingbaseES V8R6集群,构建级联备库后,在其upstream的节点故障后,级联备库如何处理?
适用版本:
KingbaseES V8R6

集群架构:

案例一:

一、配置集群的recovery参数(all nodes)

Tips: 关闭备库的aut-recovery机制

[kingbase@node102 bin]$ cat ../etc/repmgr.conf |grep -i recovery
recovery='manual'

二、查看当前集群状态

1、节点状态

[kingbase@node103 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 20       |         | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | default  | 100      | 20       | 0 bytes | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node3 | standby |   running | node2    | default  | 100      | 20       | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

如下图所示:node3的upstream节点是node2

2、流复制状态

Tips:在node2节点查看流复制状态信息,node3和node2构建流复制。

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_
start         | backend_xmin |  state  |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn |   write_lag    |   f
lush_lag    |   replay_lag   | sync_priority | sync_state |          reply_time
-------+----------+---------+------------------+---------------+-----------------+-------------+-----------------
--------------+--------------+---------+------------+------------+------------+------------+----------------+----
------------+----------------+---------------+------------+-------------------------------
 21188 |       10 | system  | node3            | 192.168.1.103 |                 |       60130 | 2023-08-24 11:37
:16.149515+08 |              | catchup | 2/104C0000 | 2/103E0000 | 2/10320000 | 2/9C47A08  | 0:00:06.917596 | 0:0
0:06.917596 | 0:00:06.917596 |             0 | async      | 2023-08-24 11:37:24.604332+08
(1 row)

test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin  | catalog_xmin
| restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+-------+--------------
+-------------+---------------------
 repmgr_slot_3 |        | physical  |        |          | f         | t      |      21188 | 60748 |
| 2/52AA0000  |
(1 row)

三、模拟upstream节点数据库服务故障

1、查看node3节点repmgrd进程

[kingbase@node103 bin]$ ps -ef |grep repmgrd
kingbase  4619     1  0 11:47 ?        00:00:02 ./repmgrd -d

2、关闭node2(upstream)节点数据库服务

[kingbase@node102 bin]$ ./sys_ctl stop -D /data/kingbase/c7/data/
........
server stopped

3、查看节点状态
如下图所示,node3的upstream节点更新为node1.

4、查看node3节点hamgr.log
如下日志所示,node3在连接node2失败超过阈值后,执行了‘repmgr standby follow’更新了upstream节点。

[2023-08-24 11:47:51] [INFO] checking state of node 2, 1 of 2 attempts
[2023-08-24 11:47:51] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.1.102 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2023-08-24 11:47:51] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-08-24 11:47:51] [INFO] sleeping 5 seconds until next reconnection attempt
[2023-08-24 11:47:56] [INFO] checking state of node 2, 2 of 2 attempts
[2023-08-24 11:47:56] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.1.102 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2023-08-24 11:47:56] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-08-24 11:47:56] [WARNING] unable to reconnect to node 2 after 2 attempts

# node3节点执行‘repmgr standby follow’更新upstream节点
[2023-08-24 11:47:56] [DEBUG] standby follow command is:
  "/home/kingbase/cluster/R6HA/ha7/kingbase/kingbase/bin/repmgr standby follow -f /home/kingbase/cluster/R6HA/ha7/kingbase/kingbase/etc/repmgr.conf -W --upstream-node-id=%n"
[WARNING] following problems with command line parameters detected:
  --no-wait will be ignored when executing STANDBY FOLLOW
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.103 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[DEBUG] connecting to: "user=system connect_timeout=10 dbname=esrep host=192.168.1.101 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[INFO] timelines are same, this server is not ahead
[DETAIL] local node lsn is 2/FE169FB0, follow target lsn is 2/FE16A510
[INFO] creating replication slot as user "system"
[DEBUG] replication slot "repmgr_slot_3" exists but is inactive; reusing
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.103 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[NOTICE] setting node 3's upstream to node 1
[NOTICE] begin to stopp server at 2023-08-24 11:47:56.612680
[NOTICE] stopping server using "/home/kingbase/cluster/R6HA/ha7/kingbase/kingbase/bin/sys_ctl  -D '/data/kingbase/hac7/data' -l /home/kingbase/cluster/R6HA/ha7/kingbase/kingbase/bin/logfile -w -t 90 -m fast stop"
[NOTICE] stopp server finish at 2023-08-24 11:47:58.797572
[NOTICE] begin to start server at 2023-08-24 11:47:58.797665
[NOTICE] starting server using "/home/kingbase/cluster/R6HA/ha7/kingbase/kingbase/bin/sys_ctl  -w -t 90 -D '/data/kingbase/hac7/data' -l /home/kingbase/cluster/R6HA/ha7/kingbase/kingbase/bin/logfile start"
[NOTICE] start server finish at 2023-08-24 11:47:58.907499
[DEBUG] connecting to: "user=system connect_timeout=10 dbname=esrep host=192.168.1.102 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[WARNING] unable to connect to old upstream node 2 to remove replication slot
[HINT] if reusing this node, you should manually remove any inactive replication slots
[NOTICE] STANDBY FOLLOW successful
[DETAIL] standby attached to upstream node "node1" (ID: 1)
[2023-08-24 11:47:58] [DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.103 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
INFO:  set_repmgrd_pid(): provided pidfile is /home/kingbase/cluster/R6HA/ha7/kingbase/kingbase/etc/hamgrd.pid
[2023-08-24 11:47:58] [DEBUG] update_node_record_set_upstream(): Updating node 3's upstream node to 1
[2023-08-24 11:47:58] [NOTICE] node "node3" (ID: 3) is now following primary node "node1" (ID: 1)
[2023-08-24 11:47:58] [DEBUG] monitor_streaming_standby()
[2023-08-24 11:47:58] [DEBUG] upstream node ID in local node record is 1
[2023-08-24 11:47:58] [DEBUG] connecting to upstream node 1: "host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2023-08-24 11:47:58] [DEBUG] connecting to: "user=system connect_timeout=10 dbname=esrep host=192.168.1.101 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[2023-08-24 11:47:58] [DEBUG] upstream node is primary
[2023-08-24 11:47:58] [DEBUG] primary_node_id is 1
[2023-08-24 11:47:58] [DEBUG] connection check type is "mix"

.......

5、查看node3的kingbase.auto.conf
如下所示,kiangbase.auto.conf被更新为与node1连接。

[kingbase@node103 bin]$ cat /data/kingbase/hac7/data/kingbase.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
wal_retrieve_retry_interval = '5000'
primary_conninfo = 'user=system connect_timeout=10 host=192.168.1.101 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node3'
primary_slot_name = 'repmgr_slot_3'

6、在node1上查看流复制
如下所示,node3和node1(primary)建立了流复制。

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_
start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_
lag | replay_lag | sync_priority | sync_state |          reply_time
-------+----------+---------+------------------+---------------+-----------------+-------------+-----------------
--------------+--------------+-----------+------------+------------+------------+------------+-----------+-------
----+------------+---------------+------------+-------------------------------
 29279 |       10 | system  | node3            | 192.168.1.103 |                 |       36038 | 2023-08-24 11:48
:00.187956+08 |              | streaming | 2/FE16AE98 | 2/FE16AE98 | 2/FE16AE98 | 2/FE16AE98 |           |
    |            |             1 | quorum     | 2023-08-24 11:53:22.358465+08
(1 row)

test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin  | catalog_xmin
| restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+-------+--------------
+-------------+---------------------
 repmgr_slot_2 |        | physical  |        |          | f         | f      |            | 60809 |
| 2/FE169FB0  |
 repmgr_slot_3 |        | physical  |        |          | f         | t      |      29279 | 60819 |
| 2/FE16AE98  |
(2 rows)

7、恢复upstream节点数据库服务
如下所示,在原upstream节点node2数据库服务恢复后,集群架构没有发生变化,node3的upstream仍然是node1。

# 启动node2数据库服务
[kingbase@node102 bin]$ ./sys_ctl start -D /data/kingbase/hac7/data/
........
server started

# node1查看集群状态和流复制
[kingbase@node102 bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 20       |         | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | default  | 100      | 20       | 0 bytes | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node3 | standby |   running | node1    | default  | 100      | 20       | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_
start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_
lag | replay_lag | sync_priority | sync_state |          reply_time
-------+----------+---------+------------------+---------------+-----------------+-------------+-----------------
--------------+--------------+-----------+------------+------------+------------+------------+-----------+-------
----+------------+---------------+------------+-------------------------------
 29279 |       10 | system  | node3            | 192.168.1.103 |                 |       36038 | 2023-08-24 11:48
:00.187956+08 |              | streaming | 2/FE16BAB0 | 2/FE16BAB0 | 2/FE16BAB0 | 2/FE16BAB0 |           |
    |            |             1 | quorum     | 2023-08-24 11:56:21.772887+08
 31368 |       10 | system  | node2            | 192.168.1.102 |                 |       62287 | 2023-08-24 11:54
:25.132463+08 |              | streaming | 2/FE16BAB0 | 2/FE16BAB0 | 2/FE16BAB0 | 2/FE16BAB0 |           |
    |            |             1 | quorum     | 2023-08-24 11:56:17.049560+08
(2 rows)

案例二:

四、配置备库auto-recovery

[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep -i recovery
recovery='standby'

1、upstream节点down后集群状态

[kingbase@node102 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 20       |         | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby | - failed  | ? node1  | default  | 100      |          | ?       | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node3 | standby |   running | node1    | default  | 100      | 20       | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

[WARNING] following issues were detected
  - unable to connect to node "node2" (ID: 2)

  ----如上所示,node3的upstream更新为node1

2、node3节点hamgr.log
如下所示,node2节点数据库服务down后,node3节点在node2被recovery前,更新为node1。

[2023-08-24 16:11:23] [DEBUG] is_server_available(): ping status for "host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3" is PQPING_OK
[2023-08-24 16:11:23] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-08-24 16:11:23] [ERROR] unable to execute repmgr.set_upstream_last_seen()
[2023-08-24 16:11:23] [DETAIL] query text is:
SELECT repmgr.set_upstream_last_seen(2)
........

[2023-08-24 16:11:48] [NOTICE] upstream for local node "node3" (ID: 3) appears to have changed, restarting monitoring
[2023-08-24 16:11:48] [DETAIL] currently monitoring upstream 2; new upstream is 1
[2023-08-24 16:11:48] [DEBUG] monitor_streaming_standby()
[2023-08-24 16:11:48] [DEBUG] upstream node ID in local node record is 1

五、总结
对于级联备库,当其上游节点数据库服务down后,会自动将upstream节点更新为primary节点。

标签:11,10,备库,V8R6,运维,08,2023,24,keepalives
From: https://www.cnblogs.com/kingbase/p/17798330.html

相关文章

  • 软件项目管理全套文档模板(开发/实施/运维/安全/交付)
     前言:在软件项目管理中,每个阶段都有其特定的目标和活动,确保项目的顺利进行和最终的成功交付。以下是软件项目管理各个阶段的详细资料:软件项目全套文档资料下载:点我获取1.需求阶段目标:收集、分析和定义用户需求和业务目标。主要活动:需求调研:与用户沟通,了解他们的需求和......
  • KingbaseES V8R6集群运维案例之---主备failover切换原因分析
    案例说明:生产环境,KingbaseESV8R6的集群发生failover切换,分析集群切换的原因。适用版本:KingbaseESV8R6集群架构:137.xx.xx.67主原备库137.xx.xx.94原主库137.xx.xx.68vip地址一、日志分析1、分析原备库hamgr.log如下所示,通过原备库hamgr.log日志获取到具体......
  • KingbaseES V8R6数据库运维案例之---用户权限导致的备份恢复故障
    案例说明:由于限制了用户对数据库的访问,导致在执行‘sys_backup.shinit’初始化物理备份时,执行失败。适用版本:KingbaseESV8R6一、问题现象如下所示,执行‘sys_backup.shinit’初始化物理备份:1、执行初始化失败[kingbase@node201bin]$shsys_backup.shinitERROR:Con......
  • 高效运维_AIRIOT智慧电力运维解决方案
    可再生能源的引入带来了能源生产的去中心化和分散化趋势,同时也带来了能源输出的波动性和不确定性。电力运维因此需要更加灵活、智能的解决方案,以适应可再生能源的集成,确保电力系统的稳定运行,传统的电力运维管理方式往往存在如下痛点:数据管理和集成难度大:电力系统涉及大量的数据......
  • IT运维综合管理系统:提升效率、降低成本、保障安全
       随着信息技术的快速发展,企业对IT运维的需求也越来越高。IT运维综合管理系统作为一种集成化的解决方案,可以帮助企业提升运维效率、降低成本、保障信息安全。本文将从以下几个方面介绍IT运维综合管理系统的功能和优势。一、统一管理与监控   IT运维综合管理系统......
  • 【运维】在阿里云上搭建自己的图床,配合PicGo和Typora使用
    本文将详细介绍如何在阿里云上搭建自己的图床,包括购买OSS服务、配置域名解析、创建OSS存储桶和设置图片上传规则等步骤。希望对您有所帮助!一、购买OSS服务首先,我们需要在阿里云官网购买OSS(ObjectStorageService)服务。OSS是阿里云提供的一种海量、安全、低成本、高可靠的云存......
  • 【赛题解析】【网络建设与运维】第三阶段Linux Vsftpd部分答案解析
    培训、环境、资料、考证公众号:波比网络公众号2:波比网络工作室网络建设与运维群:685678820波比网络专注于技能提升,赋能ftp服务任务描述:请采用ftp服务器,实现文件安全传输。1.配置 linux1为ftp服务器,安装vsftpd,新建本地用户xiaoming,本地用户登陆ftp后的目录为/var/ft......
  • DevOps迈向标准化,平台工程让开发运维更轻松
    在近一代人的时间里,DevOps在软件开发和运维领域占据了主导地位。这是一套开发人员都离不开的技能和方法。PearlZhu在“TheDigitalMaster”一书中描述了它的重要性,强调“敏捷和DevOps是为了利用整合、互动和创新”。在当今竞争激烈的市场中,这一点尤为重要,因为IT管理团......
  • 服务器运维新手的第一台服务器学习教程
    目前刚接触服务器这一块的学习,这里记录一下解如何获取自己的第一台虚拟云服务器,给刚入行服务器开发的小伙伴做一个参考。具体的步骤如下:一、服务器的注册和获取1、打开bwg88服务器平台地址:点击进入https://bwh88.net/aff.php?aff=743202、进入到官网界面后如下图:3、点击注......
  • 部署、运维
    一、nginx部署、运维(一)部署新站-1)将网站文件拷至/usr/local/webs/目录下-2)配置vim/usr/local/nginx/conf/nginx.conf样例: server{ listen8081; server_namelocalhost:8081; location/{ root/usr/local/webs/ZnHotel#目录 ......