首页 > 其他分享 >KingbaseES V8R3集群运维案例之---集群启动“DATA_SIZE_DIFF 16 (MB)”故障

KingbaseES V8R3集群运维案例之---集群启动“DATA_SIZE_DIFF 16 (MB)”故障

时间:2024-03-29 18:45:37浏览次数:32  
标签:00 12 备库 运维 process 30 kingbase V8R3 集群

案例说明:
为保证集群数据的一致性安全,在主备库的数据相差“DATA_SIZE_DIFF >=16M"以上时,该备库不能参与主备切换,并且通过kingbase_monitor.sh启动集群时,集群将无法启动;本案例对此种故障做了复现,并测试了解决方法。

适用版本:
KingbaseES V8R3

适用版本:

KingbaseES V8R3

一、案例环境

1) 配置流复制架构
=== 此环境为一主一备的架构,并且配置流复制为异步模式,这样模拟备库数据库服务宕机,不能接收主库wal日志并应用,导致主备库之间数据差异增大===

# 配置synchronous_standby_names为null(无sync模式的备库)
test=# show synchronous_standby_names;
 synchronous_standby_names 
---------------------------

(1 row)

# 查看流复制同步模式
test=# select * from sys_stat_replication;
 pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   s
tate   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+----
-------+---------------+----------------+----------------+-----------------+---------------+------------
 6113 |       10 | SYSTEM  | node243          | 192.168.7.243 |                 |       57572 | 2021-03-01 12:04:21.733067+08 |              | str
eaming | 0/2B020560    | 0/2B020560     | 0/2B020560     | 0/2B020560      |             0 | async
(1 row)

2)关闭备库自动recovery和数据库服务

# 关闭crond服务中的network_rewind.sh自启动:
root@node3 ~]# cat /etc/cron.d/KINGBASECRON 
#*/1 * * * * kingbase . /etc/profile;/home/kingbase/cluster/R6HA/KHA/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf >> /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../kbha.log 2>&1
#*/1 * * * * kingbase  /home/kingbase/cluster/kha/db/bin/network_rewind.sh
#*/1 * * * * root  /home/kingbase/cluster/kha/kingbasecluster/bin/restartcluster.sh

# 停止备库的数据库服务
[kingbase@node3 log]$ /home/kingbase/cluster/kha/db/bin/sys_ctl stop -D /data/kingbase/cluster/r3/data
waiting for server to shut down.... done
server stopped

二、在主库进行事务处理模拟生产

1)查看主库sys_xlog

[kingbase@node1 sys_xlog]$ ls -lh
total 1.1G
......
-rw------- 1 kingbase kingbase  16M Mar  1 12:22 000000020000000000000040
-rw------- 1 kingbase kingbase  16M Mar  1 12:22 000000020000000000000041
-rw------- 1 kingbase kingbase  16M Mar  1 12:22 000000020000000000000042
-rw------- 1 kingbase kingbase   41 Mar  1  2021 00000002.history
drwx------ 2 kingbase kingbase 4.0K Mar  1 12:22 archive_status

2)模拟主库事务处理

prod=# insert into t values (generate_series(1,10000000),'usr'||generate_series(1,10000000));
INSERT 0 10000000

3)查看主库sys_xlog日志的变化(主库wal日志在增长)

[kingbase@node1 sys_xlog]$ ls -lh
total 1.8G
......
-rw------- 1 kingbase kingbase  16M Mar  1 12:31 00000002000000000000006F
-rw------- 1 kingbase kingbase  16M Mar  1 12:31 000000020000000000000070
-rw------- 1 kingbase kingbase  16M Mar  1 12:32 000000020000000000000071
-rw------- 1 kingbase kingbase  16M Mar  1 12:33 000000020000000000000072
-rw------- 1 kingbase kingbase   41 Mar  1  2021 00000002.history
drwx------ 2 kingbase kingbase 8.0K Mar  1 12:32 archive_status

三、重新启动集群模拟故障

1) 通过kingbase_monitor.sh重启集群

[kingbase@node3 bin]$ ./kingbase_monitor.sh restart
-----------------------------------------------------------------------
2021-03-01 12:30:07 KingbaseES automation beging...
.......
Received 0 response(s)
ping vip 192.168.7.245 success ping times :[3], success times:[2]
ping vip 192.168.7.245 success ping times :[3], success times:[3]
The diff between the LSN of some standby server and then LSN of primary server is greater than DATA_SIZE_DIFF 16 (MB)
Some standby's data is less than primary, could not start up the whole Cluster

如下图所示:出现”DATA_SIZE_DIFF 16 (MB)“故障:

kingbase_monitor.sh检测diff lsn语句:

2)查看备库数据库服务和流复制状态

=如下所示,集群的kingbasecluster管理服务没有启动,但是数据库服务已经启动=

备库数据库服务:
[kingbase@node3 bin]$ ps -ef |grep kingbase

kingbase 16974     1  0 12:30 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 16991 16974  0 12:30 ?        00:00:00 kingbase: logger process   
kingbase 16992 16974  8 12:30 ?        00:00:11 kingbase: startup process   recovering 00000002000000000000004C
kingbase 16996 16974  0 12:30 ?        00:00:00 kingbase: checkpointer process   
kingbase 16997 16974  0 12:30 ?        00:00:00 kingbase: writer process   
kingbase 16998 16974  0 12:30 ?        00:00:00 kingbase: stats collector process   
kingbase 17202 16974  1 12:30 ?        00:00:02 kingbase: wal receiver process   streaming 0/4D000000
kingbase 17727 17725  0 12:33 ?        00:00:00 /bin/bash /home/kingbase/cluster/kha/db/bin/network_rewind.sh


主库数据库服务:
[kingbase@node1 bin]$ ps -ef |grep kingbase

kingbase 27462     1  0 12:31 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 27472 27462  0 12:31 ?        00:00:00 kingbase: logger process   
kingbase 27475 27462  0 12:31 ?        00:00:00 kingbase: checkpointer process   
kingbase 27476 27462  0 12:31 ?        00:00:00 kingbase: writer process   
kingbase 27477 27462  0 12:31 ?        00:00:00 kingbase: wal writer process   
kingbase 27478 27462  0 12:31 ?        00:00:00 kingbase: autovacuum launcher process   
kingbase 27479 27462  0 12:31 ?        00:00:00 kingbase: archiver process   last was 000000020000000000000073
kingbase 27480 27462  0 12:31 ?        00:00:00 kingbase: stats collector process   
kingbase 27481 27462  0 12:31 ?        00:00:00 kingbase: bgworker: syslogical supervisor   
kingbase 27487 27462  0 12:31 ?        00:00:00 kingbase: wal sender process SYSTEM 192.168.7.243(59934) streaming 0/53680000
kingbase 28041 27462  0 12:32 ?        00:00:01 kingbase: autovacuum worker process   prod


流复制状态:
test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  s
tate  | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+---
------+---------------+----------------+----------------+-----------------+---------------+------------
 27487 |       10 | SYSTEM  | node243          | 192.168.7.243 |                 |       59934 | 2021-03-01 12:31:53.917879+08 |              | ca
tchup | 0/55B20000    | 0/54000000     | 0/54000000     | 0/52FFFFD8      |             0 | async
(1 row)

3)查看备库recovery状态

=备库启动数据库服务后,startup进程执行recovery=

[kingbase@node3 bin]$ ps -ef |grep kingbase

kingbase 16974     1  0 12:30 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 16991 16974  0 12:30 ?        00:00:00 kingbase: logger process   
kingbase 16992 16974  4 12:30 ?        00:00:19 kingbase: startup process   recovering 000000020000000000000071
kingbase 16996 16974  0 12:30 ?        00:00:00 kingbase: checkpointer process   
kingbase 16997 16974  0 12:30 ?        00:00:00 kingbase: writer process   
kingbase 16998 16974  0 12:30 ?        00:00:00 kingbase: stats collector process   
kingbase 17202 16974  1 12:30 ?        00:00:04 kingbase: wal receiver process   streaming 0/72000000
  如下所示:备库数据库服务刚启动时,startup进程从”00000002000000000000004C“开始做recovery,已经应用wal日志到了”000000020000000000000071“。

=通过观察sys_stat_replication视图也能看到lsn的变化,说明备库一直在做recovery=

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  s
tate  | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+---
------+---------------+----------------+----------------+-----------------+---------------+------------
 27487 |       10 | SYSTEM  | node243          | 192.168.7.243 |                 |       59934 | 2021-03-01 12:31:53.917879+08 |              | ca
tchup | 0/715E0000    | 0/70000000     | 0/70000000     | 0/6EFFFE98      |             0 | async

流复制状态是“catchup”:

4)备库完成recovery

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   
state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+---
--------+---------------+----------------+----------------+-----------------+---------------+------------
 27487 |       10 | SYSTEM  | node243          | 192.168.7.243 |                 |       59934 | 2021-03-01 12:31:53.917879+08 |              | st
reaming | 0/81511BE8    | 0/81511BE8     | 0/81511BE8     | 0/81511BE8      |             0 | async
(1 row)
如下所示,此时备库的replay_location和其他的lsn已经相同,说明备库已经完成了wal日志的应用(当然本案例是在主库没有业务的情况下,replay_location和其他的lsn同步,但在生产情况下,业务繁忙时,很难达到同步,只能是缩小差异)。

流复制状态是“streaming”:

查看备库数据库服务:

[kingbase@node3 bin]$ ps -ef |grep kingbase

kingbase 16974     1  0 12:30 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 16991 16974  0 12:30 ?        00:00:00 kingbase: logger process   
kingbase 16992 16974  3 12:30 ?        00:00:20 kingbase: startup process   recovering 000000020000000000000081
kingbase 16996 16974  0 12:30 ?        00:00:01 kingbase: checkpointer process   
kingbase 16997 16974  0 12:30 ?        00:00:00 kingbase: writer process   
kingbase 16998 16974  0 12:30 ?        00:00:00 kingbase: stats collector process   
kingbase 17202 16974  0 12:30 ?        00:00:05 kingbase: wal receiver process   streaming 0/81511BE8

=== 如上所示:startup进程应用的wal日志已经是备库的最后的wal日志,说明备库已经完成了recovery===

查看备库sys_xlog日志:

[kingbase@node3 bin]$ ls -lh /data/kingbase/cluster/r3/data/sys_xlog/
total 2.0G
......
-rw------- 1 kingbase kingbase  16M Mar  1 12:40 00000002000000000000007E
-rw------- 1 kingbase kingbase  16M Mar  1 12:40 00000002000000000000007F
-rw------- 1 kingbase kingbase  16M Mar  1 12:40 000000020000000000000080
-rw------- 1 kingbase kingbase  16M Mar  1 12:40 000000020000000000000081
-rw------- 1 kingbase kingbase   41 Mar  1 12:07 00000002.history
drwx------ 2 kingbase kingbase 8.0K Mar  1 12:40 archive_status

四、启动集群

kingbase@node3 bin]$ ./kingbase_monitor.sh restart
-----------------------------------------------------------------------
2021-03-01 12:43:23 KingbaseES automation beging...
......
......................
all started..
...
now we check again
=======================================================================
|             ip |                       program|              [status] 
[  192.168.7.243]|             [kingbasecluster]|              [active]
[  192.168.7.248]|             [kingbasecluster]|              [active]
[  192.168.7.243]|                    [kingbase]|              [active]
[  192.168.7.248]|                    [kingbase]|              [active]
=======================================================================

=如上所示,集群启动正常,备库已经和主库缩小了数据差异,集群可以正常启动=

五、总结

对于集群出现”DATA_SIZE_DIFF >=16M"故障时,可以首先查看主备库的数据库服务,保证主备库的数据库服务都启动正常,并且流复制状态正常。备库在流复制正常的,并且主库的wal日志(归档和在线日志)都完整情况下,可以通过startup进程来做recovery,缩小和主库的数据差异。对于生产环境需要一个较长的时间做recovery,如果需要快速同步,可以重做备库。

附件:

主备库lsn相差(DATA_SIZE_DIFF)超过16M,cluster无法启动

适用版本:V8R3
问题说明:KingbaseES R3集群启动,主备库lsn相差(DATA_SIZE_DIFF)超过16M, cluster无法启动。
问题现象:

问题分析:
为避免备库和主库之间数据差异较大,启动集群时,发生failover,切换到备库,然后使用sys_rewind导致原主库事务回卷,导致数据丢失,所以当主备库lsn相差(DATA_SIZE_DIFF)超过16M时,集群无法启动。
问题解决:
可以不启动集群管理,手工启动主备库数据库服务,启动流复制,备库数据通过流复制,追赶主库数据,等数据差异小于阈值时,再启动集群。
或在Server/etc/HAmodule.conf文件 和 Cluster/etc/HAmodule.conf 这两个配置文件最后加上DATA_SIZE_DIFF=64 。

标签:00,12,备库,运维,process,30,kingbase,V8R3,集群
From: https://www.cnblogs.com/kingbase/p/17923262.html

相关文章

  • KingbaseES V8R3备份恢复案例之---backup_label does not exist in KINGBASE_DATA
    案例说明:在KingbaseESV8R3集群主库执行sys_rman的全备时,出现‘backup_labeldoesnotexistinKINGBASE_DATA’的故障,如下图所示:适用版本:KingbaseESV8R3一、问题现象如下所示,数据库执行sys_rman物理备份:[kingbase@node201bin]$./sys_rman-Usystem-W123456-dtes......
  • KingbaseES集群运维案例之-- V8R3与V8R6集群wal函数应用
    案例说明:KingbaseESV8R3和V8R6集群在通过函数获取wal日志的相关信息时,两个版本的函数名称不同,本案例做了函数应用的对比和总结。适用版本:KingbaseESV8R3/R6一、KingbaseESV8R3相关函数Tips:在V8R3的版本,事务日志名称为xlog。1、查询数据库支持的函数test=#selectpron......
  • kingbaseES V8R6集群运维案例之---配置priority防止failover切换案例
    案例说明:在一主多备的架构中,需要配置一台备库在主备切换时,不能选举为主库。对于repmgr主备切换主库的选择算法如下:Tips:Repmgr选举候选备节点会以以下顺序选举:LSN---->Priority---->Node_ID。系统会先选举一个LSN比较大者作为候选备节点;如LSN一样,会根据Priority优先级进行比......
  • KingbaseES V8R6集群运维案例之---级联备库upstream节点故障
    KingbaseESV8R6集群运维案例之---级联备库upstream节点故障案例说明:在KingbaseESV8R6集群,构建级联备库后,在其upstream的节点故障后,级联备库如何处理?适用版本:KingbaseESV8R6集群架构:案例一:一、配置集群的recovery参数(allnodes)Tips:关闭备库的aut-recovery机制......
  • java postgres单体库迁移postgres集群库java
    packagecom.slsl.digital.twin.manage.controller.project;importcom.google.common.collect.Lists;importcom.slsl.digital.twin.common.utils.CollectionUtils;importjava.sql.*;importjava.util.*;importjava.util.stream.Collectors;publicclassTest{......
  • 基于containerd 部署 kubernetes 1.28集群
    1、准备说明8台Linux主机,安装Ubuntu20.04系统,其中2台haproxy,3台master节点,3台work节点每台主机不低于2GB内存大小,CPU大于2核心集群中的所有主机网络互通节点中不能存在重复的主机名、mac地址或者product_uuid交换分区配置。kubelet默认是在节点上检测到交换分区时,无法启动......
  • 软件项目管理全套文档模板(开发/实施/运维/安全/交付)
     前言:在软件项目管理中,每个阶段都有其特定的目标和活动,确保项目的顺利进行和最终的成功交付。以下是软件项目管理各个阶段的详细资料:软件项目全套文档资料下载:点我获取1.需求阶段目标:收集、分析和定义用户需求和业务目标。主要活动:需求调研:与用户沟通,了解他们的需求和......
  • 在Linux中,如何配置和管理LVS集群?
    配置和管理LinuxVirtualServer(LVS)集群涉及以下几个主要步骤:环境准备:操作系统:确保所有参与集群的节点运行兼容的Linux发行版,如CentOS、Ubuntu等。内核支持:确认内核版本支持LVS所需的IP负载均衡模块,如IPVS(IPVirtualServer)。对于较新的内核,IPVS通常已集成。若需编译内核,请......
  • 在Linux中,心跳机制在Linux高可用集群中作用是什么?
    在Linux高可用(HighAvailability,HA)集群中,心跳机制发挥着至关重要的作用,主要体现在以下几个方面:节点状态监测:节点间连通性确认:心跳机制允许集群中的各个节点周期性地向其他节点发送心跳信号(通常是简单的消息或数据包),以此表明自己处于正常运行状态。接收节点通过及时响应或确......
  • KingbaseES V8R6集群运维案例之---主备failover切换原因分析
    案例说明:生产环境,KingbaseESV8R6的集群发生failover切换,分析集群切换的原因。适用版本:KingbaseESV8R6集群架构:137.xx.xx.67主原备库137.xx.xx.94原主库137.xx.xx.68vip地址一、日志分析1、分析原备库hamgr.log如下所示,通过原备库hamgr.log日志获取到具体......