KingbaseES V8R3 Cluster Operations Case: Analysis of an OOM Failure on the Primary Database

Date: 2024-04-01 15:47:42

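Case description:
In a KingbaseES V8R3 cluster, the primary database hit an out-of-memory (OOM) condition and produced a core dump, and an analysis was requested. The database host is a Huawei Cloud virtual machine with 64 GB of memory and no swap.
Applicable version:
KingbaseES V8R3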

I. Problem Analysis

1. Check the database OOM information in sys_log

PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
    PortalHeapMemory: 1024 total in 1 blocks; 968 free (0 chunks); 56 used
  Relcache by OID: 24576 total in 2 blocks; 12976 free (4 chunks); 11600 used
  CacheMemoryContext: 516096 total in 6 blocks; 159416 free (0 chunks); 356680 used
    CachedPlan: 1024 total in 1 blocks; 784 free (0 chunks); 240 used
    SYS_EXTENSION_OID_INDEX: 1024 total in 1 blocks; 408 free (0 chunks); 616 used
    SYS_EXTENSION_NAME_INDEX: 1024 total in 1 blocks; 408 free (0 chunks); 616 used
    SYS_DB_ROLE_SETTING_DATABASEID_ROL_INDEX: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
    SYS_OPCLASS_AM_NAME_NSP_INDEX: 1024 total in 1 blocks; 24 free (0 chunks); 1000 used
    SYS_FOREIGN_DATA_WRAPPER_NAME_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
    SYS_SYNONYM_NAME_C_N_INDEX: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
    SYS_ENUM_OID_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
    SYS_CLASS_RELNAME_NSP_INDEX: 1024 total in 1 blocks; 272 free (0 chunks); 752 used
    SYS_FOREIGN_SERVER_OID_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
    SYS_STATISTIC_RELID_ATT_INH_INDEX: 1024 total in 1 blocks; 24 free (0 chunks); 1000 used
    SYS_CAST_SOURCE_TARGET_INDEX: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
    SYS_PKGVARIABLE_OID_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
    SYS_LANGUAGE_NAME_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
    SYS_PACKAGE_OID_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
........
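
A memory-context dump like the one above can run to thousands of lines, so it helps to total it and list the largest contexts before reading it by hand. The following is a minimal sketch, assuming every line follows the `Name: N total in B blocks; F free (C chunks); U used` layout seen in the excerpt; the helper names `parse_memory_contexts` and `largest_contexts` are hypothetical and not part of KingbaseES.

```python
import re
from typing import List, NamedTuple

# Matches one context line; the name may contain spaces (e.g. "Relcache by OID").
CONTEXT_LINE = re.compile(
    r"^\s*(?P<name>.+?): (?P<total>\d+) total in \d+ blocks; "
    r"(?P<free>\d+) free \(\d+ chunks\); (?P<used>\d+) used"
)


class ContextStat(NamedTuple):
    name: str
    total: int  # bytes allocated to the context
    free: int   # bytes free inside the context
    used: int   # bytes in use


def parse_memory_contexts(text: str) -> List[ContextStat]:
    """Extract one ContextStat per recognizable line; other lines are skipped."""
    stats = []
    for line in text.splitlines():
        m = CONTEXT_LINE.match(line)
        if m:
            stats.append(
                ContextStat(m["name"], int(m["total"]), int(m["free"]), int(m["used"]))
            )
    return stats


def largest_contexts(stats: List[ContextStat], n: int = 5) -> List[ContextStat]:
    """Return the n contexts with the largest total size."""
    return sorted(stats, key=lambda s: s.total, reverse=True)[:n]


# Three lines copied from the sys_log excerpt above.
sample = """\
PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
  Relcache by OID: 24576 total in 2 blocks; 12976 free (4 chunks); 11600 used
  CacheMemoryContext: 516096 total in 6 blocks; 159416 free (0 chunks); 356680 used
"""
stats = parse_memory_contexts(sample)
top = largest_contexts(stats, 2)
total_bytes = sum(s.total for s in stats)
```

If the totals come out small, as they do for the lines shown here, the backend that logged the dump was probably not the main memory consumer, which points the investigation toward the rest of the host.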

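2. recovery.log on the primary node
During recovery on the primary, recovery.log reports shared library load failures and memory access failures (shown as a screenshot in the original post).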

3. System messages log

Jul 18 15:00:11 db0001 com.deepin.api.XEventMonitor[20792]: /usr/lib/deepin-daemon/dde-session-daemon: error while loading shared libraries: libgdk-3.so.0: failed to map segment from shared object
Jul 18 15:00:28 db0001 kernel: [56881707.808160] detected fb_set_par error, error code: -16
.........
Jul 18 15:00:30 db0001 com.deepin.dde.lockFront[20792]: 2023-07-18, 15:00:29.938 [Debug  ] [                                                         0] Failed message: "请输入密码"
Jul 18 15:00:30 db0001 com.deepin.daemon.Zone[20792]: /usr/lib/deepin-daemon/dde-session-daemon: error while loading shared libraries: libatk-1.0.so.0: failed to map segment from shared object
Jul 18 15:01:37 db0001 com.deepin.dde.desktop[20792]: QThread::start: Thread creation error: 资源暂时不可用
Jul 18 15:01:37 db0001 com.deepin.dde.desktop[20792]: QThread::start: Thread creation error: 资源暂时不可用
.........
Jul 18 15:02:49 db0001 com.deepin.dde.desktop[20792]: /usr/bin/dde-desktop: error while loading shared libraries: libXau.so.6: failed to map segment from shared object
Jul 18 15:02:59 db0001 com.deepin.dde.desktop[20792]: /usr/bin/dde-desktop: error while loading shared libraries: libQt5XdgIconLoader.so.3: failed to map segment from shared object
........                                         
Jul 18 15:03:09 db0001 com.deepin.dde.desktop[20792]: /usr/bin/dde-desktop: error while loading shared libraries: libcom_err.so.2: failed to map segment from shared object
Jul 18 15:03:21 db0001 com.deepin.dde.desktop[20792]: out of memory
Jul 18 15:03:31 db0001 com.deepin.dde.desktop[20792]: (process:5739): GLib-ERROR (recursed) **: ../../../glib/gmem.c:135: failed to allocate 16368 bytes
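
To build a timeline of when memory pressure first appeared on the host, the messages log can be scanned for the symptoms visible above. This is a rough sketch, assuming the traditional syslog line layout (`Mon DD HH:MM:SS host message`); the marker list is taken only from the lines in this excerpt and is not exhaustive, and the function name is hypothetical.

```python
import re
from typing import Dict, Iterable

# Traditional syslog prefix: "Jul 18 15:00:11 db0001 <rest of message>"
SYSLOG_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+(?P<rest>.*)$"
)

# Lower-cased symptoms of memory exhaustion seen in the excerpt above.
MARKERS = (
    "failed to map segment from shared object",
    "thread creation error",
    "out of memory",
    "failed to allocate",
)


def first_memory_pressure_events(lines: Iterable[str]) -> Dict[str, str]:
    """Map each marker to the timestamp of the first line that contains it."""
    first_seen: Dict[str, str] = {}
    for line in lines:
        m = SYSLOG_LINE.match(line)
        if not m:
            continue
        message = m["rest"].lower()
        for marker in MARKERS:
            if marker in message and marker not in first_seen:
                first_seen[marker] = m["ts"]
    return first_seen


# Lines copied from the messages log excerpt above.
lines = [
    "Jul 18 15:00:11 db0001 com.deepin.api.XEventMonitor[20792]: "
    "/usr/lib/deepin-daemon/dde-session-daemon: error while loading shared "
    "libraries: libgdk-3.so.0: failed to map segment from shared object",
    "Jul 18 15:01:37 db0001 com.deepin.dde.desktop[20792]: "
    "QThread::start: Thread creation error: 资源暂时不可用",
    "Jul 18 15:03:21 db0001 com.deepin.dde.desktop[20792]: out of memory",
    "Jul 18 15:03:31 db0001 com.deepin.dde.desktop[20792]: (process:5739): "
    "GLib-ERROR (recursed) **: ../../../glib/gmem.c:135: failed to allocate 16368 bytes",
]
events = first_memory_pressure_events(lines)
```

Comparing the earliest timestamp in `events` with the time of the database failure shows whether other processes ran short of memory first.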

The system log also records OOM information for system processes (shown as a screenshot in the original post).

4. Stack error for the kingbase process recorded in the messages log

Jul 18 15:48:45 db0001 kernel: [56884604.361326] CPU: 24 PID: 27697 Comm: kingbase Tainted: G      D           4.19.0-arm64-server #3017
Jul 18 15:48:45 db0001 kernel: [56884604.362620] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Jul 18 15:48:45 db0001 kernel: [56884604.363585] pstate: 20400005 (nzCv daif +PAN -UAO)
Jul 18 15:48:45 db0001 kernel: [56884604.364214] pc : do_last+0x44/0x848
Jul 18 15:48:45 db0001 kernel: [56884604.364620] lr : path_openat+0x60/0x238
Jul 18 15:48:45 db0001 kernel: [56884604.365007] sp : ffff80013f5f7bf0
Jul 18 15:48:45 db0001 kernel: [56884604.365435] x29: ffff80013f5f7bf0 x28: ffff80009e9fc780 
Jul 18 15:48:45 db0001 kernel: [56884604.366086] x27: ffff80013f5f7e4c x26: 0000000000000000 
Jul 18 15:48:45 db0001 kernel: [56884604.366683] x25: 0000000056000000 x24: 0000000000000200 
Jul 18 15:48:45 db0001 kernel: [56884604.367419] x23: ffff8000ef81c280 x22: 0000000000000002 
Jul 18 15:48:45 db0001 kernel: [56884604.368032] x21: ffff800f15d3fd00 x20: 0000000000020241 
Jul 18 15:48:45 db0001 kernel: [56884604.368612] x19: ffff80013f5f7d28 x18: 0000000000000000 
Jul 18 15:48:45 db0001 kernel: [56884604.369224] x17: 0000000000000000 x16: 0000000000000000 
Jul 18 15:48:45 db0001 kernel: [56884604.369979] x15: 0000000000000000 x14: 0000000000000000 
Jul 18 15:48:45 db0001 kernel: [56884604.370787] x13: 0000000000000000 x12: 0000000000000000 
Jul 18 15:48:45 db0001 kernel: [56884604.371731] x11: 0000000000000000 x10: d0d0a0d0a0d0a0bd 
Jul 18 15:48:45 db0001 kernel: [56884604.373494] x9 : 72980288f3b72329 x8 : c1647d4c29ee3c8e 
Jul 18 15:48:45 db0001 kernel: [56884604.374038] x7 : 2df9567f81f8954b x6 : b3fc0fc4aa596fca 
Jul 18 15:48:45 db0001 kernel: [56884604.374571] x5 : 000000000000000a x4 : feff0eff0eff0f00 
Jul 18 15:48:45 db0001 kernel: [56884604.375104] x3 : 0000000000000000 x2 : 0000000000000000 
Jul 18 15:48:45 db0001 kernel: [56884604.375661] x1 : 0000000000000051 x0 : ffff80013f5f7d28 
Jul 18 15:48:45 db0001 kernel: [56884604.376760] Call trace:
Jul 18 15:48:45 db0001 kernel: [56884604.377024]  do_last+0x44/0x848
Jul 18 15:48:45 db0001 kernel: [56884604.377365]  path_openat+0x60/0x238
Jul 18 15:48:45 db0001 kernel: [56884604.378283]  do_filp_open+0x60/0xc0
Jul 18 15:48:45 db0001 kernel: [56884604.378654]  do_sys_open+0x164/0x1f0
Jul 18 15:48:45 db0001 kernel: [56884604.379036]  __arm64_sys_openat+0x20/0x28
Jul 18 15:48:45 db0001 kernel: [56884604.379444]  el0_svc_common+0x90/0x160
Jul 18 15:48:45 db0001 kernel: [56884604.379830]  el0_svc_handler+0x9c/0xa8
Jul 18 15:48:45 db0001 kernel: [56884604.380215]  el0_svc+0x8/0xc
Jul 18 15:48:45 db0001 kernel: [56884604.381133] ---[ end trace e4652b3ad8a636a3 ]---

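II. Resolution
The system messages log shows that the kingbase process hit a stack error at about 15:48, and the database service failed with OOM at that point. As early as 15:00, however, other processes on the host were already failing to load shared libraries and reporting out-of-memory errors. The database OOM was therefore most likely caused by memory exhaustion across the whole system rather than by a problem in the database itself.
The system administrators found that, while memory was short, the antivirus software was using more than ten GB. After the antivirus software was restarted, memory usage dropped; the system remains under observation.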
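
A check like the administrators' can be approximated by ranking processes by resident memory. The sketch below parses the output of `ps -eo pid,rss,comm` (RSS is reported in KiB); the sample text and the process name `av_scanner` are invented for illustration and are not data from this incident.

```python
import subprocess
from typing import List, Tuple


def parse_ps_rss(ps_output: str) -> List[Tuple[int, int, str]]:
    """Parse `ps -eo pid,rss,comm` output into (pid, rss_kib, command), largest RSS first."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # The header line ("PID RSS COMMAND") fails the isdigit checks and is skipped.
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return sorted(rows, key=lambda r: r[1], reverse=True)


def top_memory_consumers(n: int = 5) -> List[Tuple[int, int, str]]:
    """Run ps on the local host and return the n processes with the largest RSS."""
    out = subprocess.run(
        ["ps", "-eo", "pid,rss,comm"], capture_output=True, text=True, check=True
    ).stdout
    return parse_ps_rss(out)[:n]


# Hypothetical sample output, made up for this example.
sample_ps = """\
  PID      RSS COMMAND
    1     9000 systemd
 4242 12582912 av_scanner
27697  2097152 kingbase
"""
ranked = parse_ps_rss(sample_ps)
largest_gib = ranked[0][1] / (1024 * 1024)  # KiB to GiB
```

On a live host, `top_memory_consumers()` gives the same ranking. Running it periodically and recording the results would make it easier to confirm later whether a single non-database process was growing before the failure.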

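III. Summary
When troubleshooting a database failure, analyze not only the database's own logs but also the state of the whole host around the time of the failure, so that the root cause can be identified.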

From: https://www.cnblogs.com/kingbase/p/17736818.html
