I. Environment
Database: PanWeiDB V2.0-S3.0.0_B01
Architecture: Intel + x86_64
Operating system: BCLinux-for-Euler-21.10
Kernel: 4.19.90-2107.6.0.0192.8.oe1.bclinux.x86_64
II. Fault Scenario
1. The problem can be reproduced reliably in the customer environment
gsql -r
show events;
\c bomcdb;        -------- business database name
show events;      -------- database coredump
2. Fault screenshot
3. Check the scheduled jobs in the database
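III. R&D Analysis
1. Request the "symbol table" internally within the company, upload it to the database node host, and configure it by following "How to Use gdb to Analyze Database Instance Crashes (Document No. 10381.1)".
2. Error analysis and localization
gdb stack information: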
[Current thread is 1 (LWP 4164310)]
(gdb) bt
#0 0x0000000001b60b16 in heap_form_minimal_tuple (tupleDescriptor=0x14fd369c1c50, values=0x14fd3cc15680, isnull=0x14fd3cc15720, inTuple=0x0) at heaptuple.cpp:2047
#1 0x000000000129b54e in tuplestore_puttupleslot (state=0x14fd36b14050, slot=<optimized out>, need_transform_anyarray=<optimized out>) at tuplestore.cpp:778
#2 0x0000000001914670 in do_tup_output (tstate=tstate@entry=0x14fd3cc15530, values=values@entry=0x14fd1e9c1eb0, values_len=values_len@entry=10, is_null=is_null@entry=0x14fd1e9c1ea6, is_null_len=is_null_len@entry=10) at execTuples.cpp:1220
#3 0x000000000174f878 in ShowEventCommand (stmt=stmt@entry=0x14fd3cc60cc8, dest=dest@entry=0x14fd3cc15490) at eventcmds.cpp:913
#4 0x000000000186a5b9 in standard_ProcessUtility (processutility_cxt=<optimized out>, dest=0x14fd3cc15490, sent_to_remote=<optimized out>, completion_tag=0x14fd1e9c27a0 "", context=PROCESS_UTILITY_TOPLEVEL, isCTAS=<optimized out>) at utility.cpp:3793
#5 0x000014ffef1cf18f in pgss_ProcessUtility (processutility_cxt=0x14fd1e9c2730, dest=0x14fd3cc15490, sentToRemote=<optimized out>, completionTag=0x14fd1e9c27a0 "", context=PROCESS_UTILITY_TOPLEVEL, isCTAS=<optimized out>) at pg_stat_statements.cpp:787
#6 0x000000000187429b in pgaudit_ProcessUtility (processutility_cxt=0x14fd1e9c2730, dest=<optimized out>, sentToRemote=<optimized out>, completionTag=<optimized out>, context=<optimized out>, isCTAS=<optimized out>) at auditfuncs.cpp:1532
#7 0x000000000186e78a in ProcessUtility (processutility_cxt=0x14fd1e9c2730, dest=0x14fd3cc15490, sent_to_remote=false, completion_tag=0x14fd1e9c27a0 "", context=<optimized out>, isCTAS=<optimized out>) at utility.cpp:1664
#8 0x0000000001860823 in PortalRunUtility (portal=portal@entry=0x14fd36b1a050, utilityStmt=0x14fd3cc60cc8, isTopLevel=isTopLevel@entry=true, dest=dest@entry=0x14fd3cc15490, completionTag=completionTag@entry=0x14fd1e9c27a0 "") at pquery.cpp:1777
#9 0x00000000018616f3 in FillPortalStore (portal=portal@entry=0x14fd36b1a050, isTopLevel=isTopLevel@entry=true) at pquery.cpp:1571
#10 0x0000000001862ad5 in PortalRun (portal=portal@entry=0x14fd36b1a050, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, dest=dest@entry=0x14fd3cc60d78, altdest=altdest@entry=0x14fd3cc60d78, completionTag=completionTag@entry=0x14fd1e9c2a80 "") at pquery.cpp:1174
#11 0x0000000001856c1c in exec_simple_query (query_string=<optimized out>, query_string@entry=0x14fd3cc60050 "show events ;", msg=msg@entry=0x14fd1e9c2bf0, messageType=QUERY_MESSAGE) at postgres.cpp:3399
#12 0x000000000185ccb2 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x14fd3bc39d90, dbname=<optimized out>, username=<optimized out>) at postgres.cpp:9894
#13 0x00000000017ae2a6 in BackendRun (port=port@entry=0x14fd1e9c3170) at postmaster.cpp:10046
#14 0x00000000017d60ec in GaussDbThreadMain<(knl_thread_role)1> (arg=0x14fdb14f2a60) at postmaster.cpp:14871
#15 0x00000000017ae331 in InternalThreadFunc (args=<optimized out>) at postmaster.cpp:15520
#16 0x000014ffdd17df1b in ?? () from /usr/lib64/libpthread.so.0
#17 0x000014ffdd0b333f in clone () from /usr/lib64/libc.so.6
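When a backtrace like this one is pasted as a single run of text, it can help to reduce it to frame number, function, and source location before reading it. The following is a minimal sketch and not part of the original analysis; `split_frames` is a hypothetical helper name, and the regular expressions assume the usual gdb `bt` layout of `#N ADDRESS in FUNCTION (ARGS) at FILE:LINE`.

```python
import re

# Start of a gdb frame, e.g. "#3 0x000000000174f878 in ShowEventCommand (".
# The address part is optional because gdb may omit it for some frames.
FRAME_RE = re.compile(r"#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.+?)\s+\(")
# Trailing source location of a frame, e.g. "at heaptuple.cpp:2047".
LOCATION_RE = re.compile(r"\bat\s+(\S+:\d+)")


def split_frames(bt_text):
    """Return a list of (frame_no, function, location_or_None) tuples."""
    starts = list(FRAME_RE.finditer(bt_text))
    frames = []
    for i, match in enumerate(starts):
        # A frame runs from its own "#N" up to the next frame's "#N".
        end = starts[i + 1].start() if i + 1 < len(starts) else len(bt_text)
        chunk = bt_text[match.start():end]
        locations = LOCATION_RE.findall(chunk)
        location = locations[-1] if locations else None
        frames.append((int(match.group(1)), match.group(2), location))
    return frames


if __name__ == "__main__":
    text = (
        "#0 0x0000000001b60b16 in heap_form_minimal_tuple "
        "(tupleDescriptor=0x14fd369c1c50, inTuple=0x0) at heaptuple.cpp:2047 "
        "#3 0x000000000174f878 in ShowEventCommand "
        "(stmt=stmt@entry=0x14fd3cc60cc8) at eventcmds.cpp:913"
    )
    for number, function, location in split_frames(text):
        print(number, function, location)
```

Frames that come from shared libraries without debug information have no `at FILE:LINE` part, so their location is reported as `None`.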
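IV. Conclusion
The cause of the crash is analyzed as follows:
For the "show events" command, the values entry is empty.
In the system database, the command runs normally.
In the business database, column 8 (counting from 0) contains a datum equal to 0.
Root cause: the crash is related to the data in column 8, failure_msg. If NULL values are interspersed among non-NULL values, the instance kernel crashes with a core dump.
The data in pg_job is faulty: under normal conditions, job_name, end_date, and enable should never be empty.
However, no NOT NULL constraint was applied to these fields when the jobs were created.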
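The pattern described above, a datum of 0 in column 8 that is not flagged as NULL, can be illustrated with a small model. This is only a sketch under assumptions: it is not PanWeiDB source code, `find_inconsistent_columns` is a hypothetical name, and it assumes that the crash in frame #0 corresponds to reading through a pointer datum that is 0 while its null flag is false. The sample addresses are made up.

```python
def find_inconsistent_columns(values, isnull, by_reference):
    """Return indexes of columns that a tuple builder could not safely read.

    values       -- list of datums; 0 stands for a null pointer (assumption)
    isnull       -- list of booleans, True when the column is SQL NULL
    by_reference -- list of booleans, True when the datum is a pointer
                    to the data (for example a text column)
    """
    if not (len(values) == len(isnull) == len(by_reference)):
        raise ValueError("values, isnull and by_reference must have equal length")
    bad = []
    for idx, (datum, null_flag, is_ref) in enumerate(
        zip(values, isnull, by_reference)
    ):
        # A pointer datum of 0 that is not flagged as NULL would be dereferenced.
        if is_ref and not null_flag and datum == 0:
            bad.append(idx)
    return bad


if __name__ == "__main__":
    # Ten output columns, matching values_len=10 in frame #2.
    # Column 8 is modelled as the only by-reference (text) column.
    values = [1, 2, 3, 4, 5, 6, 7, 8, 0, 10]
    isnull = [False] * 10
    by_reference = [False] * 8 + [True, False]
    print(find_inconsistent_columns(values, isnull, by_reference))  # prints [8]
```

In this model, setting `isnull[8] = True` makes the result empty, which is the consistent state a caller would need to guarantee before building the tuple.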
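Temporary workaround:
Use PKG_SERVICE.job_cancel to delete the test jobs from pg_job.
Permanent solution:
When PKG_SERVICE.JOB_SUBMIT creates a job and any of these fields is left unassigned, it raises an error stating that the field must not be empty.
This is fixed in PanWeiDB version 1030.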
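The idea behind the permanent solution, rejecting a job whose required fields are missing at creation time, can be sketched as follows. This is an illustration only and not the actual PKG_SERVICE.JOB_SUBMIT implementation: `validate_job` is a hypothetical helper, and a plain dictionary stands in for the job definition. The field names job_name, end_date, and enable are the ones named in the conclusion.

```python
# Fields that the analysis says should never be empty in pg_job.
REQUIRED_FIELDS = ("job_name", "end_date", "enable")


def validate_job(job):
    """Raise ValueError if any required field is missing or None.

    A value of False for "enable" is accepted: only an absent or None
    value is treated as empty.
    """
    missing = [name for name in REQUIRED_FIELDS if job.get(name) is None]
    if missing:
        raise ValueError("job fields must not be NULL: " + ", ".join(missing))
    return job


if __name__ == "__main__":
    validate_job({"job_name": "nightly", "end_date": "2099-12-31", "enable": True})
    try:
        validate_job({"job_name": "nightly", "end_date": None})
    except ValueError as exc:
        print(exc)
```

Checking at creation time keeps incomplete rows out of the catalog, so a later reader of that data does not have to handle them.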