数据库版本:11.2.0.4 RAC环境
操作系统版本:Asianux Server release 7.3
数据库报错分析
- 接到业务消息,应用无法访问,开发人员查看日志后发现无法连接数据库。
- 查看数据库进程,发现无数据库进程存在,命令如下:
ps -ef | grep smon
- 查看数据库alert日志,报错信息如下:
PMON (ospid: 248987): terminating the instance due to error 484
Wed Feb 15 08:53:54 2023
System state dump requested by (instance=1, osid=248987 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /opt/oracle/oracle/diag/rdbms/xxxx/xxxxx/trace/xxxxx_diag_248998_20230215085354.trc
Instance terminated by PMON, pid = 248987
- 并无其它ORA类错误,查看trc文件(仅截取部分内容)后发现,仅有当时系统状态,也无实际报错。
SYSTEM STATE (level=10)
------------
System global information:
processes: base 0x4a857feee0, size 23040, cleanup 0x4a45a00830
allocation: free sessions 0x4ae673e408, free calls (nil)
control alloc errors: 1092 (process), 1092 (session), 1092 (call)
PMON latch cleanup depth: 1
seconds since PMON's last scan for dead processes: 2
system statistics:
0 OS CPU Qt wait time
2218880 Requests to/from client
2123 logons cumulative
377 logons current
316895 opened cursors cumulative
4953 opened cursors current
12983 user commits
0 user rollbacks
2360027 user calls
1544560 recursive calls
2824397 recursive cpu usage
139 pinned cursors current
503 user logons cumulative
206 user logouts cumulative
916157130 session logical reads
0 session logical reads in local numa group
0 session logical reads in remote numa group
0 session stored procedure space
1814442 CPU used when call started
4648429 CPU used by this session
13669191 DB time
14290758662 cluster wait time
21153141850 concurrency wait time
458454660 application wait time
11148959323 user I/O wait time
9 scheduler wait time
87484635410 non-idle wait time
54628931 non-idle wait count
6541375703609 in call idle wait time
1676394729 session connect time
1676422142 process last non-idle time
511617120 session uga memory
29236137136 session uga memory max
2167152 messages sent
2167153 messages received
1542351 background timeouts
1 remote Oradebug requests
4044554832 session pga memory
33068958032 session pga memory max
0 recursive system API invocations
27720 enqueue timeouts
35968 enqueue waits
0 enqueue deadlocks
5063524 enqueue requests
129584 enqueue conversions
5035305 enqueue releases
1496013 global enqueue gets sync
29624 global enqueue gets async
536653 global enqueue get time
1378253 global enqueue releases
12359025 physical read total IO requests
2788577 physical read total multi block requests
- 因昨天宕机原因是因为MTU网络问题导致,故分析问题被带偏,一直向节点被驱逐方向查找,查找了grid等日志。之前的报错记录与解决方法如下:
# 解决方法:
# 在sysctl中添加
net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640
#日志报错
Fri Feb 10 22:00:14 2023
Archived Log entry 91604 added for thread 1 sequence 48514 ID 0xef764f9c dest 1:
Fri Feb 10 22:33:21 2023
IPC Send timeout detected. Receiver ospid 7259 [
Fri Feb 10 22:33:21 2023
Errors in file /opt/oracle/oracle/diag/rdbms/xxx/xxx/trace/xxx1_lms6_7259.trc:
Fri Feb 10 22:34:11 2023
Detected an inconsistent instance membership by instance 2
Fri Feb 10 22:34:11 2023
Received an instance abort message from instance 2Fri Feb 10 22:34:11 2023
Received an instance abort message from instance 2
Please check instance 2 alert and LMON trace files for detail.
Please check instance 2 alert and LMON trace files for detail.
Fri Feb 10 22:34:11 2023
System state dump requested by (instance=1, osid=7235 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file /opt/oracle/oracle/diag/rdbms/xxx/xxx/trace/xxx1_diag_7221_20230210223411.trc
LMS0 (ospid: 7235): terminating the instance due to error 481
Instance terminated by LMS0, pid = 7235
- 最终查找MOS,找到一篇文章:PMON Terminating the Instance Due To Error 822 (Doc ID 2342018.1)
The collected OS error log shows "Out of memory" and "Free swap = 0kB" errors, and it proved that mman process got killed due to "Out of memory" problem:
# 简单来说就是系统内存溢出,pmon进程被杀掉。
- 防止内存溢出的方法可以配置hugepage来解决,但系统已经配置过hugepage参数了(配置方法见附件),不应该出现此错误,但还是需要排查一下,于是查看了操作系统的message日志。
view /var/log/message
# 找到了下面的日志,果然是内存溢出。
Feb 15 08:53:53 jcsjdb1 kernel: Out of memory: Kill process 249044 (oracle) score 8 or sacrifice child
Feb 15 08:53:53 jcsjdb1 kernel: Killed process 249044 (oracle) total-vm:317602672kB, anon-rss:161904kB, file-rss:3704kB, shmem-rss:8386616kB
Feb 15 08:53:53 jcsjdb1 rtkit-daemon[4520]: Demoted 1 threads.
- 查看hugepage配置
# 确实是配置了
sysctl -a | grep hugepage
vm.hugepages_treat_as_movable = 0
vm.nr_hugepages = 327680
vm.nr_hugepages_mempolicy = 327680
vm.nr_overcommit_hugepages = 0
- hugepage配置完成后,oracle在启动时可以查看参数确定是否使用了hugepage,查看启动日志:
************************ Large Pages Information *******************
Per process system memlock (soft) limit = 64 KB # 注意看这里,可与第10步比对
Total Shared Global Region in Large Pages = 0 KB (0%) # 注意看这里,配置正常应该是100%
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 327680 (640 GB)
Large Pages configured system wide = 327680 (640 GB)
Large Page size = 2048 KB
RECOMMENDATION:
Total System Global Area size is 303 GB. For optimal performance,
prior to the next instance restart:
1. Large pages are automatically locked into physical memory.
Increase the per process memlock (soft) limit to at least 303 GB to lock
100% System Global Area's large pages into physical memory
********************************************************************
- 对比以前的启动参数
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED # 注意看这里,与第九步比对
Total Shared Global Region in Large Pages = 303 GB (100%)
Large Pages used by this instance: 154881 (303 GB)
Large Pages unused system wide = 172799 (337 GB)
Large Pages configured system wide = 327680 (640 GB)
Large Page size = 2048 KB
********************************************************************
- 之后查看了limis.conf参数配置,发现memlock参数未配置
最后总结
所以这次的问题是因为昨天的重启,导致数据库启动时limit参数的memlock默认由UNLIMITED变为了64KB,导致hugepage无法生效导致的,所以将memlock配置上之后,再在升级窗口重启数据库即可。
反思
本次排查问题时,忘记查看操作系统的message日志,也与昨天的宕机误导有关,之后排查问题时,也要注意支查看一下操作系统日志。
附 hugepage配置方法
-
停止数据库实例
-
查看当前系统是否配置HugePages
# 下面的查询中HugePages相关的几个值都为0,表明当前未配值HugePages,其次可以看到Hugepagesize为2MB。
$ grep Huge /proc/meminfo
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
- 修改用户的memlock限制
# 通过修改/etc/security/limits.conf 配值文件来实现
# 该参数的值通常配值位略小于当前的已安装系统内存,如当前你的系统内存为600G,可以做如下设置
* soft memlock 500000000
* hard memlock 500000000
# 上述的设置单位为 kb ,不会降低系统性能。至少也要配值为略大于系统上所有SGA的总和。
# 使用ulimit -l 来校验该设置
- 计算vm.nr_hugepages 的值
- Hugepagesize 使用default的2M即可
- Huge page的数量要设的合适大小,即 HugePages_Total * Hugepagesize > SGA 大小。按照SGA=350G,Hugepagesize=2M来计算的话,HugePages_Total应该要至少350*1024/2=179200。另外要加上ASM占用的部分(同样的算法)
- 数据库参数use_large_pages建议设成ONLY,即huge page必须配置足够大,全部SGA都能用到huge page,db instance才能启动,否则db instance不能启动
- Huge page的数量设太大也是没有用的,因为不会被其他进程用到
- 编辑/etc/sysctl.conf 来设置vm.nr_hugepages参数
#hugepage
vm.nr_hugepages=184320
$ sysctl -w vm.nr_hugepages=184320
$ sysctl -p
- 查看hugepage配置
grep Huge /proc/meminfo
AnonHugePages: 3172352 kB
HugePages_Total: 184320
HugePages_Free: 16072
HugePages_Rsvd: 10953
HugePages_Surp: 0
Hugepagesize: 2048 kB
- 启动数据库实例,观察alert日志