
Analysis of a 10.2.0.4 RAC hang: tracing an abnormal RLM$EVTCLEANUP job through ASH base-table data

Posted: 2023-02-25 17:32:15


A record of an earlier troubleshooting exercise.
Around 08:00 users reported that part of the application was slow; shortly after 08:00 the application stopped responding entirely. The customer's DBA restarted the database instances as an emergency measure, after which the system returned to normal.
The environment was a two-node 10.2.0.4 RAC cluster on AIX, running the customer's most critical business system.
Before the restart the customer had no time to take a hang dump or similar, and the alert logs showed no hang-related traces produced by the background processes.
In a hang like this, if no hang dump / system state dump is taken before intervention, ASH sampling may also have stopped during the hang; after a restart, the information left for root-cause analysis can be quite limited.
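For reference, should such a hang recur, hang-analysis and system state dumps can be captured cluster-wide from SQL*Plus before any restart. This is a sketch; the dump levels shown are common choices, not values from the original incident:

```sql
-- Connect with: sqlplus / as sysdba
-- (if even logins hang, try preliminary mode: sqlplus -prelim "/ as sysdba")
oradebug setmypid
oradebug unlimit
-- Hang analysis across all RAC instances, level 3
oradebug -g all hanganalyze 3
-- System state dump across all RAC instances, level 266 (includes short stacks)
oradebug -g all dump systemstate 266
```

The resulting trace files land in the background/user dump destinations and are what Oracle Support typically asks for when diagnosing a hang.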

After collecting the pre-restart AWR/ASH reports, the NMON monitoring data from the database hosts, and the relevant cluster logs (cluster alert, cssd, etc.), the ASH data was imported into a test environment for analysis.
This showed no abnormal system load, and the AWR/ASH reports alone could not pinpoint the blocking session or SQL. It was ultimately by analyzing the wait chains between sessions in the ASH base-table data that the root blocker was found: a scheduler job on node 1 executing RLM$EVTCLEANUP.

Using the method from MOS note 444766.1 (@$ORACLE_HOME/rdbms/admin/catnoexf.sql) the job was disabled, and the problem did not recur during the subsequent observation period.
The system also suffered from frequent SGA resize operations under automatic SGA management, with large amounts of SGA memory in KGH: NO ACCESS state, and a serious RAC sequence problem: many sequences still had the default CACHE of 20.
These issues were discussed and adjusted later and are beyond the scope of this post.

 

The analysis process and the scripts used on the ASH base-table data follow.

1. Database instance alert logs

Node 1:
Wed May 25 07:49:41 2016
LNS: Standby redo logfile selected for thread 1 sequence 101862 for destination LOG_ARCHIVE_DEST_2
Wed May 25 08:20:12 2016
Errors in file /oracle/admin/bysclm/bdump/bysclm1_j000_40697948.trc:
ORA-12012: error on auto execute of job 46500
ORA-27468: "EXFSYS.RLM$SCHDNEGACTION" is locked by another process
Wed May 25 08:26:11 2016
Thread 1 advanced to log sequence 101863 (LGWR switch)
Current log# 2 seq# 101863 mem# 0: +V5DATA/bysclm/onlinelog/group_2.9855.799568879

Node 2 reported a flood of ORA-3136 errors from around 08:20, after which the instances were restarted manually:
Wed May 25 07:51:13 2016
LNS: Standby redo logfile selected for thread 2 sequence 87896 for destination LOG_ARCHIVE_DEST_2
Wed May 25 08:20:16 2016
WARNING: inbound connection timed out (ORA-3136)
……………………
Wed May 25 08:20:16 2016
WARNING: inbound connection timed out (ORA-3136)
Wed May 25 08:20:16 2016
WARNING: inbound connection timed out (ORA-3136)
Wed May 25 08:25:07 2016
Thread 2 advanced to log sequence 87897 (LGWR switch)
Current log# 6 seq# 87897 mem# 0: +V5DATA/bysclm/onlinelog/group_6.9873.799568001

2. AWR/ASH analysis

DB time was very high, and the top waits included large amounts of enq: SQ - contention and DFS lock handle. But this only shows that the system was slow at the time; it does not establish which process or processes, performing which operation, blocked the others and hung the system.

3. Analyzing the ASH base-table data

The ASH base-table data was exported with the following method and imported into a test environment for analysis:

create table M_ASH as select * from dba_hist_active_sess_history where sample_time >to_timestamp('2016-05-31 08:00:00','yyyy-mm-dd hh24:mi:ss')
and sample_time <to_timestamp('2016-05-31 08:40:00','yyyy-mm-dd hh24:mi:ss');

$ exp aa/bb file=ashdata.dmp tables='M_ASH'
$ imp \'sys/ as sysdba\' full=y ignore=y file=ashdata.dmp
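Once the data is loaded, a quick event breakdown can confirm the AWR picture before walking the wait chains. A sketch, assuming M_ASH is the copy of dba_hist_active_sess_history created above:

```sql
-- Rank wait events by number of ASH samples in the incident window
select event, session_state, count(*) cnt
  from m_ash
 group by event, session_state
 order by cnt desc;
```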

First, the following script lists the wait chains for the specified time window. Its principle: follow the holder recorded in dba_hist_active_sess_history.blocking_session with a CONNECT BY hierarchical query to find the ultimate holder.

In a RAC environment, the ASH sample times on different nodes are often not identical, so the commented-out sample_time predicate in the second part of the SQL can be enabled to allow samples from different nodes that differ by up to one second to be matched (if you do this, adjust any sample_time used for partitioning accordingly). In the output, every row with isleaf=1 is an ultimate holder, and iscycle=1 marks a cycle (at one sample point a waits for b, b waits for c, and c waits for a; if this persists it deserves particular attention).

The following query shows the blocking chains. Note the table used in the script: M_ASH was created from dba_hist_active_sess_history, so adjust the table name to your own. For 11g, the commented-out parts can be enabled with minor changes.

select /*+ parallel(t 2) */ level lv, connect_by_isleaf isleaf, connect_by_iscycle iscycle,
t.sample_id, t.sample_time, t.session_id, t.sql_id, t.session_type, t.event, t.session_state,
/* no blocking_inst_id column in 10g: t.blocking_inst_id, */t.blocking_session, t.blocking_session_status
from M_ASH t
where sample_time >to_timestamp('2016-05-31 08:10:00','yyyy-mm-dd hh24:mi:ss')
and sample_time <to_timestamp('2016-05-31 08:20:00','yyyy-mm-dd hh24:mi:ss')
start with blocking_session is not null
connect by nocycle
prior dbid = dbid
and prior sample_time = sample_time
/*and ((prior sample_time) - sample_time between interval '-1'
second and interval '1' second)*/
/* no blocking_inst_id column in 10g: and prior blocking_inst_id = instance_number*/
and prior blocking_session = session_id
and prior blocking_session_serial# = session_serial#
order siblings by dbid, sample_time;
----------------
The output follows (columns carrying no useful information have been removed). These sessions form multi-level blocking chains; read them using the LV and ISLEAF columns together with BLOCKING_SESSION:
For example, the first four rows show the chain 572 ===>> 523 ===>> 747 ===>> 692, i.e. the root blocker is session 692.
LV ISLEAF SESSION_ID SQL_ID EVENT SESSION BLOCKING_SESSION BLOCKING_SE
--- ------ - ---------- ------------- ----------------------------- ------- ---------------- -----------
1 0 572 78z81r8v6v59w enq: TX - row lock contention WAITING 523 VALID
2 0 523 aa0tzygm25bus enq: SQ - contention WAITING 747 VALID
3 0 747 gjztzxjnpyhsf DFS lock handle WAITING 692 VALID
4 1 692 DFS lock handle WAITING GLOBAL
1 0 684 6kvkh34hxtakk enq: TX - row lock contention WAITING 523 VALID
2 0 523 aa0tzygm25bus enq: SQ - contention WAITING 747 VALID
3 0 747 gjztzxjnpyhsf DFS lock handle WAITING 692 VALID
4 1 692 DFS lock handle WAITING GLOBAL
1 0 639 3y0mvtrvyzpjz enq: SQ - contention WAITING 653 VALID
2 0 653 3y0mvtrvyzpjz DFS lock handle WAITING 692 VALID
3 1 692 DFS lock handle WAITING GLOBAL
...
1 0 695 3phsw3q3hxzjc enq: SQ - contention WAITING 758 VALID
2 0 758 3phsw3q3hxzjc DFS lock handle WAITING 692 VALID
3 1 692 DFS lock handle WAITING GLOBAL
1 0 732 3phsw3q3hxzjc enq: SQ - contention WAITING 758 VALID
2 0 758 3phsw3q3hxzjc DFS lock handle WAITING 692 VALID
3 1 692 DFS lock handle WAITING GLOBAL
The following SQL returns only the root-blocker session rows:
select * from (select /*+ parallel(t 2) */ level lv, connect_by_isleaf isleaf, connect_by_iscycle iscycle,
t.dbid, t.sample_id, t.sample_time, t.session_id, t.sql_id, t.session_type, t.event, t.session_state,
/* no blocking_inst_id column in 10g: t.blocking_inst_id, */t.blocking_session, t.blocking_session_status
from m_ash t
where sample_time >to_timestamp('2016-05-31 08:00:00','yyyy-mm-dd hh24:mi:ss')
and sample_time <to_timestamp('2016-05-31 08:40:00','yyyy-mm-dd hh24:mi:ss')
start with blocking_session is not null
connect by nocycle
prior dbid = dbid
and prior sample_time = sample_time
/*and ((prior sample_time) - sample_time between interval '-1'
second and interval '1' second)*/
/* no blocking_inst_id column in 10g: and prior blocking_inst_id = instance_number*/
and prior blocking_session = session_id
and prior blocking_session_serial# = session_serial#
order siblings by dbid, sample_time)
where isleaf=1;
--------------
LV ISLEAF SAMPLE_TIME SESSION_ID EVENT SESSION BLOCKING_SESSION BLOCKING_SE
--- ------ ------------------------- ---------- ------------------- ------- ---------------- -----------
4 1 31-MAY-16 08.00.07.547 AM 692 DFS lock handle WAITING GLOBAL
4 1 31-MAY-16 08.00.07.547 AM 692 DFS lock handle WAITING GLOBAL
2 1 31-MAY-16 08.00.07.547 AM 692 DFS lock handle WAITING GLOBAL
2 1 31-MAY-16 08.00.07.547 AM 692 DFS lock handle WAITING GLOBAL
4 1 31-MAY-16 08.00.07.547 AM 692 DFS lock handle WAITING GLOBAL
4 1 31-MAY-16 08.00.07.547 AM 692 DFS lock handle WAITING GLOBAL
4 1 31-MAY-16 08.00.07.547 AM 692 DFS lock handle WAITING GLOBAL
3 1 31-MAY-16 08.00.07.547 AM 692 DFS lock handle WAITING GLOBAL
...
3 1 31-MAY-16 08.29.39.237 AM 692 DFS lock handle WAITING GLOBAL
3 1 31-MAY-16 08.29.39.237 AM 692 DFS lock handle WAITING GLOBAL

The following SQL returns only the root blockers, counting how many samples each session_id spent blocking others:
select session_id,count(*) from (select * from (select /*+ parallel(t 2) */ level lv, connect_by_isleaf isleaf, connect_by_iscycle iscycle,
t.sample_time, t.session_id, t.sql_id, t.session_type, t.event, t.session_state,
/* no blocking_inst_id column in 10g: t.blocking_inst_id, */t.blocking_session, t.blocking_session_status
from m_ash t
where sample_time >to_timestamp('2016-05-31 08:00:00','yyyy-mm-dd hh24:mi:ss')
and sample_time <to_timestamp('2016-05-31 08:40:00','yyyy-mm-dd hh24:mi:ss')
start with blocking_session is not null
connect by nocycle
prior dbid = dbid
and prior sample_time = sample_time
/*and ((prior sample_time) - sample_time between interval '-1'
second and interval '1' second)*/
/* no blocking_inst_id column in 10g: and prior blocking_inst_id = instance_number*/
and prior blocking_session = session_id
and prior blocking_session_serial# = session_serial#
order siblings by dbid, sample_time)
where isleaf=1)
group by session_id order by 2 desc;

SESSION_ID COUNT(*)
---------- ----------
692 9274
521 12
626 12
601 8
749 6
638 2
643 2
802 1
522 1
547 1
806 1
538 1

12 rows selected.

4. Identify the blocking session and fix it: disable RLM$EVTCLEANUP

At this point it is clear that session 692 was blocking the other sessions and was the root of the problem. Next, look up that session's details:

SQL> col program for a30
SQL> select instance_number,session_id,sql_id,program,module,action from m_ash where session_id=692;

INSTANCE_NUMBER SESSION_ID SQL_ID PROGRAM MODULE ACTION
--------------- ---------- ------------- ------------------------------ -------------------- --------------------
1 692 (J000) DBMS_SCHEDULER RLM$EVTCLEANUP
1 692 (J000) DBMS_SCHEDULER RLM$EVTCLEANUP
1 692 (J000) DBMS_SCHEDULER RLM$EVTCLEANUP
1 692 (J000) DBMS_SCHEDULER RLM$EVTCLEANUP
1 692 (J000) DBMS_SCHEDULER RLM$EVTCLEANUP
1 692 (J000) DBMS_SCHEDULER RLM$EVTCLEANUP
1 692 (J000) DBMS_SCHEDULER RLM$EVTCLEANUP
1 692 (J000) DBMS_SCHEDULER RLM$EVTCLEANUP
1 692 (J000) DBMS_SCHEDULER RLM$EVTCLEANUP

The blocker was a scheduler job running on node 1, executing RLM$EVTCLEANUP. For background on RLM$EVTCLEANUP see MOS note 444766.1:

These jobs under EXFSYS schema are related to the Expression Filter feature in 10G.
The Expression Filter feature can be installed on an Oracle10G Standard or Enterprise Edition database.
It is supplied as a set of PL/SQL packages, a Java package, a set of dictionary tables, and catalog views.
All these objects are created in a dedicated schema named EXFSYS.
This schema is not part of the typical installation of the database and will not exist before the Expression Filter is installed.

You can disable them by uninstalling the EXFSYS schema as follows:
1. Login as SYSDBA
2. @$ORACLE_HOME/rdbms/admin/catnoexf.sql
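Before and after running catnoexf.sql, the EXFSYS jobs can be checked against the scheduler dictionary. A minimal sketch using the standard DBA_SCHEDULER_JOBS view:

```sql
-- List the Expression Filter maintenance jobs owned by EXFSYS;
-- after catnoexf.sql removes the schema, this should return no rows.
select owner, job_name, state, job_action
  from dba_scheduler_jobs
 where owner = 'EXFSYS';
```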

EXFSYS is the owner of the Oracle Expression Filter component, and per the MOS note above the component can be uninstalled. After the job was disabled, the problem did not recur during the subsequent observation period. As mentioned earlier, the system also had frequent SGA resizes under automatic SGA management, large amounts of KGH: NO ACCESS SGA memory, and a serious RAC sequence problem with many sequences at the default CACHE of 20; these were discussed and adjusted later and are beyond the scope of this post.
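On the RAC sequence issue mentioned above, a common mitigation for enq: SQ - contention is to enlarge the cache of hot sequences. A sketch; the sequence name and cache size below are illustrative, not from the original system:

```sql
-- Find sequences still at a small cache size (the default is 20)
select sequence_owner, sequence_name, cache_size
  from dba_sequences
 where cache_size <= 20;

-- Enlarge the cache of a hot sequence (name and size are examples)
alter sequence scott.order_seq cache 1000;
```

A larger cache means each instance grabs a bigger range of values at once, reducing cross-instance coordination on SEQ$, at the cost of larger gaps in the sequence values after instance restarts.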

From: https://blog.51cto.com/u_1652968/6085624
