[20230214]访问asm相关视图缓慢的分析2.txt
--//前段时间做awr信息删除时看到的情况,当时主要精力放在整理awr信息上,对于遇到的问题放在一边,等到我想分析时,
--//crs crash,所以这篇文章许多内容来自事后的分析.
1.环境:
[email protected]:9014/ywdb> @ ver1
PORT_STRING VERSION BANNER
------------------------------ -------------- --------------------------------------------------------------------------------
x86_64/Linux 2.4.xx 11.2.0.4.0 Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
2.问题提出:
[email protected]:9014/ywdb> @ ashtop sql_id 1=1 &day
Total Distinct Distinct
Seconds AAS %This SQL_ID FIRST_SEEN LAST_SEEN Execs Seen Tstamps
--------- ------- ------- ------------- ------------------- ------------------- ---------- --------
279736 3.2 87% | f632bnsg7382q 2023-02-09 10:55:09 2023-02-10 08:40:00 192 1711
10363 .1 3% | 0sd3pynbkmuhv 2023-02-09 08:42:32 2023-02-10 08:34:01 284 7816
6740 .1 2% | 7crn7t9sw2u19 2023-02-10 08:11:45 2023-02-10 08:40:00 4 1685
5248 .1 2% | 2023-02-09 08:40:14 2023-02-10 08:39:52 2 4836
1685 .0 1% | 0ngfv7gr2cj5v 2023-02-10 08:11:45 2023-02-10 08:40:00 1 1685
1685 .0 1% | 21uwuy3vy423u 2023-02-10 08:11:45 2023-02-10 08:40:00 1 1685
1685 .0 1% | 6yw5r0xnwnc37 2023-02-10 08:11:45 2023-02-10 08:40:00 1 1685
1685 .0 1% | fr9dygfy3az4d 2023-02-10 08:11:45 2023-02-10 08:40:00 1 1685
1685 .0 1% | gvm0actu0bm6d 2023-02-10 08:11:45 2023-02-10 08:40:00 1 1685
517 .0 0% | 327ms8chj9552 2023-02-09 08:40:36 2023-02-10 08:39:42 517 514
...
30 rows selected.
--//主要查看sql_id=f632bnsg7382q的问题.
--//另外 7crn7t9sw2u19 0ngfv7gr2cj5v 21uwuy3vy423u 6yw5r0xnwnc37 fr9dygfy3az4d 也很奇怪仅仅执行1次,需要1685秒.
[email protected]:9014/ywdb> @ sql_id f632bnsg7382q
--SQL_ID = f632bnsg7382q
SELECT 'ASMGROUP' ROWINFO
, t.GROUP_NUMBER
, t.NAME
, t.ALLOCATION_UNIT_SIZE
, ROUND(t.TOTAL_MB/1024, 2)
, ROUND(t.FREE_MB/1024, 2), round(decode(TOTAL_MB, 0, 0,(t.TOTAL_MB-t.FREE_MB)/t.TOTAL_MB*100), 2) usedrate, t.STATE, t.TYPE, t.OFFLINE_DISKS, t.VOTING_FILES
FROM v$asm_diskgroup t;
--//原始代码一行,自己做了格式化处理.
--//查看后面7crn7t9sw2u19 0ngfv7gr2cj5v 21uwuy3vy423u 6yw5r0xnwnc37 fr9dygfy3az4d都是访问ask相关视图语句.猜测asm实例存在
--//问题.我当时2/9号手工执行执行sql_id=f632bnsg7382q语句出现挂起!!
--//其他访问asm视图也存在相似问题,例如:
[email protected]:9014/ywdb> @ sql_id 21uwuy3vy423u
--SQL_ID = 21uwuy3vy423u
select path from v$asm_disk;
[email protected]:9014/ywdb> @ sqlhh f632bnsg7382q 100
BEGIN_INTERVAL_TIME INST_ID SQL_ID PLAN_HASH_VALUE EXECUTIONS ELA_MS_PER_EXEC CPU_MS_PER_EXEC ROWS_PER_EXEC LIOS_PER_EXEC BLKRD_PER_EXEC IOW_MS_PER_EXEC AVG_IOW_MS CLW_MS_PER_EXEC APW_MS_PER_EXEC CCW_MS_PER_EXEC
------------------- ---------- ------------- --------------- ---------- --------------- --------------- ------------- ------------- -------------- --------------- ----------- --------------- --------------- ---------------
2023-02-06 05:00:24 2 f632bnsg7382q 3236948515 12 131 8 2.0 0 0 0 0.0 0 0 0
2023-02-06 07:00:03 2 f632bnsg7382q 3236948515 12 167 9 2.0 0 0 0 0.0 0 0 0
2023-02-07 02:00:07 2 f632bnsg7382q 3236948515 12 344 9 2.0 0 0 0 0.0 0 0 0
2023-02-08 06:00:47 2 f632bnsg7382q 3236948515 12 119 9 2.0 0 0 0 0.0 0 0 0
2023-02-09 02:00:10 2 f632bnsg7382q 3236948515 12 168 9 2.0 0 0 0 0.0 0 0 0
--//查询sqlhh实际上访问tpt/awr/awr_sqlstats_per_exec.sql脚本查询awr报表100天的相关信息(当然缺省没有这么长时间),可以发现
--//实际上执行很快ELA_MS_PER_EXEC最大344,单位毫秒.为什么呢?理论讲这样的情况根本不可能记录在awr相关表中或者讲当时实例真的
--//不忙!!
--//查看awr_sqlstats_per_exec.sql实际上访问dba_hist_snapshot,dba_hist_sqlstat视图.可以跟前面看到的ashtop信息完全对不上.
--//而且每小时执行12次,猜测是某个监测软件执行的,每5分钟执行1次.再仔细看发现有记录的是inst_id=2的信息,我当时的猜测在实例1
--//上的执行挂起了,这样才有可能解析我当前看到的情况.
--//另外我同事看了awr报表,在实例1的awr报表上确实看不到sql_id=f632bnsg7382q的相关信息.我自己也看了,确实没有.为什么不理解!!
--//实例2执行,很快完成:
[email protected]:9015/ywdb/ywdb2> select count(*) from v$asm_diskgroup t;
COUNT(*)
----------
2
--//实例1执行,hang:
[email protected]:9014/ywdb/ywdb1> select count(*) from v$asm_diskgroup t;
--//挂起!!
3.继续分析:
[email protected]:9014/ywdb/ywdb1> @ ashtop event,inst_id sql_id='f632bnsg7382q' &day
Total Distinct Distinct
Seconds AAS %This EVENT INST_ID FIRST_SEEN LAST_SEEN Execs Seen Tstamps
--------- ------- ------- ------------------------------------------ ---------- ------------------- ------------------- ---------- --------
279544 3.2 100% | 1 2023-02-10 08:26:01 2023-02-10 08:54:14 166 1684
20 .0 0% | 2 2023-02-09 11:05:17 2023-02-10 08:44:14 20 20
7 .0 0% | ASM file metadata operation 2 2023-02-09 10:55:09 2023-02-10 07:33:02 7 7
--//出现ASM file metadata operation等待,根本不是怎么主要等待事件.但是可以发现实例1 在执行时CPU占用100%,可以确定该类似语
--//句在实例1执行时挂起.
--//这个时候我突然以前测试过连接数据库的进程在访问asm相关视图时会启动一个访问asm实例的进程,然后将相关信息返回给数据库进
--//程.也就是讲问题出在访问asm的连接上.等到我想看asm实例时,crs down掉.无法继续分析...
4.事后分析:
--//数据库已经重启!只能查询dba_hist_ACTIVE_SESS_HISTORY视图.
SELECT min(sql_exec_id),max(sql_exec_id)
FROM dba_hist_ACTIVE_SESS_HISTORY
where sql_id = 'f632bnsg7382q' and instance_number=1
MIN(SQL_EXEC_ID) MAX(SQL_EXEC_ID)
---------------- ----------------
16777216 16777381
--//一共产生166,16777381-16777216+1 = 166.
SELECT sql_exec_id,min(sample_time),client_id
FROM dba_hist_ACTIVE_SESS_HISTORY
where sql_id = 'f632bnsg7382q' and instance_number=1 and sql_exec_id between 16777216 and 16777381
group by sql_exec_id,client_id
order by sql_exec_id
SQL_EXEC_ID MIN(SAMPLE_TIME) CLIENT_ID
----------- ----------------------- ---------------
16777216 2023-02-03 09:02:59.411 192.168.101.217
16777217 2023-02-03 09:02:59.411 192.168.101.217
...
16777370 2023-02-03 09:02:59.411 192.168.101.217
16777371 2023-02-03 09:02:59.411 192.168.101.227
16777372 2023-02-03 09:02:59.411 192.168.101.227
16777373 2023-02-03 09:02:59.411 192.168.101.227
16777374 2023-02-03 09:02:59.411 192.168.101.227
16777375 2023-02-03 09:02:59.411 192.168.101.227
16777376 2023-02-03 09:02:59.411 192.168.101.227
16777377 2023-02-03 09:02:59.411 192.168.101.227
16777378 2023-02-03 09:02:59.411 192.168.101.227
16777379 2023-02-03 09:02:59.411 192.168.101.227
16777380 2023-02-03 09:02:59.411 192.168.90.217
16777381 2023-02-09 09:20:24.302 192.168.98.XX
166 rows selected.
--//有点奇怪的是2023-02-03 09:02:59.411时间点有165个会话同时执行.IP来自192.168.101.217,192.168.101.227,192.168.90.217
--//主要来自192.168.101.217.
--//注:最后1个IP=192.168.98.XX是我发现问题手工执行的.
--//问了一下3个IP来自同一个主机,做了负载均衡.
--//我开始以为是我做awr信息删除导致的问题(当时在做这方面操作),再仔细分析不可能!!因为不大可能一个IP同一个时间点执行相同的
--//sql语句,也就是因为执行时挂起,应该是慢慢累积的,再早的awr信息已经删除,已经无法确定出现问题的最早时间点.
--//有165个session一直在运行这条语句在实例1,从时间点看比2023-02-03 09:02:59.411还有早已经出现问题,为什么以后就不再有相同
--//sql语句继续执行呢?难道因为负载问题移到实例2执行了吗?
--//找到一个链接:https://www.cnblogs.com/jyzhao/p/8379720.html,感觉我遇到的问题跟他很相似.
--//我估计在查询asm视图时出现挂起,出现阻塞现象.
--//实际上在看链接https://www.cnblogs.com/jyzhao/p/8379720.html时,我执行如下:
$ crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
$ crsctl check cluster -all
^C
--//等很久没有出现结果,按ctrl+c中断!!估计实际上已经出现问题了.我还没来得及看链接的具体内容,crs down了.
--//也许问题还会再次出现,只能等待下次出现再做分析了!!
--//发现自己不了解dba_hist_active_sess_history视图,sql_exec_start字段可以确定执行的开始时间.
SELECT DISTINCT sql_exec_id, sql_exec_start
FROM dba_hist_ACTIVE_SESS_HISTORY
WHERE sql_id = 'f632bnsg7382q'
AND instance_number = 1
AND sql_exec_id BETWEEN 16777216 AND 16777381
ORDER BY sql_exec_start;
SQL_EXEC_ID SQL_EXEC_START
----------- -------------------
16777216 2022-10-08 18:37:03
16777217 2022-10-08 18:47:04
16777218 2022-10-08 18:57:03
16777219 2022-10-08 19:07:04
16777220 2022-10-08 19:17:06
16777221 2022-10-08 19:27:06
16777222 2022-10-08 19:37:08
16777223 2022-10-08 19:47:09
16777224 2022-10-08 19:57:09
16777225 2022-10-08 20:07:10
16777226 2022-10-08 20:17:12
16777227 2022-10-08 20:27:12
16777228 2022-10-08 20:37:13
16777229 2022-10-08 20:47:15
16777230 2022-10-08 20:57:16
16777231 2022-10-08 21:07:18
16777232 2022-10-08 21:17:20
16777233 2022-10-08 21:27:21
16777234 2022-10-08 21:37:22
16777235 2022-10-08 21:47:23
16777236 2022-10-08 21:57:24
16777237 2022-10-08 22:07:26
16777238 2022-10-08 22:17:28
16777239 2022-10-08 22:27:30
16777240 2022-10-08 22:37:30
16777241 2022-10-08 22:47:31
16777242 2022-10-08 22:57:34
16777243 2022-10-08 23:07:35
16777244 2022-10-08 23:17:37
16777245 2022-10-08 23:27:39
16777246 2022-10-08 23:37:40
16777247 2022-10-08 23:47:42
16777248 2022-10-08 23:57:51
16777249 2022-10-09 00:07:46
16777250 2022-10-09 00:17:47
16777251 2022-10-09 00:27:50
16777252 2022-10-09 00:37:51
16777253 2022-10-09 00:47:53
16777254 2022-10-09 00:57:56
16777255 2022-10-09 01:07:59
16777256 2022-10-09 01:18:00
16777257 2022-10-09 01:28:02
16777258 2022-10-09 01:38:04
16777259 2022-10-09 01:48:06
16777260 2022-10-09 01:58:08
16777261 2022-10-09 02:18:12
16777262 2022-10-09 02:28:14
16777263 2022-10-09 02:38:17
--// 时间不再连续...
16777264 2022-10-09 08:08:27
16777265 2022-10-09 11:28:53
16777266 2022-10-09 11:48:53
16777267 2022-10-09 15:40:13
16777268 2022-10-09 15:50:13
16777269 2022-10-09 16:10:12
16777270 2022-10-09 16:20:13
16777271 2022-10-09 16:30:17
16777272 2022-10-09 16:40:17
16777273 2022-10-09 16:50:22
16777274 2022-10-09 17:10:24
16777275 2022-10-09 17:20:26
16777276 2022-10-09 17:30:31
16777277 2022-10-09 17:50:33
16777278 2022-10-09 18:00:36
16777279 2022-10-09 18:20:50
16777280 2022-10-09 18:30:41
16777281 2022-10-09 18:40:44
16777282 2022-10-09 18:50:45
16777283 2022-10-09 19:00:48
16777284 2022-10-09 19:10:52
16777285 2022-10-09 19:20:54
16777286 2022-10-09 19:31:09
16777287 2022-10-09 19:40:59
16777288 2022-10-09 19:51:01
16777289 2022-10-09 20:01:07
16777290 2022-10-09 20:11:07
16777291 2022-10-09 20:21:10
16777292 2022-10-09 20:31:13
16777293 2022-10-09 20:41:15
16777294 2022-10-09 20:51:30
16777295 2022-10-09 21:01:22
16777296 2022-10-09 21:11:25
16777297 2022-10-09 21:21:40
16777298 2022-10-09 21:31:32
16777299 2022-10-09 21:41:32
16777300 2022-10-09 21:51:35
16777301 2022-10-09 22:01:53
16777302 2022-10-09 22:11:42
16777303 2022-10-09 22:21:58
16777304 2022-10-09 22:32:31
16777305 2022-10-09 22:43:09
16777306 2022-10-09 22:53:40
16777307 2022-10-09 23:04:12
16777308 2022-10-09 23:14:43
16777309 2022-10-09 23:25:05
16777310 2022-10-09 23:35:27
16777311 2022-10-09 23:46:07
16777312 2022-10-09 23:59:12
16777313 2022-10-10 00:09:27
16777314 2022-10-10 00:19:28
16777315 2022-10-10 00:29:16
16777316 2022-10-10 00:39:32
16777317 2022-10-10 00:49:33
16777318 2022-10-10 00:59:21
16777319 2022-10-10 01:09:41
16777320 2022-10-10 01:19:43
16777321 2022-10-10 01:29:31
16777322 2022-10-10 01:39:45
16777323 2022-10-10 01:49:49
16777324 2022-10-10 01:59:52
16777325 2022-10-10 02:09:53
16777326 2022-10-10 02:19:54
16777327 2022-10-10 02:29:56
16777328 2022-10-10 02:39:58
16777329 2022-10-10 02:50:01
16777330 2022-10-10 03:00:01
16777331 2022-10-10 03:10:03
16777332 2022-10-10 03:20:07
16777333 2022-10-10 03:30:09
16777334 2022-10-10 03:40:12
16777335 2022-10-10 03:50:13
16777336 2022-10-10 04:00:16
16777337 2022-10-10 04:10:18
16777338 2022-10-10 04:20:19
16777339 2022-10-10 04:30:21
16777340 2022-10-10 04:40:22
16777341 2022-10-10 04:50:25
16777342 2022-10-10 05:00:32
16777343 2022-10-10 05:10:33
16777344 2022-10-10 05:20:33
16777345 2022-10-10 05:30:34
16777346 2022-10-10 05:40:39
16777347 2022-10-10 05:50:41
16777348 2022-10-10 06:00:44
16777349 2022-10-10 06:10:45
16777350 2022-10-10 06:20:49
16777351 2022-10-10 06:30:52
16777352 2022-10-10 06:40:53
16777353 2022-10-10 06:50:55
16777354 2022-10-10 07:01:00
16777355 2022-10-10 07:11:02
16777356 2022-10-10 07:21:05
16777357 2022-10-10 07:31:08
16777358 2022-10-10 07:41:12
16777359 2022-10-10 07:51:15
16777360 2022-10-10 08:01:07
16777361 2022-10-10 08:11:11
16777362 2022-10-10 08:21:30
16777363 2022-10-10 08:31:33
16777364 2022-10-10 08:41:38
16777365 2022-10-10 08:51:52
16777366 2022-10-10 09:02:31
16777367 2022-10-10 09:13:51
16777368 2022-10-10 09:30:25
16777369 2022-10-10 09:59:14
16777370 2022-10-10 11:34:56
16777371 2022-10-17 09:06:15
16777372 2022-10-17 09:13:06
16777373 2022-10-17 09:23:08
16777374 2022-10-17 09:23:09
16777375 2022-10-17 09:26:17
16777376 2022-10-17 09:32:48
16777377 2022-10-17 09:43:08
16777378 2022-10-17 09:53:08
16777379 2022-10-17 10:03:09
16777380 2023-01-30 16:39:02
16777381 2023-02-09 09:20:13
166 rows selected.
--//10分钟1次.1小时6次.而前面sqlhh看到每小时12次.难道正常情况下另外一个实例每小时也执行6次.
--//出现负载均衡问题,导致后面的许多执行发生在实例2吗?
--//最早出现的时间点是2022-10-08 18:37:03.到现在才发现,不知道如何管理数据库的.
--//转抄链接的内容,也许以后会用上:
有了上面的信息,再次查询MOS就可以匹配到 - Diskgroup Mount Hangs with RBAL Waiting on 'GPnP Get Item' and 'enq: DD -
contention' (文档 ID 1375505.1) - "crsctl check cluster -all" command gives CRS-4404, CRS-4405 errors (文档 ID
1392934.1)
参照MOS解决方案:
Use the following command as root on Compute Node 3 to identify the current gpnpd.bin process:
ps -ef | grep gpnpd.bin
This will give you an output similar to:
ps -ef | grep pnp
oracle 31422 1 0 2011 ? 00:05:40 /u01/app/11.2.0.2/grid/bin/gpnpd.bin
Please note the process ID number of the current gpnpd.bin process. In this example, = 31422
Use the following command on Compute Node 3 to stop the current gpnpd.bin process:
kill -HUP
for example, kill -HUP 31422
--//补充.
$ kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP
6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO 30) SIGPWR
31) SIGSYS 34) SIGRTMIN 35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3
38) SIGRTMIN+4 39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8
43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7
58) SIGRTMAX-6 59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2
63) SIGRTMAX-1 64) SIGRTMAX
--//注:HUP 相当于1.这里实际的执行效果相当于-9参数.
After a few moments, gpnpd.bin should be automatically restarted by GI, run the command from step # 1 to verify that a
new gpnpd.bin process exists