This is the first part in a series of articles entitled "The right tool for the right job". In each article, I will use tools and ideas made available to us by Oracle gurus like Tanel Poder, Jonathan Lewis, Tom Kyte, Kerry Osborne and many others. I will try to show you how to use these freely available tools to troubleshoot real-life problems I've encountered at client sites. You can use the scripts "as-is" (note that you should always test any script before using it on production systems), modify them to suit your needs, or even write your own based on ideas found in the original scripts.
Some background first: the database is 10.2.0.4 running in a Sun M9000 dynamic domain with 128 GB of memory and 32 CPUs (2520 MHz) allocated using the fair share scheduler (FSS). We are experiencing high "latch: cache buffers chains" (CBC) contention. The problem occurs intermittently (several times a day) and also resolves by itself. Here is an overview of the issue with Quest Performance Analysis (as the client pays for it, let's use it to show a pretty graph ^_^):
CBC latches are shown in pink. I did the whole analysis between 10:00 and 10:15 AM, while the problem was actually happening.
What is a latch?
Latches are simple, low-level serialization mechanisms that protect shared SGA data structures and shared code segments (such as various internal linked list modifications, shared pool memory allocation, library cache object lookups and so on) from simultaneous session access. They are designed to be acquired and freed very quickly. Latches are very low-level locks and are inaccessible to users and applications, which cannot directly acquire or release them. They are similar in purpose to locks: latches protect internal memory structures while locks protect data structures.
The database buffer cache is the portion of the SGA that holds copies of data blocks read from datafiles so they can be accessed faster than by reading them off disks.
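As a side note, you can get a quick feel for how busy these latches are and how many child latches protect the buffer cache hash chains with a couple of simple queries against v$latch and v$latch_children (a minimal sketch, nothing specific to this incident):

SQL> select name, gets, misses, sleeps
  2  from v$latch
  3  where name = 'cache buffers chains';

SQL> select count(*) child_latches
  2  from v$latch_children
  3  where name = 'cache buffers chains';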
We can usually get more information about a wait event by checking its P1, P2 and P3 parameters. Let's look up in V$EVENT_NAME what P1, P2 and P3 mean for "latch: cache buffers chains".
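The lookup itself is a one-liner against V$EVENT_NAME, whose PARAMETER1, PARAMETER2 and PARAMETER3 columns document the meaning of P1, P2 and P3 for each event (a minimal version shown here):

SQL> select name, parameter1, parameter2, parameter3
  2  from v$event_name
  3  where name = 'latch: cache buffers chains';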
From the above query, P1 is the address of the latch for the CBC latch wait. Now we can query ASH, grouping the CBC latch waits by latch address, to find out which address is experiencing the most contention (i.e. the most waits).
SQL> select * from (
  2    select
  3      event,
  4      trim(to_char(p1, 'XXXXXXXXXXXXXXXX')) latch_addr,
  5      trim(round(ratio_to_report(count(*)) over () * 100, 1))||'%' pct,
  6      count(*)
  7    from
  8      v$active_session_history
  9    where
 10      event = 'latch: cache buffers chains'
 11      and session_state = 'WAITING'
 12    group by event, p1
 13    order by count(*) desc
 14  )
 15  where rownum <= 10;

EVENT                                              LATCH_ADDR        PCT        COUNT(*)
-------------------------------------------------- ----------------- -------- ----------
latch: cache buffers chains                        967757968         78.8%           178
latch: cache buffers chains                        964F892C0         2.7%              6
latch: cache buffers chains                        CE2984B50         2.2%              5
latch: cache buffers chains                        9676FC808         1.3%              3
latch: cache buffers chains                        963019308         1.3%              3
latch: cache buffers chains                        963DE63F0         .9%               2
latch: cache buffers chains                        9655A1E38         .9%               2
latch: cache buffers chains                        966F65E28         .9%               2
latch: cache buffers chains                        963276CE8         .9%               2
latch: cache buffers chains                        CE1AD8E38         .9%               2

10 rows selected.
Looking at the top line of the output, 78.8% of the waits on "latch: cache buffers chains" involve the child latch at address 967757968.
Now that we know which CBC child latch address has the most waits, we can use Tanel Poder's latchprofx script (http://files.e2sn.com/scripts/latchprofx.sql) to find the root cause of the contention. We will monitor the holders (sid and sql_id) of this particular child latch. The column called "object", only available in x$ksuprlat, shows information about the object protected by a given latch. For cache buffers chains latches, this column shows the data block address (DBA) of the block whose access caused the latch get.
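The exact call is not reproduced here; from memory, latchprofx takes the list of columns to sample, a SID filter, a latch name or address filter and a sample count (check the header of latchprofx.sql for the authoritative parameter list), so the invocation looked roughly like this:

SQL> -- columns to sample, sid filter (% = all sessions), hot child latch address, number of samples
SQL> @latchprofx sid,name,sqlid,object % 967757968 100000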
The data block address accessed was 31B7E54, and the sessions holding the child latch most of the time were executing the query with sql_id amnf5uaxyn12c.
Next, we will convert that data block address to a file# and block# using dbms_utility and try to find which segment it belongs to by querying x$bh (bh stands for buffer headers). Once again, Tanel wrote a script for that (dba.sql, DBA standing for Data Block Address ^^). You will find it in tpt_public_unixmac.tar.gz, available on Tanel's new website.
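dba.sql wraps this up for you; the underlying idea is roughly the sketch below. The dbms_utility functions are standard; the x$bh lookup must be run as SYS, and the column names (obj, dbarfil, dbablk, tch) are given from memory, so double-check them against the real script:

SQL> select dbms_utility.data_block_address_file(to_number('31B7E54', 'XXXXXXXX'))  file#,
  2         dbms_utility.data_block_address_block(to_number('31B7E54', 'XXXXXXXX')) block#
  3  from dual;

SQL> -- map the buffer to its segment via the buffer headers (run as SYS)
SQL> select o.owner, o.object_name, o.object_type, bh.tch touch_count
  2  from x$bh bh, dba_objects o
  3  where bh.obj     = o.data_object_id
  4    and bh.dbarfil = &file_no
  5    and bh.dbablk  = &block_no;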
To solve the problem once and for all, we must find the SQL and understand why the application is hitting this block so hard.
We can get the sql_fulltext of a given sql_id from V$SQL, V$SQLAREA or V$SQLSTATS. I personally prefer getting it from V$SQLSTATS because it is faster, more scalable, and has greater data retention (the statistics may still appear in this view even after the cursor has been aged out of the shared pool).
I have a little script for that: sqltext.sql
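sqltext.sql is not reproduced here; at its core it boils down to a lookup of this kind (the &sqlid substitution variable is just my shorthand for the sql_id found above):

SQL> select sql_fulltext
  2  from v$sqlstats
  3  where sql_id = '&sqlid';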
If you have TOAD, one useful feature is its SQL formatting tool (don't get me wrong, I'm no big fan of GUI tools; I use SQL*Plus 99% of the time ;-)).
Let's have a look at the execution plan of this query using dbms_xplan.display_cursor:
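The plan comes straight from the cursor cache; a minimal call looks like this (passing NULL as the child number lists all child cursors; the plan output itself is not reproduced here, but the plan line numbers discussed below refer to it):

SQL> select * from table(dbms_xplan.display_cursor('amnf5uaxyn12c', null, 'TYPICAL'));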
The operation connect by with filtering (lines 16 to 22 of the plan) is used to process hierarchical queries. It is characterized by two child operations. The first one is used to get the root of the hierarchy, and the second one is executed once for each level in the hierarchy. Each of those two child operations includes an index range scan of itrn2tau.
Looking at lines 4 to 12 of the plan, the optimizer assumes that the nested loops join (lines 5 to 9) between ttrntau (accessed via index itrn2tau) and thdrpdttau will return one row. It also assumes that accessing table ttrntau via index itrn2tau will return one row (lines 11-12). It then simply puts those two rows together with a cartesian product and joins that result to table ttrntau, accessed via index itrn2tau (lines 13-14), with a filter operation (line 3). It is because of the low expected cardinality that the optimizer chose a merge join cartesian. This is a perfectly valid approach and is very fast here.
The filter operation used for the last join is very similar to a nested loops join: for each row in the driving row source, Oracle evaluates the filter condition to decide whether or not to keep the row. However, a filter can be much more efficient than a nested loop because it can remember the results of previous probes into the second table, effectively turning the probe into an in-memory lookup.
As seen previously, a single execution of this query accesses index itrn2tau several times. Furthermore, the query was executed concurrently by many sessions, which is why the latches were hammered so hard. This is all about scalability: the query runs fine on a single-user system but does not scale when run concurrently by hundreds of users.
Fortunately, this query was developed in-house, so I asked the dev team to work on it.