1. 事故描述
昨日上午十点多,我们的基础应用发生生产事故。具体表象为系统出现假死无响应。
2. 硬件 LB
查看硬件负载路由情况:
ARRAY-3(config)#sh stati sl r tcp JiChuYinYong_a_8001
Real service JiChuYinYong_a_8001 192.168.1.137 8001 UP ACTIVE
Main health check: 192.168.1.137 8001 tcp UP
Max Conn Count: 10000
Current Connection Count: 1685
Outstanding Request Count: 1685
Total Hits: 13532432
Total Bytes In: 1822052478897
Total Bytes Out: 387396621959
Total Packets In: 2993154102
Total Packets Out: 1973275757
Average Response time: 1.165 ms
ARRAY-3(config)#sh stati sl r tcp JiChuYinYong_b_8001
Real service JiChuYinYong_b_8001 192.168.1.138 8001 UP ACTIVE
Main health check: 192.168.1.138 8001 tcp UP
Max Conn Count: 10000
Current Connection Count: 19
Outstanding Request Count: 19
Total Hits: 3408819
Total Bytes In: 658529070959
Total Bytes Out: 165733759282
Total Packets In: 1177272070
Total Packets Out: 796176948
Average Response time: 1.175 ms
两台路由情况
1685 : 19,硬件负载没有将流量均衡负载。出问题的是 137 那台机器。
3. weblogic 日志
3.1 问题的制造者
weblogic 日志很多,暴露出来的问题很多。但是很多时候并非所有暴露出来的问题都是问题:
• 有一些问题在低并发时没问题,一到高并发就出问题。这种问题是问题的制造者。
• 有一些问题在低并发时没问题,高并发时本来也没有问题,但是由于系统资源被其它问题耗光,这种问题也就“暴露”出来了。这种问题其实是问题的受害者。
作为系统管理人员,我们需要具备在大量的问题中,敏锐地找出问题的制造者,直指根源的能力。
查看出问题那台服务器的 weblogic 日志,在出故障的时间段内,最先出现问题的日志有以下信息:
####<Mar 25, 2017 9:29:09 AM CST> <Error> <WebLogicServer> <PSFPWEB01> <psfp_in> <[ACTIVE] ExecuteThread: '52' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1490405349068> <BEA-000337> <[STUCK] ExecuteThread: '18' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "686" seconds working on the request "weblogic.servlet.internal.ServletRequestImpl@2308bcd6[
POST /psfp-inagency/resource/com.defonds.core.psfp.inagency.biz.issued.resource.InBatchIssuedResource/issued HTTP/1.1
Content-Type: application/json
Cache-Control: no-cache
Pragma: no-cache
User-Agent: Java/1.6.0_45
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive
Content-Length: 472434
]", which is more than the configured time (StuckThreadMaxTime) of "600" seconds. Stack trace:
Thread-411 "[STUCK] ExecuteThread: '18' for queue: 'weblogic.kernel.Default (self-tuning)'" <alive, in native, suspended, priority=1, DAEMON> {
jrockit.net.SocketNativeIO.readBytesPinned(SocketNativeIO.java:???)
jrockit.net.SocketNativeIO.socketRead(SocketNativeIO.java:24)
java.net.SocketInputStream.socketRead0(SocketInputStream.java:???)
java.net.SocketInputStream.read(SocketInputStream.java:107)
java.io.BufferedInputStream.fill(BufferedInputStream.java:189)
java.io.BufferedInputStream.read(BufferedInputStream.java:236)
^-- Holding lock: java.io.BufferedInputStream@47fee815[thin lock]
weblogic.net.http.MessageHeader.isHTTP(MessageHeader.java:214)
weblogic.net.http.MessageHeader.parseHeader(MessageHeader.java:141)
weblogic.net.http.HttpClient.parseHTTP(HttpClient.java:460)
weblogic.net.http.HttpURLConnection.getInputStream(HttpURLConnection.java:328)
weblogic.net.http.SOAPHttpURLConnection.getInputStream(SOAPHttpURLConnection.java:37)
^-- Holding lock: weblogic.net.http.SOAPHttpURLConnection@5971479e[thin lock]
weblogic.net.http.HttpURLConnection.getResponseCode(HttpURLConnection.java:939)
com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:158)
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
com.sun.jersey.api.client.Client.handle(Client.java:645)
com.sun.jersey.api.client.WebResource.handle(WebResource.java:679)
com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:558)
com.defonds.rest.core.client.proxy.ResourceJsonInvocationHandler.invoke(ResourceJsonInvocationHandler.java:33)
$Proxy437.issued(Unknown Source)
com.defonds.core.psfp.inagency.biz.issued.resource.impl.BatchIssuedResourceImpl.issued(BatchIssuedResourceImpl.java:63)
sun.reflect.GeneratedMethodAccessor4368.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:575)
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:183)
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:185)
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:93)
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:80)
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:93)
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:72)
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1495)
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1441)
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1386)
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1377)
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:378)
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:538)
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:637)
javax.servlet.http.HttpServlet.service(HttpServlet.java:815)
weblogic.servlet.internal.StubSecurityHelper$ServletServiceAction.run(StubSecurityHelper.java:224)
weblogic.servlet.internal.StubSecurityHelper.invokeServlet(StubSecurityHelper.java:108)
weblogic.servlet.internal.ServletStubImpl.execute(ServletStubImpl.java:206)
weblogic.servlet.internal.TailFilter.doFilter(TailFilter.java:26)
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:55)
com.defonds.core.psfp.common.web.filter.LoggingProcessFilter.doFilterInternal(LoggingProcessFilter.java:31)
org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:89)
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:55)
org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:82)
org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:89)
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:55)
weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.doIt(WebAppServletContext.java:3656)
weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3650)
weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:308)
weblogic.security.service.SecurityManager.runAs(SecurityManager.java:117)
weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2204)
weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2126)
weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1412)
weblogic.work.ExecuteThread.execute(ExecuteThread.java:198)
weblogic.work.ExecuteThread.run(ExecuteThread.java:165)
}
这个 Thread-411 是第一个粘滞线程,相关业务代码如下:
com.defonds.core.psfp.inagency.biz.issued.resource.impl.BatchIssuedResourceImpl.issued(BatchIssuedResourceImpl.java:63)
前三个粘滞线程都是这种。后续带有同样表述的粘滞线程还有十几处。这是第一类粘滞线程。
第四个粘滞线程:
####<Mar 25, 2017 9:33:09 AM CST> <Error> <WebLogicServer> <PSFPWEB01> <psfp_in> <[ACTIVE] ExecuteThread: '94' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1490405589083> <BEA-000337> <[STUCK] ExecuteThread: '19' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "626" seconds working on the request "weblogic.servlet.internal.ServletRequestImpl@5c1546e8[
POST /psfp-issued/resource/com.defonds
.core.psfp.issued.biz.issued.resource.BatchIssuedResource/issued HTTP/1.1
Content-Type: application/json
User-Agent: Java1.6.0_33
Accept: text/html, image/gif, image/jpeg, */*; q=.2
Connection: Keep-Alive
Content-Length: 472401
]", which is more than the configured time (StuckThreadMaxTime) of "600" seconds. Stack trace:
Thread-412 "[STUCK] ExecuteThread: '19' for queue: 'weblogic.kernel.Default (self-tuning)'" <alive, in native, suspended, priority=1, DAEMON> {
jrockit.net.SocketNativeIO.readBytesPinned(SocketNativeIO.java:???)
jrockit.net.SocketNativeIO.socketRead(SocketNativeIO.java:24)
java.net.SocketInputStream.socketRead0(SocketInputStream.java:???)
java.net.SocketInputStream.read(SocketInputStream.java:107)
oracle.net.ns.Packet.receive(Packet.java:247)
oracle.net.ns.DataPacket.receive(DataPacket.java:92)
oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:166)
oracle.net.ns.NetInputStream.read(NetInputStream.java:111)
oracle.net.ns.NetInputStream.read(NetInputStream.java:94)
oracle.net.ns.NetInputStream.read(NetInputStream.java:79)
oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:118)
oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:73)
oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1034)
oracle.jdbc.driver.T4CMAREngine.unmarshalSB1(T4CMAREngine.java:1016)
oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:560)
oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:67)
oracle.jdbc.driver.T4CPreparedStatement.executeForRows(T4CPreparedStatement.java:949)
oracle.jdbc.driver.T4CPreparedStatement.executeMaybeDescribe(T4CPreparedStatement.java:832)
oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1149)
oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3367)
oracle.jdbc.driver.OraclePreparedStatement.execute(OraclePreparedStatement.java:3480)
^-- Holding lock: oracle.jdbc.driver.T4CConnection@224b743d[thin lock]
oracle.jdbc.driver.OraclePreparedStatementWrapper.execute(OraclePreparedStatementWrapper.java:1374)
weblogic.jdbc.wrapper.PreparedStatement.execute(PreparedStatement.java:87)
sun.reflect.GeneratedMethodAccessor232.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:575)
org.jdbcdslog.PreparedStatementLoggingHandler.invoke(PreparedStatementLoggingHandler.java:29)
$Proxy812.execute(Unknown Source)
org.apache.ibatis.executor.statement.PreparedStatementHandler.query(PreparedStatementHandler.java:55)
org.apache.ibatis.executor.statement.RoutingStatementHandler.query(RoutingStatementHandler.java:70)
sun.reflect.GeneratedMethodAccessor3797.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:575)
org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:55)
$Proxy1255.query(Unknown Source)
org.apache.ibatis.executor.SimpleExecutor.doQuery(SimpleExecutor.java:52)
org.apache.ibatis.executor.BaseExecutor.queryFromDatabase(BaseExecutor.java:265)
org.apache.ibatis.executor.BaseExecutor.query(BaseExecutor.java:129)
org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:85)
org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:79)
org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:100)
org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:95)
org.apache.ibatis.session.defaults.DefaultSqlSession.selectOne(DefaultSqlSession.java:59)
sun.reflect.GeneratedMethodAccessor3979.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:575)
org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:349)
$Proxy577.selectOne(Unknown Source)
org.mybatis.spring.SqlSessionTemplate.selectOne(SqlSessionTemplate.java:159)
org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:76)
org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:35)
$Proxy587.queryCurrentDayStatistics(Unknown Source)
com.defonds.core.psfp.issued.biz.issued.service.impl.ManageIssuedServiceImpl.validateCardLimit(ManageIssuedServiceImpl.java:935)
com.defonds.core.psfp.issued.biz.issued.service.impl.ManageIssuedServiceImpl.singleCardLimt(ManageIssuedServiceImpl.java:225)
com.defonds.core.psfp.issued.biz.issued.service.impl.ManageIssuedServiceImpl.txnhandleBatchIssued(ManageIssuedServiceImpl.java:108)
com.defonds.core.psfp.issued.biz.issued.service.impl.ManageIssuedServiceImpl.txnHandleBankBatchIssued(ManageIssuedServiceImpl.java:96)
com.defonds.core.psfp.issued.biz.issued.service.impl.ManageIssuedServiceImpl$$FastClassByCGLIB$$aa55b756.invoke(<generated>:???)
org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:202)
org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:694)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:95)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:88)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:602)
com.defonds.core.psfp.issued.biz.issued.service.impl.ManageIssuedServiceImpl$$EnhancerByCGLIB$$d8012971.txnHandleBankBatchIssued(<generated>:???)
com.defonds.core.psfp.issued.biz.issued.service.impl.IssuedServiceImpl.txnHandleBatchIssued(IssuedServiceImpl.java:65)
com.defonds.core.psfp.issued.biz.issued.service.impl.IssuedServiceImpl$$FastClassByCGLIB$$a4bee6fb.invoke(<generated>:???)
org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:202)
org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:694)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:95)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:88)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:602)
com.defonds.core.psfp.issued.biz.issued.service.impl.IssuedServiceImpl$$EnhancerByCGLIB$$24d19176.txnHandleBatchIssued(<generated>:???)
com.defonds.core.psfp.issued.biz.issued.resource.impl.DefaultBatchIssuedResource.issued(DefaultBatchIssuedResource.java:81)
sun.reflect.GeneratedMethodAccessor4455.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:575)
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:183)
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:185)
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:93)
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:80)
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:93)
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:72)
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1495)
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1441)
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1386)
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1377)
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:378)
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:538)
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:637)
javax.servlet.http.HttpServlet.service(HttpServlet.java:815)
weblogic.servlet.internal.StubSecurityHelper$ServletServiceAction.run(StubSecurityHelper.java:224)
weblogic.servlet.internal.StubSecurityHelper.invokeServlet(StubSecurityHelper.java:108)
weblogic.servlet.internal.ServletStubImpl.execute(ServletStubImpl.java:206)
weblogic.servlet.internal.TailFilter.doFilter(TailFilter.java:26)
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:55)
com.defonds.core.psfp.common.web.filter.LoggingProcessFilter.doFilterInternal(LoggingProcessFilter.java:31)
org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:89)
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:55)
org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:82)
org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:89)
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:55)
weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.doIt(WebAppServletContext.java:3656)
weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3650)
weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:308)
weblogic.security.service.SecurityManager.runAs(SecurityManager.java:117)
weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2204)
weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2126)
weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1412)
weblogic.work.ExecuteThread.execute(ExecuteThread.java:198)
weblogic.work.ExecuteThread.run(ExecuteThread.java:165)
}
这个 Thread-412 是第四个粘滞线程,相关业务代码如下:
com.defonds.core.psfp.issued.biz.issued.service.impl.ManageIssuedServiceImpl.validateCardLimit(ManageIssuedServiceImpl.java:935)
第 4、7、8、9、10 个粘滞线程都是这种。后续带有同样表述的粘滞线程还有二十几处。这是第二种类型的粘滞线程。
前 40 个发生粘滞的线程都是上述两种类型的粘滞线程。
3.2 问题的受害者
第 41 个粘滞线程 Thread-445:
####<Mar 25, 2017 10:14:09 AM CST> <Error> <WebLogicServer> <PSFPWEB01> <psfp_in> <[STANDBY] ExecuteThread: '214' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1490408049233> <BEA-000337> <[STUCK] ExecuteThread: '40' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "631" seconds working on the request "weblogic.servlet.internal.ServletRequestImpl@39e1c866[
POST /psfp-custom/resource/com.defonds
.core.psfp.custom.biz.cert.resource.FeeResource/callFeeBatch HTTP/1.1
Content-Type: application/json
Cache-Control: no-cache
Pragma: no-cache
User-Agent: Java/1.6.0_45
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive
Content-Length: 1948
]", which is more than the configured time (StuckThreadMaxTime) of "600" seconds. Stack trace:
Thread-445 "[STUCK] ExecuteThread: '40' for queue: 'weblogic.kernel.Default (self-tuning)'" <alive, in native, suspended, priority=1, DAEMON> {
jrockit.net.SocketNativeIO.readBytesPinned(SocketNativeIO.java:???)
jrockit.net.SocketNativeIO.socketRead(SocketNativeIO.java:24)
java.net.SocketInputStream.socketRead0(SocketInputStream.java:???)
java.net.SocketInputStream.read(SocketInputStream.java:107)
java.io.BufferedInputStream.fill(BufferedInputStream.java:189)
java.io.BufferedInputStream.read(BufferedInputStream.java:236)
^-- Holding lock: java.io.BufferedInputStream@3b0703c8[thin lock]
weblogic.net.http.MessageHeader.isHTTP(MessageHeader.java:214)
weblogic.net.http.MessageHeader.parseHeader(MessageHeader.java:141)
weblogic.net.http.HttpClient.parseHTTP(HttpClient.java:460)
weblogic.net.http.HttpURLConnection.getInputStream(HttpURLConnection.java:328)
weblogic.net.http.SOAPHttpURLConnection.getInputStream(SOAPHttpURLConnection.java:37)
^-- Holding lock: weblogic.net.http.SOAPHttpURLConnection@3de1e472[thin lock]
weblogic.net.http.HttpURLConnection.getResponseCode(HttpURLConnection.java:939)
com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:106)
com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:268)
com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:121)
$Proxy293.queryBasicPrdCfg(Unknown Source)
sun.reflect.GeneratedMethodAccessor1674.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:575)
org.springframework.remoting.caucho.HessianClientInterceptor.invoke(HessianClientInterceptor.java:215)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:152)
$Proxy294.queryBasicPrdCfg(Unknown Source)
com.defonds.core.psfp.custom.biz.cert.service.impl.FeeServiceImpl.queryPrdCfg(FeeServiceImpl.java:500)
com.defonds.core.psfp.custom.biz.cert.service.impl.FeeServiceImpl.singleFeeCal(FeeServiceImpl.java:69)
com.defonds.core.psfp.custom.biz.cert.service.impl.FeeServiceImpl$$FastClassByCGLIB$$49a7eab1.invoke(<generated>:???)
org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:202)
org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:694)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:95)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:602)
com.defonds.core.psfp.custom.biz.cert.service.impl.FeeServiceImpl$$EnhancerByCGLIB$$54a28944.singleFeeCal(<generated>:???)
com.defonds.core.psfp.custom.biz.cert.resource.impl.FeeResourceImpl.callFeeBatch(FeeResourceImpl.java:75)
sun.reflect.GeneratedMethodAccessor4112.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:575)
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:183)
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:185)
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:93)
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:80)
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:93)
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:72)
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1495)
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1441)
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1386)
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1377)
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:378)
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:538)
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:637)
javax.servlet.http.HttpServlet.service(HttpServlet.java:815)
weblogic.servlet.internal.StubSecurityHelper$ServletServiceAction.run(StubSecurityHelper.java:224)
weblogic.servlet.internal.StubSecurityHelper.invokeServlet(StubSecurityHelper.java:108)
weblogic.servlet.internal.ServletStubImpl.execute(ServletStubImpl.java:206)
weblogic.servlet.internal.TailFilter.doFilter(TailFilter.java:26)
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:55)
com.defonds.core.psfp.common.web.filter.LoggingProcessFilter.doFilterInternal(LoggingProcessFilter.java:31)
org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:89)
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:55)
org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:82)
org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:89)
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:55)
weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.doIt(WebAppServletContext.java:3656)
weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3650)
weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:308)
weblogic.security.service.SecurityManager.runAs(SecurityManager.java:117)
weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2204)
weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2126)
weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1412)
weblogic.work.ExecuteThread.execute(ExecuteThread.java:198)
weblogic.work.ExecuteThread.run(ExecuteThread.java:165)
}
这是第三类粘滞线程,相关业务代码如下:
com.defonds.core.psfp.custom.biz.cert.service.impl.FeeServiceImpl.queryPrdCfg(FeeServiceImpl.java:500)
后续粘滞线程就比较五花八门了,不再赘述,相关业务代码有以下几种:
• com.defonds.core.psfp.issued.biz.issued.remote.bacs.CaculateBatchFeeImpl.calcFeeByPsfp(CaculateBatchFeeImpl.java:259)
• com.defonds.core.psfp.inagency.biz.fee.resource.impl.FeeTradeEntryResourceImpl.feeCharge(FeeTradeEntryResourceImpl.java:45)
• com.defonds.core.psfp.inagency.biz.query.resource.impl.BankQueryEntryResourceImpl.getBankList(BankQueryEntryResourceImpl.java:45)
• com.defonds.core.psfp.query.biz.query.service.impl.BankDirectQueryServiceImpl.queryBankList(BankDirectQueryServiceImpl.java:72)
• com.defonds.core.psfp.incomes.biz.mpay.service.impl.MpayServiceImpl.validatBankCard(MpayServiceImpl.java:869)
• com.defonds.core.psfp.manage.biz.fee.service.impl.FeeChargeServiceImpl.calcFeeAmt(FeeChargeServiceImpl.java:70)
• com.defonds.core.psfp.incomes.biz.sinopay.remote.bacs.impl.MerchantChargeServiceImpl.txnCalcTradeFeeDetail(MerchantChargeServiceImpl.java:50)
这些类型的粘滞线程都是在第一类发生后的 45 分钟之后才开始陆续出现。
3.3 分析
到此几乎可以断定第三类及其之后的粘滞线程都是问题的受害者。
第一类基本可以确定为问题的制造者,第二类 (最早出现于 9:33:09) 在第一类 (最早出现于 9:29:09) 出现之后的 4 分钟后开始大量出现,可能是问题的制造者,需要更多其他证据来断言。
4. 线程栈快照
由于监控机制的不够完善,从生产问题出现到运维人员响应,中间间隔了一个多小时。
本次事故的线程栈快照采集于 Sat Mar 25 10:45:57 2017,也就是第一个粘滞线程出现后的 77 分钟后。
尽管如此,也有一定的参考性。线程栈快照统计结果如下:
- 该进程启动线程数:724 个。
- 快照时间点处于粘滞状态 (STUCK) 的线程数:276 个。
根据第 3 步得出的结论,我们只关心两种问题制造者的业务类型:
com/defonds/core/psfp/inagency/biz/issued/resource/impl/BatchIssuedResourceImpl.issued
粘滞线程中这个有两处;而第二类业务类型的没有。
因为不具备时效性 (粘滞线程开始大批量出现后的 77 分钟后才拿的这个快照),强大的线程栈快照在这里几乎没有起到任何作用。
5. AWR 报告
向 DBA 要了一下那个时间段的 AWR 报告。
5.1. CPU 利用率一般
如上图所示,CPU利用率:267.32分钟DB时间/(16核心*60.24分钟采样时间段时间) = 27.73%。
当然这个数值不代表峰值利用率。
5.2. 存在行锁
如上图"Top 5 Timed Events"所示,行锁等待占用了 24% 的 CPU 时间。有趣的是 db file scattered read 占据了 15% 的 CPU 时间。
Streams miscellaneous event 和 CPU time 占用了 70% 的 CPU 时间。
5.3. 慢查询
慢查询 top5:
第一个 id 为 66k02hafsw6pd,sql 为:
SELECT TRD_ID, TRD_DATE, TRD_MERCODEFROM, TRD_ACCCODEFROM, TRD_COUNT, TRD_TOTALAMT, TRD_TYPE FROM T_PSFP_ISSUED_STATDAY WHERE TO_CHAR(TRD_DATE, 'YYYY-MM-DD') = TO_CHAR(SYSDATE, 'YYYY-MM-DD') AND TRD_MERCODEFROM = :1 AND TRD_ACCCODEFROM = :2 FOR UPDATE
T_PSFP_ISSUED_STATDAY 表的行锁,这个和上文统计的第一类粘滞线程遥相呼应,是一个典型的悲观锁,发生在批量下发业务逻辑中。
第二个慢查询也是一个悲观锁:
SELECT TRD_ID, TRD_DATE, TRD_BANKCARD, TRD_COUNT, TRD_TOTALAMT, TRD_TYPE FROM T_PSFP_ISSUED_STATCARDDAY WHERE TO_CHAR(TRD_DATE, 'YYYY-MM-DD') = TO_CHAR(SYSDATE, 'YYYY-MM-DD') AND TRD_BANKCARD = :1 FOR UPDATE
也是一个悲观锁造成的行锁,是 T_PSFP_ISSUED_STATCARDDAY 表的。和第一个悲观锁一样也是批量下发业务相关。
第 3 - 5 个都是其它一些表的慢查询。
6. 代码分析
综上,基本可以锁定本次故障原因为第一类粘滞线程的相关业务代码所造成:
com.defonds.core.psfp.inagency.biz.issued.resource.impl.BatchIssuedResourceImpl.issued(BatchIssuedResourceImpl.java:63)
需要对此方法进行进一步分析,这个方法的某些操作可能会和悲观锁互相拖慢,可能的操作有:
- 远程调用 I/O 等待。
- 这个方法里的其它慢查询。
笔者 20170331 日加注:经过对上述方法的 code review,悲观锁的同一个事务里果然还有很多读写 db 的操作,而且竟然还有一个同步通知商户的远程操作 - 就在同一个事务内!与之前分析的结果一致。
7. 优化方向
- 硬件负载失衡的原因调查。
- 批量下发悲观锁取消 (改为其它解决方案)。
- BatchIssuedResourceImpl.issued 方法优化 (悲观锁事务未提交前的部分)。