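Background:
During Monday morning's service inspection, we found that at about 11 o'clock every instance of a production mid-platform service had crashed and gone offline.
The Tara service monitoring showed that CPU, memory, and connection pool usage were all unremarkable. For the first crash, shortly after 11:40, operations did not capture a dump; the only evidence was a Windows system error log entry, shown in the figure below.
The service crashed a second time at around 12:02, and this time operations captured a dump, so I took the dump file and started digging: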
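Dump analysis
- Analysis tool: WinDbg Preview
- .loadby sos clr loads the SOS debugging extension that matches the loaded CLR
- !threadpool shows the CPU utilization and thread pool state at the time the dump was taken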
||0:0:000> !threadpool
CPU utilization: 26%
Worker Thread: Total: 84 Running: 0 Idle: 84 MaxLimit: 32767 MinLimit: 500
Work Request in Queue: 0
--------------------------------------
Number of Timers: 2
--------------------------------------
Completion Port Thread:Total: 26 Free: 17 MaxFree: 16 CurrentLimit: 20 MaxLimit: 1000 MinLimit: 1000
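- !threads lists the managed threads in the process, including how many are background threads, along with each thread's state, application domain, and last exception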
0:000> !threads
ThreadCount: 190
UnstartedThread: 0
BackgroundThread: 172
PendingThread: 0
DeadThread: 8
Hosted Runtime: no
Lock
ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
0 1 10d4 000001d6674ca230 2a020 Preemptive 0000000000000000:0000000000000000 000001d6674bd150 0 MTA
10 2 51c 000001daf364eff0 2b220 Preemptive 0000000000000000:0000000000000000 000001d6674bd150 0 MTA (Finalizer)
12 4 2980 000001daf39c5080 1029220 Preemptive 000001D6EF6D2DA0:000001D6EF6D4D38 000001d6674bd150 0 MTA (Threadpool Worker)
13 5 ac8 000001daf39c6170 1029220 Preemptive 000001D96EEC5C20:000001D96EEC6178 000001d6674bd150 0 MTA (Threadpool Worker)
14 6 1e14 000001daf3a19590 1029220 Preemptive 000001D77372A640:000001D77372B010 000001d6674bd150 0 MTA (Threadpool Worker)
15 7 2278 000001daf3a83d20 202b220 Preemptive 0000000000000000:0000000000000000 000001d6674bd150 0 MTA
16 8 2b80 000001daf3a7c540 202b220 Preemptive 0000000000000000:0000000000000000 000001d6674bd150 0 MTA
17 9 2a3c 000001daf3a7cd10 102a220 Preemptive 0000000000000000:0000000000000000 000001d6674bd150 0 MTA (Threadpool Worker)
185 181 27a4 000001dafd751a50 8029220 Preemptive 0000000000000000:0000000000000000 000001daf3a66e90 0 MTA (Threadpool Completion Port) System.StackOverflowException 000001d8e7c11158
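Thread 185 shows a System.StackOverflowException. In .NET Framework 2.0 and later a stack overflow cannot be caught with try/catch and terminates the process by default, which explains why the whole service went down instead of a single request failing.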
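With 190 threads in the listing, the row that carries an exception is easy to miss. The following is a minimal sketch, not part of the original investigation, that scans saved `!threads` output for rows ending in an exception type and object address. The function name `threads_with_exceptions` and the regular expression are illustrative assumptions based on the column layout shown above.

```python
import re

# Assumed row layout (taken from the !threads output above):
#   <dbg id> <managed id> <OSID> ... <Apt> [(role)] <ExceptionType> <object address>
ROW = re.compile(
    r"^\s*(\d+)\s+\d+\s+([0-9a-f]+)\s.*?(\S+Exception)\s+([0-9a-f]{16})\s*$",
    re.IGNORECASE,
)


def threads_with_exceptions(output):
    """Return (debugger thread id, OSID, exception type, object address) tuples."""
    hits = []
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            hits.append(
                (int(match.group(1)), match.group(2), match.group(3), match.group(4))
            )
    return hits


# Two rows copied from the listing above: one healthy worker, one faulting thread.
SAMPLE = """\
12 4 2980 000001daf39c5080 1029220 Preemptive 000001D6EF6D2DA0:000001D6EF6D4D38 000001d6674bd150 0 MTA (Threadpool Worker)
185 181 27a4 000001dafd751a50 8029220 Preemptive 0000000000000000:0000000000000000 000001daf3a66e90 0 MTA (Threadpool Completion Port) System.StackOverflowException 000001d8e7c11158
"""

for hit in threads_with_exceptions(SAMPLE):
    print(hit)
```

The first element of each tuple is the debugger thread number, which is the value to use with the `~<n>s` thread-switch command.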
- !analyze -v analyzes the most recent exception event; the -v switch prints verbose details
0:000> !analyze -v
DBGHELP: D:\Dump\symbols\SOS_AMD64_AMD64_4.7.3701.00.dll\5F4FF3579ec000\SOS_AMD64_AMD64_4.7.3701.00.dll - OK
DBGHELP: D:\Dump\symbols\clr.dll\5F4FF3579ec000\clr.dll - OK
MethodDesc: 00007ffd3cfec9d8
Method Name: HtmlAgilityPack.HtmlNodeCollection.System.Collections.Generic.IEnumerable<HtmlAgilityPack.HtmlNode>.GetEnumerator()
Class: 00007ffd3cff3d68
MethodTable: 00007ffd3cfecb00
mdToken: 00000000060000eb
Module: 00007ffd386f2d30
IsJitted: yes
CodeAddr: 00007ffd3cd288e0
Transparency: Transparent
MethodDesc: 00007ffd3cfe65d8
Method Name: HtmlAgilityPack.HtmlNode.CloseNode(HtmlAgilityPack.HtmlNode)
Class: 00007ffd3cff0f58
MethodTable: 00007ffd3cfe67a0
MethodDesc: 00007ffd3cfe65d8
Method Name: HtmlAgilityPack.HtmlNode.CloseNode(HtmlAgilityPack.HtmlNode)
Class: 00007ffd3cff0f58
MethodTable: 00007ffd3cfe67a0
mdToken: 000000000600007d
Module: 00007ffd386f2d30
IsJitted: yes
CodeAddr: 00007ffd3cd28580
Transparency: Transparent
MethodDesc: 00007ffd3cfe5b00
Method Name: HtmlAgilityPack.HtmlDocument.CloseCurrentNode()
Class: 00007ffd3cff0e40
MethodTable: 00007ffd3cfe5dc0
mdToken: 00000000060000c4
Module: 00007ffd386f2d30
IsJitted: yes
CodeAddr: 00007ffd3cd265f0
Transparency: Transparent
MethodDesc: 00007ffd3cfe5c00
Method Name: HtmlAgilityPack.HtmlDocument.PushNodeEnd(Int32, Boolean)
Class: 00007ffd3cff0e40
MethodTable: 00007ffd3cfe5dc0
mdToken: 00000000060000d4
Module: 00007ffd386f2d30
IsJitted: yes
CodeAddr: 00007ffd3cd262e0
Transparency: Transparent
MethodDesc: 00007ffd3cfe5ba0
Method Name: HtmlAgilityPack.HtmlDocument.Parse()
Class: 00007ffd3cff0e40
MethodTable: 00007ffd3cfe5dc0
mdToken: 00000000060000ce
Module: 00007ffd386f2d30
IsJitted: yes
CodeAddr: 00007ffd3cd25370
Transparency: Transparent
MethodDesc: 00007ffd3cfe5a10
Method Name: HtmlAgilityPack.HtmlDocument.Load(System.IO.TextReader)
Class: 00007ffd3cff0e40
MethodTable: 00007ffd3cfe5dc0
mdToken: 00000000060000b5
Module: 00007ffd386f2d30
IsJitted: yes
CodeAddr: 00007ffd3cd24a30
Transparency: Transparent
MethodDesc: 00007ffd3cfe5a20
Method Name: HtmlAgilityPack.HtmlDocument.LoadHtml(System.String)
Class: 00007ffd3cff0e40
MethodTable: 00007ffd3cfe5dc0
mdToken: 00000000060000b6
Module: 00007ffd386f2d30
IsJitted: yes
CodeAddr: 00007ffd3cd24910
Transparency: Transparent
MethodDesc: 00007ffd3a5a17b8
Method Name: xxx.ServiceImp.ButtonListTriggerProvider.IsDocument(Int32, System.String, Int32 ByRef)
Class: 00007ffd3a595498
MethodTable: 00007ffd3a5a1988
mdToken: 00000000060008af
Module: 00007ffd37fa9760
IsJitted: yes
CodeAddr: 00007ffd3cd23340
Transparency: Critical
MethodDesc: 00007ffd3a5a17a0
Method Name: xxx.ButtonListTriggerProvider.IsShowButtons(Int32, Int32, System.Guid, System.String)
Class: 00007ffd3a595498
MethodTable: 00007ffd3a5a1988
mdToken: 00000000060008ad
Module: 00007ffd37fa9760
IsJitted: yes
CodeAddr: 00007ffd3c644510
Transparency: Critical
MethodDesc: 00007ffd38f54648
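The output shows that a method in a third-party package, HtmlAgilityPack.HtmlNode.CloseNode, caused the stack overflow, and that the entry point in our own code is the IsShowButtons method.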
- ~185s switches to thread 185
- !dso (short for !dumpstackobjects) lists all managed objects on the current thread's stack
000000CA2CA7D190 000001d6e9b3a5d8 System.String IsShowButtons:校验权限通过
000000CA2CA7D198 000001d867c96b50 System.String
000000CA2CA7D1A0 000001d76a4c12e0 System.String Show_Button_By_ChannelName
000000CA2CA7D1B0 000001d6680ef500 xxx.ServiceImp.ButtonListTriggerProvider
000000CA2CA7D1C8 000001d76a4c12a0 System.Func`2[[xxx.ServiceInterface.DTO.ApplicantIntegration.CheckBlackListResultDTO, xxx.ServiceInterface],[System.Boolean, mscorlib]]
000000CA2CA7D210 000001d6e9b3a540 System.String IsShowButtons
000000CA2CA7D218 000001d6eb1888a0 <>f__AnonymousType14`4[[System.Int32, mscorlib],[System.Int32, mscorlib],[System.Guid, mscorlib],[System.String, mscorlib]]
000000CA2CA7D228 000001d767cf80a8 System.Object[] (System.Object[])
000000CA2CA7D240 000001d6e9b3a540 System.String IsShowTeButtons
000000CA2CA7D248 000001d6eb1888a0 <>f__AnonymousType14`4[[System.Int32, mscorlib],[System.Int32, mscorlib],[System.Guid, mscorlib],[System.String, mscorlib]]
000000CA2CA7D258 000001d767cf80a8 System.Object[] (System.Object[])
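The anonymous-type object at 000001d6eb1888a0 has the same shape as the parameter list of the method in the source code (Int32, Int32, Guid, String), so inspect its fields: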
0:185> !do /d 000001d6eb1888a0
Name: <>f__AnonymousType14`4[[System.Int32, mscorlib],[System.Int32, mscorlib],[System.Guid, mscorlib],[System.String, mscorlib]]
MethodTable: 00007ffd3cb557c0
EEClass: 00007ffd3cb05ca0
Size: 48(0x30) bytes
File: C:\Windows\system32\config\systemprofile\AppData\Local\assembly\dl3\A4W1852V.V6G\94HV9VM9.8EZ\6e6327dc\0063dc29_97fcd801\xxx.ServiceImp.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ffd9330c148 4000029 10 System.Int32 1 instance 606939 i__Field
00007ffd9330c148 400002a 14 System.Int32 1 instance 606418892 i__Field
00007ffd932f4840 400002b 18 System.Guid 1 instance 000001d6eb1888b8 i__Field
00007ffd9330e2b8 400002c 8 System.__Canon 0 instance 000001d6eb187908 i__Field
- Inspect the string argument: !do /d 000001d6eb187908
0:185> !do /d 000001d6eb187908
Free Object
Size: 12544(0x3100) bytes
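SOS reports the address as a Free Object and shows no further details, so dump the raw memory instead:
- db 000001d6eb187908 L1000 displays the raw bytes with their ASCII rendering (the /d switch typed below is not a valid db option, hence the warning in the output)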
0:185> db /d 000001d6eb187908 L1000
Unknown option 'd'
000001d6`eb187908 10 a5 49 67 d6 01 00 00-e8 30 00 00 00 00 00 00 ..Ig.....0......
000001d6`eb187918 d0 48 1f eb d6 01 00 00-52 00 65 00 73 00 75 00 .H......R.e.s.u.
000001d6`eb187928 6d 00 65 00 2f 00 36 00-30 00 36 00 39 00 33 00 m.e./.6.0.6.9.3.
000001d6`eb187938 39 00 2f 00 31 00 36 00-36 00 34 00 31 00 38 00 9./.1.6.6.4.1.8.
000001d6`eb187948 32 00 34 00 37 00 34 00-2f 00 39 00 33 00 33 00 2.4.7.4./.9.3.3.
000001d6`eb187958 34 00 63 00 64 00 36 00-32 00 38 00 65 00 34 00 4.c.d.6.2.8.e.4.
000001d6`eb187968 30 00 34 00 36 00 32 00-31 00 39 00 65 00 62 00 0.4.6.2.1.9.e.b.
000001d6`eb187978 34 00 65 00 31 00 62 00-35 00 65 00 33 00 38 00 4.e.1.b.5.e.3.8.
000001d6`eb187988 65 00 39 00 34 00 35 00-32 00 2e 00 70 00 64 00 e.9.4.5.2...p.d.
000001d6`eb187998 66 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 f...............
000001d6`eb1879a8 10 c6 b9 38 fd 7f 00 00-00 00 00 00 00 00 00 00 ...8............
000001d6`eb1879b8 1b b3 73 64 1f 1b 00 00-01 00 00 00 00 00 00 00 ..sd............
000001d6`eb1879c8 00 00 00 00 00 00 00 00-e0 53 b5 3c fd 7f 00 00 .........S.<....
000001d6`eb1879d8 18 7a 18 eb d6 01 00 00-f8 d1 af ec d7 01 00 00 .z..............
000001d6`eb1879e8 78 dc af ec d7 01 00 00-db 42 09 00 cc 37 25 24 x........B...7%$
000001d6`eb1879f8 01 01 00 00 00 00 00 00-30 ac bb b2 96 3c 94 43 ........0....<.C
000001d6`eb187a08 a5 8a e1 ed 97 3d 98 0d-00 00 00 00 00 00 00 00 .....=..........
000001d6`eb187a18 f8 97 30 93 fd 7f 00 00-35 07 00 00 7b 00 22 00 ..0.....5...{.".
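This reveals the path of the PDF source file that triggered the stack overflow:
606939/1664182474/9334cd628e4046219eb4e1b5e38e9452.pdf. Inspecting the source file showed that it is a PDF of more than 40 MB.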
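The path is readable by eye in the ASCII column of the db output because .NET strings are stored as UTF-16LE, so every character is followed by a zero byte. The following is a minimal sketch, not part of the original investigation, of recovering such text programmatically from pasted db lines. The helper names `parse_db_lines` and `decode_utf16_run` are illustrative assumptions, and the offset of 8 simply skips the non-text bytes at the start of the first pasted line.

```python
import re

# One db row: address, then 16 hex bytes (a '-' separates the two halves).
DB_ROW = re.compile(
    r"^[0-9a-f`]+\s+((?:[0-9a-f]{2}[ -]){15}[0-9a-f]{2})", re.IGNORECASE
)


def parse_db_lines(text):
    """Concatenate the hex byte columns of pasted db output into bytes."""
    raw = bytearray()
    for line in text.splitlines():
        match = DB_ROW.match(line.strip())
        if match:
            raw += bytes.fromhex(match.group(1).replace("-", " "))
    return bytes(raw)


def decode_utf16_run(raw, offset=0):
    """Decode UTF-16LE text from offset up to a 00 00 terminator or end of data."""
    end = offset
    while end + 1 < len(raw) and raw[end:end + 2] != b"\x00\x00":
        end += 2
    return raw[offset:end].decode("utf-16-le")


# Three rows copied from the dump above.
DUMP = """\
000001d6`eb187918 d0 48 1f eb d6 01 00 00-52 00 65 00 73 00 75 00 .H......R.e.s.u.
000001d6`eb187928 6d 00 65 00 2f 00 36 00-30 00 36 00 39 00 33 00 m.e./.6.0.6.9.3.
000001d6`eb187938 39 00 2f 00 31 00 36 00-36 00 34 00 31 00 38 00 9./.1.6.6.4.1.8.
"""

print(decode_utf16_run(parse_db_lines(DUMP), offset=8))  # Resume/606939/166418
```

Feeding in the remaining rows of the dump would extend the decoded text through the rest of the file name.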
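Root cause analysis identified two causes:
- The file was not a valid HTML document at all: the code performed no check and read the contents of the 40 MB PDF as if they were HTML.
- The third-party NuGet package HtmlAgilityPack has this stack overflow defect in older versions; it is fixed in newer versions. Third-party packages should therefore be introduced with care and go through the company's architecture review.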
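The first cause can be guarded against with a cheap pre-check before any content is handed to an HTML parser. The following is a minimal sketch of that idea in Python and is not the fix applied to the service, which is .NET code and would perform the equivalent check before calling HtmlDocument.LoadHtml. The function name `should_parse_as_html` and the 2 MB limit are hypothetical. The `%PDF-` signature is the standard header that PDF files begin with.

```python
PDF_MAGIC = b"%PDF-"                # standard PDF file header
MAX_HTML_BYTES = 2 * 1024 * 1024    # hypothetical upper bound, tune per service


def should_parse_as_html(data, max_bytes=MAX_HTML_BYTES):
    """Cheap guard: refuse oversized payloads and anything that starts like a PDF."""
    if len(data) > max_bytes:
        return False
    # Tolerate leading whitespace before the signature check.
    if data.lstrip().startswith(PDF_MAGIC):
        return False
    return True


print(should_parse_as_html(b"<html><body>ok</body></html>"))
print(should_parse_as_html(b"%PDF-1.7\n1 0 obj\n"))
```

A size limit alone would have rejected the 40 MB file in this incident; the signature check additionally catches small PDFs that are mislabeled as HTML.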