
Troubleshooting a GPU hardware fault on the lab's deep learning server

Date: 2022-11-11 13:46:12
Tags: kernel, 00, Xid, dell, troubleshooting, PCI, server, GPU, NVRM

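The lab suddenly notified me that the deep learning server could no longer see its GPU, and that programs running on the GPU had hung, so the problem needed fixing. I checked the server's system log and found the following: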

 

Nov 10 01:33:23 dell kernel: [3238114.018736] NVRM: Xid (PCI:0000:b1:00): 43, pid=45948, Ch 00000008
Nov 10 01:38:12 dell kernel: [3238403.448442] NVRM: Xid (PCI:0000:b1:00): 43, pid=51064, Ch 00000008
Nov 10 01:39:11 dell kernel: [3238462.127610] NVRM: Xid (PCI:0000:b1:00): 62, pid=51064, 21b3(31c4) 00000000 00000000
Nov 10 01:43:32 dell kernel: [3238722.985986] NVRM: Xid (PCI:0000:b1:00): 45, pid=3300, Ch 00000000
Nov 10 01:43:32 dell kernel: [3238722.988964] NVRM: Xid (PCI:0000:b1:00): 45, pid=3300, Ch 00000001
Nov 10 01:43:32 dell kernel: [3238722.991786] NVRM: Xid (PCI:0000:b1:00): 45, pid=1544, Ch 00000002
Nov 10 01:43:32 dell kernel: [3238722.993928] NVRM: Xid (PCI:0000:b1:00): 45, pid=1544, Ch 00000003
Nov 10 01:43:32 dell kernel: [3238722.995701] NVRM: Xid (PCI:0000:b1:00): 45, pid=1544, Ch 00000004
Nov 10 01:43:32 dell kernel: [3238722.997629] NVRM: Xid (PCI:0000:b1:00): 45, pid=1544, Ch 00000005
Nov 10 01:43:32 dell kernel: [3238722.999373] NVRM: Xid (PCI:0000:b1:00): 45, pid=1544, Ch 00000006
Nov 10 01:43:32 dell kernel: [3238723.001108] NVRM: Xid (PCI:0000:b1:00): 45, pid=1544, Ch 00000007
Nov 10 01:43:32 dell kernel: [3238723.002705] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 00000008
Nov 10 01:43:32 dell kernel: [3238723.504007] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 00000009
Nov 10 01:43:32 dell kernel: [3238723.505675] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 0000000a
Nov 10 01:43:32 dell kernel: [3238723.507158] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 0000000b
Nov 10 01:43:32 dell kernel: [3238723.508527] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 0000000c
Nov 10 01:43:32 dell kernel: [3238723.509823] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 0000000d
Nov 10 01:43:32 dell kernel: [3238723.511155] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 0000000e
Nov 10 01:43:32 dell kernel: [3238723.512501] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 0000000f
Nov 10 01:43:32 dell kernel: [3238723.513788] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 00000010
Nov 10 01:43:32 dell kernel: [3238723.515211] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 00000011
Nov 10 01:43:32 dell kernel: [3238723.516537] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 00000012
Nov 10 01:43:32 dell kernel: [3238723.517836] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 00000013
Nov 10 01:43:32 dell kernel: [3238723.519163] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 00000014
Nov 10 01:43:32 dell kernel: [3238723.520567] NVRM: Xid (PCI:0000:b1:00): 45, pid=55094, Ch 00000015
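
With this many near-identical lines, it helps to tally the events per device and Xid code before reading the documentation. The following is a minimal sketch, not part of the original investigation: the regular expression assumes the exact line layout shown above (`Xid (PCI:...): <code>, pid=<number>`), and other driver versions may format the `pid` field differently. The names `XID_RE` and `parse_xid_lines` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the log excerpt above:
#   ... NVRM: Xid (PCI:0000:b1:00): 43, pid=45948, Ch 00000008
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)"
)

def parse_xid_lines(lines):
    """Yield (pci_address, xid_code, pid) for each line that matches XID_RE."""
    for line in lines:
        m = XID_RE.search(line)
        if m:
            yield m.group("pci"), int(m.group("xid")), int(m.group("pid"))

# A few lines copied from the log above; in practice, read the kernel log file.
sample = [
    "Nov 10 01:33:23 dell kernel: [3238114.018736] NVRM: Xid (PCI:0000:b1:00): 43, pid=45948, Ch 00000008",
    "Nov 10 01:39:11 dell kernel: [3238462.127610] NVRM: Xid (PCI:0000:b1:00): 62, pid=51064, 21b3(31c4) 00000000 00000000",
    "Nov 10 01:43:32 dell kernel: [3238722.985986] NVRM: Xid (PCI:0000:b1:00): 45, pid=3300, Ch 00000000",
    "Nov 10 01:43:32 dell kernel: [3238722.988964] NVRM: Xid (PCI:0000:b1:00): 45, pid=3300, Ch 00000001",
]

# Count events per (device, Xid code) pair.
counts = Counter((pci, xid) for pci, xid, _pid in parse_xid_lines(sample))
for (pci, xid), n in sorted(counts.items()):
    print(f"{pci}  Xid {xid}: {n} event(s)")
```

On the full excerpt this would show one device (PCI:0000:b1:00) reporting Xid 43 twice, Xid 62 once, and then a burst of Xid 45 events across many channels.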

 

I consulted NVIDIA's official Xid documentation:

https://docs.nvidia.com/deploy/xid-errors/index.html

 

According to the documentation, the error that starts the sequence (Xid 43) is most likely an application problem.
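
To keep the three codes from the log straight while reading, a small lookup table can help. This is only a sketch: the one-line summaries are paraphrased from memory of the NVIDIA Xid table and should be checked against the linked page for the installed driver version. `XID_SUMMARY` and `describe_xid` are hypothetical names.

```python
# Summaries paraphrased from NVIDIA's Xid documentation (assumption: wording and
# meaning may differ between driver versions; verify against the linked page).
XID_SUMMARY = {
    43: "GPU stopped processing",
    45: "Preemptive cleanup, due to previous errors",
    62: "Internal micro-controller halt",
}

def describe_xid(code):
    """Return a short summary for a Xid code, or a pointer to the docs."""
    return XID_SUMMARY.get(code, "not in this table; see the NVIDIA Xid documentation")

# Codes in the order they first appear in the log excerpt.
for code in (43, 62, 45):
    print(f"Xid {code}: {describe_xid(code)}")
```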

 

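Looking further back in the log, I found similar entries (also Xid 43, though reported for PCI:0000:d9:00 rather than b1:00):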

Oct 25 11:46:44 dell kernel: [1892628.496902] NVRM: Xid (PCI:0000:d9:00): 43, pid=34973, Ch 00000008
Oct 28 08:02:50 dell kernel: [2138374.168198] NVRM: Xid (PCI:0000:d9:00): 43, pid=79247, Ch 00000008

Clearly, similar errors had already occurred before, so at this point my judgment was still that the application was causing them.
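
When comparing old and new entries, it is worth checking whether they come from the same device. The sketch below groups Xid codes by PCI address. It assumes the same log layout as the excerpts in this post, and `PCI_XID_RE` and `xids_by_device` are illustrative names, not part of any NVIDIA tool.

```python
import re
from collections import defaultdict

# Extract the PCI address and the Xid code from an NVRM line (assumed layout).
PCI_XID_RE = re.compile(r"Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")

def xids_by_device(lines):
    """Map each PCI address to the set of Xid codes logged for it."""
    devices = defaultdict(set)
    for line in lines:
        m = PCI_XID_RE.search(line)
        if m:
            devices[m.group(1)].add(int(m.group(2)))
    return dict(devices)

# Lines copied from the October and November excerpts in this post.
lines = [
    "Oct 25 11:46:44 dell kernel: [1892628.496902] NVRM: Xid (PCI:0000:d9:00): 43, pid=34973, Ch 00000008",
    "Oct 28 08:02:50 dell kernel: [2138374.168198] NVRM: Xid (PCI:0000:d9:00): 43, pid=79247, Ch 00000008",
    "Nov 10 01:33:23 dell kernel: [3238114.018736] NVRM: Xid (PCI:0000:b1:00): 43, pid=45948, Ch 00000008",
    "Nov 10 01:39:11 dell kernel: [3238462.127610] NVRM: Xid (PCI:0000:b1:00): 62, pid=51064, 21b3(31c4) 00000000 00000000",
]

result = xids_by_device(lines)
for pci, codes in sorted(result.items()):
    print(pci, sorted(codes))
```

For these lines, the earlier Xid 43 events belong to PCI:0000:d9:00, while the Xid 62 event appears only on PCI:0000:b1:00.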

 

From: https://www.cnblogs.com/devilmaycry812839668/p/16880232.html
