背景:用户业务虚拟机遇到不明原因导致操作系统重启,引起业务中断
需求:要求排查具体原因,定位问题根源
先来查看虚拟机的事件,事件发生时间: 13:37:21
再到虚拟机对应的宿主机查看相关日志,宿主机日志看到的时间需+8才能与VCenter上的事件时间对应得上,因此我们过滤05:37分左右的日志
VMKernel日志:
hostd日志:
结合vmkernel和hostd的日志基本只能看到操作系统被重启了,并没有看到直接导致虚拟机重启的原因,因此我们将注意力放在虚拟机操作系统内进行排错
在/var/crash目录下,能看到自动生成了vmcore日志文件,由于这是存在于生产业务机器上,因此我们将日志拷贝到测试环境
在这里我找了一台跟生产业务机一致内核版本的机器用于分析
[root@demo ~]# cd /opt/ # 安装与生产业务机一致内核版本的kernel-debug工具 [root@demo opt]# wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-3.10.0-957.el7.x86_64.rpm [root@demo opt]# mkdir test [root@demo opt]# mv kernel-debuginfo-3.10.0-957.el7.x86_64.rpm test/ [root@demo opt]# cd test # rpm2cpio将rpm包转换为cpio,再由cpio提取其内容保存在当前目录下 [root@demo test]# rpm2cpio kernel-debuginfo-3.10.0-957.el7.x86_64.rpm | cpio -idmv # 查看当前目录树结构,验证rpm已被提取保存在当前目录下 [root@demo test]# tree .|head . ├── kernel-debuginfo-3.10.0-957.el7.x86_64.rpm └── usr └── lib └── debug ├── lib │ └── modules │ └── 3.10.0-957.el7.x86_64 │ ├── kernel │ │ ├── arch # 查找rpm内的vmlinux文件路径,将其复制到/opt/下,方便接下来的操作 [root@demo test]# find ./ -name "vmlinux" [root@demo test]# cp ./usr/lib/debug/lib/modules/3.10.0-957.el7.x86_64/vmlinux /opt/
前期工作准备完成,接下来开始使用crash工具分析core日志
[root@demo test]# crash /opt/vmlinux /root/vmcore crash 7.2.3-8.el7 Copyright (C) 2002-2017 Red Hat, Inc. Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation Copyright (C) 1999-2006 Hewlett-Packard Co Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited Copyright (C) 2006, 2007 VA Linux Systems Japan K.K. Copyright (C) 2005, 2011 NEC Corporation Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc. Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc. This program is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Enter "help copying" to see the conditions. This program has absolutely no warranty. Enter "help warranty" for details. GNU gdb (GDB) 7.6 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-unknown-linux-gnu"... WARNING: kernel relocated [710MB]: patching 85605 gdb minimal_symbol values KERNEL: /opt/vmlinux DUMPFILE: /root/vmcore [PARTIAL DUMP] CPUS: 8 DATE: Thu Jul 18 13:37:21 2024 UPTIME: 786 days, 20:01:15 LOAD AVERAGE: 1.36, 1.38, 1.45 TASKS: 720 NODENAME: 10-41-16-16 RELEASE: 3.10.0-957.el7.x86_64 VERSION: #1 SMP Thu Nov 8 23:39:32 UTC 2018 MACHINE: x86_64 (2095 Mhz) MEMORY: 8 GB PANIC: "general protection fault: 0000 [#1] SMP " PID: 56383 COMMAND: "XXXX" TASK: ffff98c1fcfbc100 [THREAD_INFO: ffff98c1d3bb8000] CPU: 1 STATE: TASK_RUNNING (PANIC) crash> bt PID: 56383 TASK: ffff98c1fcfbc100 CPU: 1 COMMAND: "XXXX" #0 [ffff98c3f5643aa8] machine_kexec at ffffffffad663674 #1 [ffff98c3f5643b08] __crash_kexec at ffffffffad71ce12 #2 [ffff98c3f5643bd8] crash_kexec at ffffffffad71cf00 #3 [ffff98c3f5643bf0] oops_end at ffffffffadd6c758 #4 [ffff98c3f5643c18] die at ffffffffad62f95b #5 [ffff98c3f5643c48] do_general_protection at ffffffffadd6c052 #6 [ffff98c3f5643c80] general_protection at ffffffffadd6b6f8 [exception RIP: kmem_cache_alloc+116] RIP: ffffffffad81bcf4 RSP: ffff98c3f5643d30 RFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000780 RCX: 0000002447835d81 RDX: 0000002447835d80 RSI: 0000000000000020 RDI: ffff98c33fc07700 RBP: ffff98c3f5643d60 R8: 000000000001f120 R9: ffffffffadc279da R10: ffff98c3e6e60f78 R11: ffff98c1f534c8c0 R12: 746e657645534620 R13: 0000000000000020 R14: ffff98c33fc07700 R15: ffff98c33fc07700 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff98c3f5643d68] __build_skb at ffffffffadc279da #8 [ffff98c3f5643d88] __netdev_alloc_skb at ffffffffadc27cc9 #9 [ffff98c3f5643dc0] vmxnet3_rq_rx_complete at ffffffffc036b5f6 [vmxnet3] #10 [ffff98c3f5643e48] vmxnet3_poll_rx_only at ffffffffc036c0f6 [vmxnet3] #11 [ffff98c3f5643e78] net_rx_action at ffffffffadc39e9f #12 [ffff98c3f5643ef8] __do_softirq at ffffffffad6a0f05 #13 [ffff98c3f5643f68] call_softirq at ffffffffadd7832c #14 [ffff98c3f5643f80] do_softirq at ffffffffad62e675 #15 [ffff98c3f5643fa0] irq_exit at ffffffffad6a1285 #16 [ffff98c3f5643fb8] do_IRQ at ffffffffadd795e6 --- <IRQ stack> --- #17 [ffff98c1d3bbb208] ret_from_intr at ffffffffadd6b362 [exception RIP: extract_dns_request+56] RIP: ffffffffc15ce188 RSP: ffff98c1d3bbb2b8 RFLAGS: 00000217 RAX: 00000000003e3016 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 00000000003e3015 RSI: ffff98c1d3bbb330 RDI: 00000000ffffff8f RBP: ffff98c1d3bbb2b8 R8: 0000000000000000 R9: ffff98c3d4efde38 R10: 00000000ffffff8f R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffcc CS: 0010 SS: 0018 #18 [ffff98c1d3bbb2c0] handle_dns_query at ffffffffc15cebea [uniedr_edr] RIP: 10ffff98c3f75298 RSP: 0000000000000000 RFLAGS: 01000009 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff00000000000000 RDX: 000000ffff0000ff RSI: 0000000000000000 RDI: 4500000000000000 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 000000081e000000 R13: 0000000000000000 R14: 0000000000000000 R15: 3500000000000000 ORIG_RAX: 00000006c0000001 CS: ffff98c3f75299 SS: 4000000000000000 bt: WARNING: possibly bogus exception frame # 以上信息可以看到 操作系统奔溃类型为 PANIC: "general protection fault: 0000 [#1] SMP " # 奔溃时运行的任务ID为:ffff98c1fcfbc100
接着再根据任务ID查看更详细的信息
crash> bt -t ffff98c1fcfbc100 PID: 56383 TASK: ffff98c1fcfbc100 CPU: 1 COMMAND: "XXXXX" START: machine_kexec at ffffffffad663674 [ffff98c3f5643aa8] machine_kexec at ffffffffad663674 [ffff98c3f5643b08] __crash_kexec at ffffffffad71ce12 [ffff98c3f5643b50] __build_skb at ffffffffadc279da [ffff98c3f5643b90] kmem_cache_alloc at ffffffffad81bcf4 [ffff98c3f5643bd8] crash_kexec at ffffffffad71cf00 [ffff98c3f5643bf0] oops_end at ffffffffadd6c758 [ffff98c3f5643c18] die at ffffffffad62f95b [ffff98c3f5643c48] do_general_protection at ffffffffadd6c052 [ffff98c3f5643c80] general_protection at ffffffffadd6b6f8 [ffff98c3f5643cc8] __build_skb at ffffffffadc279da [ffff98c3f5643d08] kmem_cache_alloc at ffffffffad81bcf4 [ffff98c3f5643d68] __build_skb at ffffffffadc279da [ffff98c3f5643d88] __netdev_alloc_skb at ffffffffadc27cc9 [ffff98c3f5643dc0] vmxnet3_rq_rx_complete at ffffffffc036b5f6 [vmxnet3] [ffff98c3f5643e48] vmxnet3_poll_rx_only at ffffffffc036c0f6 [vmxnet3] [ffff98c3f5643e78] net_rx_action at ffffffffadc39e9f [ffff98c3f5643ef8] __do_softirq at ffffffffad6a0f05 [ffff98c3f5643f68] call_softirq at ffffffffadd7832c [ffff98c3f5643f80] do_softirq at ffffffffad62e675 [ffff98c3f5643fa0] irq_exit at ffffffffad6a1285 [ffff98c3f5643fb8] do_IRQ at ffffffffadd795e6 [ffff98c3f5643ff0] ret_from_intr at ffffffffadd6b362 --- <IRQ stack> --- [ffff98c1d3bbb208] ret_from_intr at ffffffffadd6b362 [exception RIP: extract_dns_request+56] RIP: ffffffffc15ce188 RSP: ffff98c1d3bbb2b8 RFLAGS: 00000217 RAX: 00000000003e3016 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 00000000003e3015 RSI: ffff98c1d3bbb330 RDI: 00000000ffffff8f RBP: ffff98c1d3bbb2b8 R8: 0000000000000000 R9: ffff98c3d4efde38 R10: 00000000ffffff8f R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffcc CS: 0010 SS: 0018 [ffff98c1d3bbb2c0] handle_dns_query at ffffffffc15cebea [uniedr_edr] RIP: 10ffff98c3f75298 RSP: 0000000000000000 RFLAGS: 01000009 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff00000000000000 RDX: 000000ffff0000ff RSI: 0000000000000000 RDI: 4500000000000000 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 000000081e000000 R13: 0000000000000000 R14: 0000000000000000 R15: 3500000000000000 ORIG_RAX: 00000006c0000001 CS: ffff98c3f75299 SS: 4000000000000000 bt: WARNING: possibly bogus exception frame
总结:
从 START项往下查看,可以看到由machine_kexec调用ffffffffad663674地址,接着发生一系列调用操作,直到触发crash_kexec(系统奔溃函数)后出现保护操作general_protection(与前面看到的奔溃类型一致 ---> PANIC: "general protection fault: 0000 [#1] SMP ")即可确认该日志报错信息与实际奔溃类型一致,而引起此次操作系统重启的原因是:
COMMAND: "XXXX" 命令导致的(这里的XXXX是笔者人为替换的),而该命令正是操作系统内的应用引起的,因此可以将注意力集中在应用本身进行排查
标签:__,crash,虚拟机,0000000000000000,64,内核,RAX,root,VMware From: https://www.cnblogs.com/Ky150/p/18310569