
Kernel-Level Troubleshooting of an Unexpected VMware Virtual Machine Reboot

Published: 2024-07-18 23:08:26

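Background: A customer's production virtual machine rebooted for an unknown reason, which interrupted the service.

Requirement: Identify the specific cause and locate the root of the problem.

First, check the virtual machine's events. The event occurred at 13:37:21.

Next, check the logs on the ESXi host that runs the virtual machine. Host log timestamps are in UTC, so 8 hours must be added to match the event time shown in vCenter (UTC+8). We therefore filter the logs around 05:37.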
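The time-zone arithmetic above can be checked with a small sketch. It assumes GNU date is available and that the vCenter client displays times in UTC+8; to_utc is a hypothetical helper name, not part of any VMware tooling.

```shell
#!/usr/bin/env bash
# Convert a timestamp given in UTC+8 (as shown in vCenter) to UTC
# (as written in the host logs). Relies on GNU date's -d option.
to_utc() {
    date -u -d "$1 +0800" '+%Y-%m-%d %H:%M:%S'
}

to_utc '2024-07-18 13:37:21'   # prints 2024-07-18 05:37:21
```

The resulting hour and minute can then be used as the search prefix when filtering the host logs, for example with grep on the vmkernel and hostd log files.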

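VMkernel log:

hostd log:

Taken together, the vmkernel and hostd logs only show that the guest operating system was restarted. They do not reveal what directly caused the reboot, so we turn to troubleshooting inside the guest operating system.

Under /var/crash, a vmcore dump file had been generated automatically. Because it resides on a production machine, we copy the dump to a test environment.

For the analysis I used a machine with the same kernel version as the production machine.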

[root@demo ~]# cd /opt/

# Download the kernel-debuginfo package that matches the production machine's kernel version
[root@demo opt]# wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-3.10.0-957.el7.x86_64.rpm

[root@demo opt]# mkdir test
[root@demo opt]# mv kernel-debuginfo-3.10.0-957.el7.x86_64.rpm  test/

[root@demo opt]# cd test

# rpm2cpio converts the RPM into a cpio archive; cpio then extracts its contents into the current directory
[root@demo test]# rpm2cpio kernel-debuginfo-3.10.0-957.el7.x86_64.rpm | cpio -idmv

# Inspect the directory tree to confirm that the RPM contents were extracted into the current directory
[root@demo test]# tree .|head 
.
├── kernel-debuginfo-3.10.0-957.el7.x86_64.rpm
└── usr
    └── lib
        └── debug
            ├── lib
            │   └── modules
            │       └── 3.10.0-957.el7.x86_64
            │           ├── kernel
            │           │   ├── arch

# Locate the vmlinux file from the RPM and copy it to /opt/ for the following steps
[root@demo test]# find ./ -name "vmlinux"
[root@demo test]# cp ./usr/lib/debug/lib/modules/3.10.0-957.el7.x86_64/vmlinux /opt/
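Before starting the analysis it can be useful to confirm that the extracted vmlinux really belongs to the same release as the crashed kernel (3.10.0-957.el7.x86_64 here). The following is a minimal sketch, assuming GNU grep and assuming that the image embeds the usual "Linux version <release>" banner string; vmlinux_release is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the kernel release found in the "Linux version ..." banner of a
# vmlinux image. -a treats the binary as text, -o prints only the match.
vmlinux_release() {
    grep -a -o 'Linux version [0-9][^ ]*' "$1" | head -n1 | awk '{print $3}'
}

# Only run against the real file when it is present.
if [ -f /opt/vmlinux ]; then
    vmlinux_release /opt/vmlinux
fi
```

If the printed release differs from the RELEASE field of the dump, the symbols will not match and a different debuginfo package is needed.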

Preparation is complete. Next, analyze the vmcore file with the crash utility.

[root@demo test]# crash /opt/vmlinux /root/vmcore

crash 7.2.3-8.el7
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [710MB]: patching 85605 gdb minimal_symbol values

      KERNEL: /opt/vmlinux                                             
    DUMPFILE: /root/vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Thu Jul 18 13:37:21 2024
      UPTIME: 786 days, 20:01:15
LOAD AVERAGE: 1.36, 1.38, 1.45
       TASKS: 720
    NODENAME: 10-41-16-16
     RELEASE: 3.10.0-957.el7.x86_64
     VERSION: #1 SMP Thu Nov 8 23:39:32 UTC 2018
     MACHINE: x86_64  (2095 Mhz)
      MEMORY: 8 GB
       PANIC: "general protection fault: 0000 [#1] SMP "
         PID: 56383
     COMMAND: "XXXX"
        TASK: ffff98c1fcfbc100  [THREAD_INFO: ffff98c1d3bb8000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt 
PID: 56383  TASK: ffff98c1fcfbc100  CPU: 1   COMMAND: "XXXX"
 #0 [ffff98c3f5643aa8] machine_kexec at ffffffffad663674
 #1 [ffff98c3f5643b08] __crash_kexec at ffffffffad71ce12
 #2 [ffff98c3f5643bd8] crash_kexec at ffffffffad71cf00
 #3 [ffff98c3f5643bf0] oops_end at ffffffffadd6c758
 #4 [ffff98c3f5643c18] die at ffffffffad62f95b
 #5 [ffff98c3f5643c48] do_general_protection at ffffffffadd6c052
 #6 [ffff98c3f5643c80] general_protection at ffffffffadd6b6f8
    [exception RIP: kmem_cache_alloc+116]
    RIP: ffffffffad81bcf4  RSP: ffff98c3f5643d30  RFLAGS: 00010286
    RAX: 0000000000000000  RBX: 0000000000000780  RCX: 0000002447835d81
    RDX: 0000002447835d80  RSI: 0000000000000020  RDI: ffff98c33fc07700
    RBP: ffff98c3f5643d60   R8: 000000000001f120   R9: ffffffffadc279da
    R10: ffff98c3e6e60f78  R11: ffff98c1f534c8c0  R12: 746e657645534620
    R13: 0000000000000020  R14: ffff98c33fc07700  R15: ffff98c33fc07700
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff98c3f5643d68] __build_skb at ffffffffadc279da
 #8 [ffff98c3f5643d88] __netdev_alloc_skb at ffffffffadc27cc9
 #9 [ffff98c3f5643dc0] vmxnet3_rq_rx_complete at ffffffffc036b5f6 [vmxnet3]
#10 [ffff98c3f5643e48] vmxnet3_poll_rx_only at ffffffffc036c0f6 [vmxnet3]
#11 [ffff98c3f5643e78] net_rx_action at ffffffffadc39e9f
#12 [ffff98c3f5643ef8] __do_softirq at ffffffffad6a0f05
#13 [ffff98c3f5643f68] call_softirq at ffffffffadd7832c
#14 [ffff98c3f5643f80] do_softirq at ffffffffad62e675
#15 [ffff98c3f5643fa0] irq_exit at ffffffffad6a1285
#16 [ffff98c3f5643fb8] do_IRQ at ffffffffadd795e6
--- <IRQ stack> ---
#17 [ffff98c1d3bbb208] ret_from_intr at ffffffffadd6b362
    [exception RIP: extract_dns_request+56]
    RIP: ffffffffc15ce188  RSP: ffff98c1d3bbb2b8  RFLAGS: 00000217
    RAX: 00000000003e3016  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 00000000003e3015  RSI: ffff98c1d3bbb330  RDI: 00000000ffffff8f
    RBP: ffff98c1d3bbb2b8   R8: 0000000000000000   R9: ffff98c3d4efde38
    R10: 00000000ffffff8f  R11: 0000000000000001  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffcc  CS: 0010  SS: 0018
#18 [ffff98c1d3bbb2c0] handle_dns_query at ffffffffc15cebea [uniedr_edr]
    RIP: 10ffff98c3f75298  RSP: 0000000000000000  RFLAGS: 01000009
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: ff00000000000000
    RDX: 000000ffff0000ff  RSI: 0000000000000000  RDI: 4500000000000000
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 000000081e000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 3500000000000000
    ORIG_RAX: 00000006c0000001  CS: ffff98c3f75299  SS: 4000000000000000
bt: WARNING: possibly bogus exception frame

# The output above shows the crash type: PANIC: "general protection fault: 0000 [#1] SMP "
# The task running at the time of the crash was PID 56383 (task_struct address ffff98c1fcfbc100)
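One detail in the exception frame of kmem_cache_alloc+116 is worth a closer look: R12 holds 746e657645534620, which is not a canonical x86_64 address and consists entirely of printable ASCII bytes. The sketch below decodes such a register value in little-endian byte order. That R12 is the pointer being dereferenced at the faulting instruction is an assumption, not something the dump output states; it would have to be confirmed by disassembling the function. reg_to_ascii is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Decode a 64-bit register value (hex, no 0x prefix) as ASCII text.
# x86_64 is little-endian, so the last hex byte pair is the first byte
# in memory; walk the string from the end to the start.
reg_to_ascii() {
    local hex=$1 out="" i
    for (( i = ${#hex} - 2; i >= 0; i -= 2 )); do
        out+="\\x${hex:i:2}"
    done
    printf '%b' "$out"
}

reg_to_ascii 746e657645534620; echo   # prints " FSEvent" (with a leading space)
```

If the assumption holds, text in a pointer register would suggest that string data was written over slab memory before the allocation, which is consistent with a general protection fault rather than an ordinary page fault.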

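Next, use the task address to view more detail. bt -t lists every text (function) symbol found on the stack, not only the frames resolved by the normal backtrace.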

crash> bt -t ffff98c1fcfbc100
PID: 56383  TASK: ffff98c1fcfbc100  CPU: 1   COMMAND: "XXXXX"
              START: machine_kexec at ffffffffad663674
  [ffff98c3f5643aa8] machine_kexec at ffffffffad663674
  [ffff98c3f5643b08] __crash_kexec at ffffffffad71ce12
  [ffff98c3f5643b50] __build_skb at ffffffffadc279da
  [ffff98c3f5643b90] kmem_cache_alloc at ffffffffad81bcf4
  [ffff98c3f5643bd8] crash_kexec at ffffffffad71cf00
  [ffff98c3f5643bf0] oops_end at ffffffffadd6c758
  [ffff98c3f5643c18] die at ffffffffad62f95b
  [ffff98c3f5643c48] do_general_protection at ffffffffadd6c052
  [ffff98c3f5643c80] general_protection at ffffffffadd6b6f8
  [ffff98c3f5643cc8] __build_skb at ffffffffadc279da
  [ffff98c3f5643d08] kmem_cache_alloc at ffffffffad81bcf4
  [ffff98c3f5643d68] __build_skb at ffffffffadc279da
  [ffff98c3f5643d88] __netdev_alloc_skb at ffffffffadc27cc9
  [ffff98c3f5643dc0] vmxnet3_rq_rx_complete at ffffffffc036b5f6 [vmxnet3]
  [ffff98c3f5643e48] vmxnet3_poll_rx_only at ffffffffc036c0f6 [vmxnet3]
  [ffff98c3f5643e78] net_rx_action at ffffffffadc39e9f
  [ffff98c3f5643ef8] __do_softirq at ffffffffad6a0f05
  [ffff98c3f5643f68] call_softirq at ffffffffadd7832c
  [ffff98c3f5643f80] do_softirq at ffffffffad62e675
  [ffff98c3f5643fa0] irq_exit at ffffffffad6a1285
  [ffff98c3f5643fb8] do_IRQ at ffffffffadd795e6
  [ffff98c3f5643ff0] ret_from_intr at ffffffffadd6b362
--- <IRQ stack> ---
  [ffff98c1d3bbb208] ret_from_intr at ffffffffadd6b362
    [exception RIP: extract_dns_request+56]
    RIP: ffffffffc15ce188  RSP: ffff98c1d3bbb2b8  RFLAGS: 00000217
    RAX: 00000000003e3016  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 00000000003e3015  RSI: ffff98c1d3bbb330  RDI: 00000000ffffff8f
    RBP: ffff98c1d3bbb2b8   R8: 0000000000000000   R9: ffff98c3d4efde38
    R10: 00000000ffffff8f  R11: 0000000000000001  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffcc  CS: 0010  SS: 0018
  [ffff98c1d3bbb2c0] handle_dns_query at ffffffffc15cebea [uniedr_edr]
    RIP: 10ffff98c3f75298  RSP: 0000000000000000  RFLAGS: 01000009
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: ff00000000000000
    RDX: 000000ffff0000ff  RSI: 0000000000000000  RDI: 4500000000000000
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 000000081e000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 3500000000000000
    ORIG_RAX: 00000006c0000001  CS: ffff98c3f75299  SS: 4000000000000000
bt: WARNING: possibly bogus exception frame

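Summary:

  A backtrace lists the most recent function first, so the sequence of events is read from the bottom up. Process "XXXX" (PID 56383) was executing handle_dns_query in the uniedr_edr kernel module when a hardware interrupt arrived (do_IRQ). The network receive softirq (net_rx_action) then ran the vmxnet3 receive path, which called __netdev_alloc_skb, __build_skb and kmem_cache_alloc. The general protection fault was raised inside kmem_cache_alloc (general_protection, do_general_protection), which matches the crash type seen earlier: PANIC: "general protection fault: 0000 [#1] SMP ". The kernel then went through die and oops_end into crash_kexec and machine_kexec, the kdump path that saved the vmcore and rebooted the system. machine_kexec on the START line is therefore the last function executed, not the first. The extra __build_skb and kmem_cache_alloc entries that bt -t shows between __crash_kexec and crash_kexec are text addresses that happen to be stored on the stack (they equal the saved R9 and RIP values), not additional calls.

  The task on the CPU at the time of the crash was COMMAND: "XXXX" (the real name was replaced with XXXX by the author), which belongs to an application running inside the operating system. The fault itself occurred in interrupt context, so this is a starting point rather than proof of the root cause, but it justifies focusing the next round of investigation on that application and the kernel module it was executing in.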
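As a possible next step, a few further crash commands can be run in batch mode against the same dump. This is only a sketch: the selection of commands is a suggestion and not part of the original investigation, the paths /opt/vmlinux and /root/vmcore are the ones used earlier in this post, and write_crash_cmds is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Write a crash command file with follow-up queries.
write_crash_cmds() {
    cat > "$1" <<'EOF'
log
mod -t
dis -r kmem_cache_alloc+116
kmem -s
exit
EOF
}

cmds=$(mktemp)
write_crash_cmds "$cmds"

# Run the commands only when crash and both input files are available.
# -s suppresses the startup banner, -i executes the commands in the file.
if command -v crash >/dev/null 2>&1 && [ -f /opt/vmlinux ] && [ -f /root/vmcore ]; then
    crash -s -i "$cmds" /opt/vmlinux /root/vmcore
fi
```

Here log prints the kernel ring buffer leading up to the panic, mod -t lists modules that taint the kernel, dis -r disassembles kmem_cache_alloc up to the faulting instruction, and kmem -s reports on the slab caches.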

 

From: https://www.cnblogs.com/Ky150/p/18310569
