
Kubernetes: Troubleshooting a Cluster etcd Failure


Preface

A record of troubleshooting a Kubernetes cluster running on local virtual machines.

Two issues were fixed: 1. expired Kubernetes certificates; 2. corrupted etcd data.

Finding the Problem

[root@xuegod63 ~]# kubectl get nodes
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Starting the Investigation

1. Check the status of the docker and kubelet services

[root@xuegod63 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2022-12-23 04:32:33 EST; 23h ago
Docs: https://docs.docker.com
Main PID: 1574 (dockerd)
Tasks: 12
Memory: 98.0M
CGroup: /system.slice/docker.service
└─1574 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

[root@xuegod63 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Sat 2022-12-24 03:51:24 EST; 5s ago
Docs: https://kubernetes.io/docs/
Process: 15700 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
Main PID: 15700 (code=exited, status=255)

2. Initial finding: the kubelet is failing; after a restart attempt also fails, check the detailed logs

[root@xuegod63 ~]# journalctl -xeu kubelet
Dec 24 03:52:36 xuegod63 kubelet[15795]: E1224 03:52:36.043715 15795 bootstrap.go:265] part of the existing bootstrap client certificate is expired: 2022-05-01 08:24:14 +0000 UTC
Dec 24 03:52:36 xuegod63 kubelet[15795]: F1224 03:52:36.043767 15795 server.go:269] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or
Dec 24 03:52:36 xuegod63 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Dec 24 03:52:36 xuegod63 kubelet[15795]: goroutine 1 [running]:
Dec 24 03:52:36 xuegod63 kubelet[15795]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc000010001, 0xc0002ce000, 0xb0, 0xb3)
Dec 24 03:52:36 xuegod63 kubelet[15795]: /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
Dec 24 03:52:36 xuegod63 kubelet[15795]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x70e94e0, 0xc000000003, 0x0, 0x0, 0xc0005f0380, 0x6f53cea, 0x9, 0x10d, 0x411b00)
Dec 24 03:52:36 xuegod63 kubelet[15795]: /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:975 +0x19b
Dec 24 03:52:36 xuegod63 kubelet[15795]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printDepth(0x70e94e0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc000bd5940, 0x1, 0x1)
Dec 24 03:52:36 xuegod63 kubelet[15795]: /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:732 +0x16f
Dec 24 03:52:36 xuegod63 kubelet[15795]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).print(...)
Dec 24 03:52:36 xuegod63 kubelet[15795]: /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:714
Dec 24 03:52:36 xuegod63 kubelet[15795]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.Fatal(...)

3. The failure is caused by expired certificates; check the certificate expiration dates

kubeadm certs check-expiration
or:
for item in `find /etc/kubernetes/ -maxdepth 2 -name "*.crt"`;do echo "==========$item=========";openssl x509 -in $item -text -noout| grep Not;done
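
For a single certificate, the same check can be done directly with openssl (a sketch; the path below is kubeadm's default location for the apiserver certificate):

openssl x509 -enddate -noout -in /etc/kubernetes/pki/apiserver.crt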

4. Renew the Kubernetes certificates

# Script reproduced from https://github.com/yuyicai/update-kube-cert
# Download and run the script
git clone https://github.com/yuyicai/update-kube-cert.git
cd update-kube-cert
chmod 755 update-kubeadm-cert.sh
./update-kubeadm-cert.sh all

# Certificate rollback
The script automatically backs up the /etc/kubernetes directory to /etc/kubernetes.old-$(date +%Y%m%d) (example backup directory name: kubernetes.old-20200325).
If the certificate update fails and you need to roll back, manually copy the /etc/kubernetes.old-$(date +%Y%m%d) backup directory over the /etc/kubernetes directory.
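
A minimal rollback sketch (the date suffix depends on when the script ran; the one below is hypothetical):

mv /etc/kubernetes /etc/kubernetes.broken
cp -a /etc/kubernetes.old-20221224 /etc/kubernetes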

5. Verify that the certificate expiration dates have been updated

[root@xuegod63 ~]# kubeadm certs check-expiration
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Dec 24, 2023 09:50 UTC 364d no
apiserver Dec 21, 2032 09:47 UTC 9y ca no
apiserver-etcd-client Dec 21, 2032 09:47 UTC 9y etcd-ca no
apiserver-kubelet-client Dec 21, 2032 09:47 UTC 9y ca no
controller-manager.conf Dec 24, 2023 09:50 UTC 364d no
etcd-healthcheck-client Dec 21, 2032 09:47 UTC 9y etcd-ca no
etcd-peer Dec 21, 2032 09:47 UTC 9y etcd-ca no
etcd-server Dec 21, 2032 09:47 UTC 9y etcd-ca no
front-proxy-client Dec 21, 2032 09:47 UTC 9y front-proxy-ca no
scheduler.conf Dec 24, 2023 09:50 UTC 364d no

CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Apr 29, 2031 08:24 UTC 8y no
etcd-ca Apr 29, 2031 08:24 UTC 8y no
front-proxy-ca Apr 29, 2031 08:24 UTC 8y no

6. Regenerate the kubeconfig files

# Back up the old kubeconfig files
[root@xuegod63 ~]# mkdir /tmp/k8s
[root@xuegod63 ~]# mv /etc/kubernetes/*.conf /tmp/k8s/
# Regenerate all kubeconfig files
[root@xuegod63 ~]# kubeadm init phase kubeconfig all
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file

7. Restart the services and verify

# Restart docker and kubelet
[root@xuegod63 ~]# systemctl restart docker && systemctl restart kubelet
# The service starts successfully
[root@xuegod63 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Sat 2022-12-24 04:54:30 EST; 1min 34s ago
Docs: https://kubernetes.io/docs/
Main PID: 25514 (kubelet)
Tasks: 22
Memory: 56.4M
CGroup: /system.slice/kubelet.service
├─25514 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-inf...
└─27574 /opt/cni/bin/calico

Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.929825 25514 scope.go:111] [topologymanager] RemoveContainer - Container ID: 80edca1e3891b0c988ffe4812df55371313ce2872b1eb8c9b131c014ec666d72
Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.973837 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.992489 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.992531 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:03 xuegod63 kubelet[25514]: E1224 04:56:03.997385 25514 remote_runtime.go:332] ContainerStatus "714c60ef254de88c7970df58e3b7320c86a7b1f13ef43251166bdc206d24d623" from runtime serv...6bdc206d24d623
Dec 24 04:56:03 xuegod63 kubelet[25514]: E1224 04:56:03.997434 25514 kuberuntime_manager.go:980] getPodContainerStatuses for pod "etcd-xuegod63_kube-system(ef09464435a105e3b9a55b43476955ae)" f...6bdc206d24d623
Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.997499 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.997508 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:04 xuegod63 kubelet[25514]: I1224 04:56:04.270725 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:04 xuegod63 kubelet[25514]: I1224 04:56:04.558293 25514 kubelet.go:449] kubelet nodes not sync

# But the cluster still has a problem: kubectl still cannot reach the apiserver
[root@xuegod63 ~]# kubectl get nodes
The connection to the server localhost:8080 was refused - did you specify the right host or port?

8. Continue troubleshooting; the problem below turns out to be corrupted etcd data

# The kubelet logs keep repeating "kubelet nodes not sync"
# Check the apiserver logs
docker ps -a |grep -i api
docker logs -f 0a88e3a4226a
Flag --insecure-port has been deprecated, This flag has no effect now and will be removed in v1.24.
I1224 10:14:17.637002 1 server.go:632] external host was not specified, using 192.168.124.63
I1224 10:14:17.637561 1 server.go:182] Version: v1.20.4
I1224 10:14:18.013836 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I1224 10:14:18.013863 1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I1224 10:14:18.014008 1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
I1224 10:14:18.014537 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I1224 10:14:18.014572 1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I1224 10:14:18.016042 1 client.go:360] parsed scheme: "endpoint"
I1224 10:14:18.016082 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 <nil> 0 <nil>}]
W1224 10:14:18.016589 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
I1224 10:14:19.015536 1 client.go:360] parsed scheme: "endpoint"
I1224 10:14:19.015645 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 <nil> 0 <nil>}]
W1224 10:14:19.016554 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1224 10:14:19.017501 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1224 10:14:20.018145 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1224 10:14:20.553762 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...


# The logs show the apiserver cannot connect to etcd on port 2379
# Check the etcd logs; they pinpoint the corrupted etcd data
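# First find the etcd container ID (the ID below is specific to this host):
[root@xuegod63 ~]# docker ps -a | grep -i etcd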
[root@xuegod63 ~]# docker logs -f c6ee77a0ff89
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2022-12-24 10:16:19.354078 I | etcdmain: etcd Version: 3.4.13
2022-12-24 10:16:19.354121 I | etcdmain: Git SHA: ae9734ed2
2022-12-24 10:16:19.354125 I | etcdmain: Go Version: go1.12.17
2022-12-24 10:16:19.354127 I | etcdmain: Go OS/Arch: linux/amd64
2022-12-24 10:16:19.354130 I | etcdmain: setting maximum number of CPUs to 3, total number of available CPUs is 3
2022-12-24 10:16:19.354180 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2022-12-24 10:16:19.354239 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2022-12-24 10:16:19.355042 I | embed: name = xuegod63
2022-12-24 10:16:19.355065 I | embed: data dir = /var/lib/etcd
2022-12-24 10:16:19.355070 I | embed: member dir = /var/lib/etcd/member
2022-12-24 10:16:19.355072 I | embed: heartbeat = 100ms
2022-12-24 10:16:19.355074 I | embed: election = 1000ms
2022-12-24 10:16:19.355077 I | embed: snapshot count = 10000
2022-12-24 10:16:19.355083 I | embed: advertise client URLs = https://192.168.124.63:2379
2022-12-24 10:16:19.355086 I | embed: initial advertise peer URLs = https://192.168.124.63:2380
2022-12-24 10:16:19.355090 I | embed: initial cluster =
2022-12-24 10:16:19.822156 I | etcdserver: recovered store from snapshot at index 370037
2022-12-24 10:16:19.823857 C | etcdserver: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc0587e]

goroutine 1 [running]:
go.etcd.io/etcd/etcdserver.NewServer.func1(0xc000298e30, 0xc000296d80)
/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/server.go:334 +0x3e
panic(0xee6840, 0xc00006f480)
/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc000195e20, 0x10c13b2, 0x2a, 0xc000296e50, 0x1, 0x1)
/home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/github.com/coreos/[email protected]/capnslog/pkg_logger.go:75 +0x135
go.etcd.io/etcd/etcdserver.NewServer(0x7ffc1b6d6e49, 0x8, 0x0, 0x0, 0x0, 0x0, 0xc00011af80, 0x1, 0x1, 0xc00011b100, ...)
/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/server.go:464 +0x433c
go.etcd.io/etcd/embed.StartEtcd(0xc000260000, 0xc0000fa580, 0x0, 0x0)
/tmp/etcd-release-3.4.13/etcd/release/etcd/embed/etcd.go:214 +0x988
go.etcd.io/etcd/etcdmain.startEtcd(0xc000260000, 0x10963d6, 0x6, 0x1, 0xc0001f1140)
/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:144 +0x2ef9
go.etcd.io/etcd/etcdmain.Main()
/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/main.go:46 +0x38
main.main()
/tmp/etcd-release-3.4.13/etcd/release/etcd/main.go:28 +0x20

9. Repair etcd

# Because no data backup was ever taken, the only way to get the service running again is to delete the data files. NB: this wipes every pod and resource definition stored in the cluster
rm -rf /var/lib/etcd/*
cp /etc/kubernetes/admin.conf ~/.kube/config
# Restart docker and kubelet; the problem is fixed
[root@xuegod63 ~]# systemctl restart docker && systemctl restart kubelet
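
A quick verification sketch once everything is back up (assuming etcdctl is installed on the host; the certificate paths are kubeadm's defaults):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
kubectl get nodes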

10. Appendix: etcd backup and restore

# Backup (for a binary/systemd-deployed etcd)
# export ETCDCTL_API=3
# etcdctl --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --cacert=/etc/kubernetes/ssl/ca.pem --endpoints=https://192.168.4.181:2379,https://192.168.4.182:2379,https://192.168.4.183:2379 snapshot save /mnt/aps/backup/etcd/snapshot$(date +%Y%m%d).db
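#Verify the snapshot right after it is taken (a sketch; snapshot status ships with etcdctl v3)
# etcdctl --write-out=table snapshot status /mnt/aps/backup/etcd/snapshot$(date +%Y%m%d).db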
# Restore
1. Stop the etcd service on all nodes
2. Back up the current data directory
3. Restore the data with the command below, adjusted to your environment (--name and --initial-advertise-peer-urls differ per node), and run the restore on every etcd node
# export ETCDCTL_API=3
# etcdctl --cert=/etc/etcd/ssl/etcd.pem \
--key=/etc/etcd/ssl/etcd-key.pem \
--cacert=/etc/kubernetes/ssl/ca.pem \
--endpoints=https://192.168.4.181:2379,https://192.168.4.182:2379,https://192.168.4.183:2379 \
--name=etcd1 \
--initial-cluster=etcd1=https://192.168.4.181:2380,etcd2=https://192.168.4.182:2380,etcd3=https://192.168.4.183:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://192.168.4.181:2380 \
--data-dir=/var/lib/etcddata snapshot restore snapshot20200630.db
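
To avoid being caught without a backup again, the snapshot command can be scheduled, e.g. from root's crontab (a sketch reusing the paths above; the etcdctl binary location is an assumption, and % must be escaped in crontab entries):

0 2 * * * ETCDCTL_API=3 /usr/bin/etcdctl --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --cacert=/etc/kubernetes/ssl/ca.pem --endpoints=https://192.168.4.181:2379 snapshot save /mnt/aps/backup/etcd/snapshot$(date +\%Y\%m\%d).db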

From: https://blog.51cto.com/u_14458428/5967462
