Notes on recovering a K8s cluster after a forced VM power-off killed the etcd pod and lost its snapshot data (with no backup)

Date: 2023-02-13 15:02:04
Tags: github, snapshot, VM, liruilongs, vms81, etcd, io, go

Preface

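  • I accidentally pulled the wrong power cable; the VMs were forcibly shut down, and after rebooting the cluster was dead
  • This post records how I dealt with it
  • The power loss cost etcd its snapshot data, and there was no backup, so there was essentially nothing to recover from
  • You could ask a professional DBA to look at the data and see whether anything can be recovered
  • The workaround in this post deletes some of the files in the etcd data directory.
  • The cluster starts again, but everything that had been deployed is lost, including the CNI and the cluster's built-in DNS component.
  • If I have misunderstood anything, corrections are welcome
  • <font color=red>Production or test, always back up etcd for a k8s cluster. Always back up etcd. Always back up etcd. Important things are worth saying three times.</font>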
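
As a minimal sketch of what such a backup could look like on a kubeadm cluster like this one: the function below only prints an etcdctl snapshot save command, so it can be reviewed before being run. The endpoint and certificate paths are the ones that appear in this cluster's etcd.yaml later in the post; the function name, the /backup directory and the timestamped file name are assumptions made for illustration, and etcdctl is assumed to be installed on the master node.

```shell
# Print (do not run) an etcdctl command that saves an etcd snapshot.
# Usage: etcd_backup_cmd [dest-dir] [timestamp]
etcd_backup_cmd() {
  local dest_dir="${1:-/backup}"                   # hypothetical backup directory
  local stamp="${2:-$(date +%Y%m%d-%H%M%S)}"       # used in the snapshot file name
  local pki=/etc/kubernetes/pki/etcd               # kubeadm's etcd certificate directory
  echo "ETCDCTL_API=3 etcdctl" \
    "--endpoints=https://127.0.0.1:2379" \
    "--cacert=${pki}/ca.crt" \
    "--cert=${pki}/server.crt" \
    "--key=${pki}/server.key" \
    "snapshot save ${dest_dir}/etcd-snapshot-${stamp}.db"
}

# Show the command that would be run for a backup into /backup.
etcd_backup_cmd /backup
```

Running the printed command from cron and copying the resulting file off the node is one common arrangement; a snapshot that lives only on the same disk would not have helped in this incident.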

<font color="009688"> I wanted only to try to live in accord with the promptings which came from my true self. Why was that so very difficult? ------ Hermann Hesse, Demian</font>


Current state of the cluster

┌──[[email protected]]-[~]
└─$kubectl get nodes
The connection to the server 192.168.26.81:6443 was refused - did you specify the right host or port?

Restart docker and kubelet to try to bring the cluster back up

┌──[[email protected]]-[~]
└─$systemctl restart docker
┌──[[email protected]]-[~]
└─$systemctl restart kubelet.service

Still no luck. Check the kubelet logs on the master node

┌──[[email protected]]-[~]
└─$journalctl  -u kubelet.service -f
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.703418   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.804201   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.905156   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.005487   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.105648   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.186066   11344 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://192.168.26.81:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vms81.liruilongs.github.io?timeout=10s": dial tcp 192.168.26.81:6443: connect: connection refused
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.205785   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"

Use docker to see which pod containers currently exist

┌──[[email protected]]-[~]
└─$docker ps
CONTAINER ID   IMAGE                                               COMMAND                  CREATED          STATUS              PORTS     NAMES
d9d6471ce936   b51ddc1014b0                                        "kube-scheduler --au…"   17 minutes ago   Up 17 minutes                 k8s_kube-scheduler_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_14
010c1b8c30c6   5425bcbd23c5                                        "kube-controller-man…"   17 minutes ago   Up 17 minutes                 k8s_kube-controller-manager_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_15
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up About a minute             k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
f557435d150e   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes                 k8s_POD_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_7
5deaffbc555a   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes                 k8s_POD_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_7
a418c2ce33f2   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes                 k8s_POD_kube-apiserver-vms81.liruilongs.github.io_kube-system_a35cb37b6c90c72f607936b33161eefe_6

etcd is not running, and neither is kube-apiserver.

┌──[[email protected]]-[~]
└─$docker ps -a | grep etcd
b5e18722315b   004811815584                                        "etcd --advertise-cl…"   5 minutes ago    Exited (2) About a minute ago             k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 21 minutes ago   Up 4 minutes                              k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7

Try restarting etcd

┌──[[email protected]]-[~]
└─$docker restart b5e18722315b
b5e18722315b

Check whether it started

┌──[[email protected]]-[~]
└─$docker ps -a | grep etcd
b5e18722315b   004811815584                                        "etcd --advertise-cl…"   5 minutes ago    Exited (2) About a minute ago             k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 21 minutes ago   Up 4 minutes                              k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[[email protected]]-[~]
└─$docker logs b5e18722315b

Look at the etcd logs

┌──[[email protected]]-[~]
└─$docker logs 8a53cbc545e4
..................................................
{"level":"info","ts":"2023-01-19T01:34:24.332Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"5.557212ms"}
{"level":"warn","ts":"2023-01-19T01:34:24.332Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000014-0000000000185aba.wal.broken"}
{"level":"info","ts":"2023-01-19T01:34:24.770Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":26912747,"snapshot-size":"42 kB"}
{"level":"warn","ts":"2023-01-19T01:34:24.771Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":26912747,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000019aa7eb.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2023-01-19T01:43:31.738Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000114600, 0xc000588240, 0x1, 0x1)
        /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000080960, 0x122e2fc, 0x2a, 0xc000588240, 0x1, 0x1)
        /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffe54af1e25, 0x1a, 0x0, 0x0, 0x0, 0x0, 0xc0004cf830, 0x1, 0x1, 0xc0004cfa70, ...)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc0000ee000, 0xc0000ee600, 0x0, 0x0)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc0000ee000, 0x1202a6f, 0x6, 0xc000428401, 0x2)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a120, 0x12, 0x12)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a120, 0x12, 0x12)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40 +0x11f
main.main()
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32 +0x45

The key message is: "msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)"

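The power loss corrupted the data files. etcd tries to recover its v3 backend from a snapshot, but the file it expects, /var/lib/etcd/member/snap/00000000019aa7eb.snap.db, does not exist.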
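
The file name in that warning is not arbitrary: it is the snapshot index from the log, 26912747, written as 16 hexadecimal digits with a .snap.db suffix. The same hex value is the second half of the highest-numbered .snap file in the data directory listing later in this post. A small sketch of the mapping (the function name is made up for illustration):

```shell
# Map a decimal etcd snapshot index to the .snap.db file name etcd looks for.
snap_db_name() {
  printf '%016x.snap.db\n' "$1"
}

snap_db_name 26912747   # prints 00000000019aa7eb.snap.db
```

This makes it easy to check, from the log alone, which file etcd was looking for and whether anything with that index survived in the snap directory.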

There is no backup here, so there is essentially no way to repair it. The only option is to reset the cluster with kubeadm.

Some remedial measures

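If you want to bring the cluster up some other way, for example to retrieve some of its current configuration, you can try the approach below. Be aware that when I used it on my cluster, all pod data was lost, and in the end I had to reset the cluster anyway.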

<font color=red>If you use the approach below, be sure to back up the etcd data files before deleting them</font>

etcd on the master node runs as a static pod, so look at its YAML manifest to see where the data directory is configured

┌──[[email protected]]-[~]
└─$cd /etc/kubernetes/manifests/
┌──[[email protected]]-[/etc/kubernetes/manifests]
└─$ls
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

The relevant flag is --data-dir=/var/lib/etcd

┌──[[email protected]]-[/etc/kubernetes/manifests]
└─$cat etcd.yaml | grep -e "--"
    - --advertise-client-urls=https://192.168.26.81:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://192.168.26.81:2380
    - --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.81:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.26.81:2380
    - --name=vms81.liruilongs.github.io
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
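
If only the data directory is of interest, the flag can be pulled out of the manifest directly. A sketch, with a hypothetical helper name, that prints the value of --data-dir; the default path is the kubeadm manifest location shown above:

```shell
# Print the value of the --data-dir flag from an etcd static pod manifest.
# Usage: etcd_data_dir [manifest-path]
etcd_data_dir() {
  local manifest="${1:-/etc/kubernetes/manifests/etcd.yaml}"
  # Strip the leading "- --data-dir=" and print only the lines where it matched.
  sed -n 's/^[[:space:]]*-[[:space:]]*--data-dir=//p' "$manifest"
}

# Example on the master node: etcd_data_dir
```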

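These are the corresponding data files. You can try to repair them; if you just want the cluster to start quickly, you can delete the .snap and .wal files, as done further down.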

┌──[[email protected]]-[/var/lib/etcd/member]
└─$tree
.
├── snap
│   ├── 0000000000000058-00000000019a0ba7.snap
│   ├── 0000000000000058-00000000019a32b8.snap
│   ├── 0000000000000058-00000000019a59c9.snap
│   ├── 0000000000000058-00000000019a80da.snap
│   ├── 0000000000000058-00000000019aa7eb.snap
│   └── db
└── wal
    ├── 0000000000000014-0000000000185aba.wal.broken
    ├── 0000000000000142-0000000001963c0e.wal
    ├── 0000000000000143-0000000001977bbe.wal
    ├── 0000000000000144-0000000001986aa6.wal
    ├── 0000000000000145-0000000001995ef6.wal
    ├── 0000000000000146-00000000019a544d.wal
    └── 1.tmp

2 directories, 13 files

Back up the data files

┌──[[email protected]]-[/var/lib/etcd]
└─$ls
member
┌──[[email protected]]-[/var/lib/etcd]
└─$tar -cvf member.tar member/
member/
member/snap/
member/snap/db
member/snap/0000000000000058-00000000019a0ba7.snap
member/snap/0000000000000058-00000000019a32b8.snap
member/snap/0000000000000058-00000000019a59c9.snap
member/snap/0000000000000058-00000000019a80da.snap
member/snap/0000000000000058-00000000019aa7eb.snap
member/wal/
member/wal/0000000000000142-0000000001963c0e.wal
member/wal/0000000000000144-0000000001986aa6.wal
member/wal/0000000000000014-0000000000185aba.wal.broken
member/wal/0000000000000145-0000000001995ef6.wal
member/wal/0000000000000146-00000000019a544d.wal
member/wal/1.tmp
member/wal/0000000000000143-0000000001977bbe.wal
┌──[[email protected]]-[/var/lib/etcd]
└─$ls
member  member.tar
┌──[[email protected]]-[/var/lib/etcd]
└─$mv member.tar  /tmp/
┌──[[email protected]]-[/var/lib/etcd]
└─$
┌──[[email protected]]-[/var/lib/etcd]
└─$rm -rf  member/snap/*.snap
┌──[[email protected]]-[/var/lib/etcd]
└─$rm -rf  member/wal/*.wal
┌──[[email protected]]-[/var/lib/etcd]
└─$
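
Because the deletion cannot be undone, it may be worth confirming that the archive is readable before removing anything. The following is a sketch rather than what was run above: the function name is an assumption, and it only wraps the same tar step with a check that the archive can be listed.

```shell
# Archive the etcd member directory and verify that the archive lists cleanly.
# Usage: backup_member <data-dir> <archive-path>
backup_member() {
  local data_dir="$1"
  local archive="$2"
  [ -d "${data_dir}/member" ] || { echo "no member directory in ${data_dir}" >&2; return 1; }
  # -C keeps the paths inside the archive relative (member/...).
  tar -C "$data_dir" -cf "$archive" member || return 1
  # Listing the archive makes tar read it back from start to end.
  tar -tf "$archive" > /dev/null || return 1
  echo "backup written to ${archive}"
}

# Example, to be run before any rm: backup_member /var/lib/etcd /tmp/member.tar
```

It is generally safer to copy the directory while etcd is not running. For a static pod, that usually means temporarily moving etcd.yaml out of /etc/kubernetes/manifests so kubelet stops restarting the container.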

Restart the corresponding docker container, or restart kubelet.

┌──[[email protected]]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
a3b97cb34d9b   004811815584                                        "etcd --advertise-cl…"   2 minutes ago   Exited (2) 2 minutes ago              k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 3 hours ago     Up 2 hours                            k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[[email protected]]-[/var/lib/etcd]
└─$docker start a3b97cb34d9b
a3b97cb34d9b
┌──[[email protected]]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
e1fc068247af   004811815584                                        "etcd --advertise-cl…"   3 seconds ago   Up 2 seconds                          k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_46
a3b97cb34d9b   004811815584                                        "etcd --advertise-cl…"   3 minutes ago   Exited (2) 3 seconds ago              k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 3 hours ago     Up 2 hours                            k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[[email protected]]-[/var/lib/etcd]
└─$

Check the node status

┌──[[email protected]]-[/var/lib/etcd]
└─$kubectl get nodes
NAME                          STATUS   ROLES    AGE   VERSION
vms155.liruilongs.github.io   Ready    <none>   76s   v1.22.2
vms81.liruilongs.github.io    Ready    <none>   76s   v1.22.2
vms82.liruilongs.github.io    Ready    <none>   76s   v1.22.2
vms83.liruilongs.github.io    Ready    <none>   76s   v1.22.2
┌──[[email protected]]-[/var/lib/etcd]
└─$

List all pods currently in the cluster.

┌──[[email protected]]-[~/ansible/kubevirt]
└─$kubectl get pods -A
NAME                                                 READY   STATUS    RESTARTS         AGE
etcd-vms81.liruilongs.github.io                      1/1     Running   48 (3h35m ago)   3h53m
kube-apiserver-vms81.liruilongs.github.io            1/1     Running   48 (3h35m ago)   3h51m
kube-controller-manager-vms81.liruilongs.github.io   1/1     Running   17 (3h35m ago)   3h51m
kube-scheduler-vms81.liruilongs.github.io            1/1     Running   16 (3h35m ago)   3h52m

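The network-related pods are gone, and the k8s DNS component did not come up either, so the network has to be configured again, which is a bit of a hassle. Normally, if the network components are not running, all nodes should be NotReady, so the Ready status above looks odd. I was short on time and needed the cluster for experiments, so I reset it with kubeadm.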
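
One way to make that comparison explicit is to filter the pod list for the components you expect to see. A sketch, where the helper name and the component names in the example (coredns, kube-proxy, calico) are assumptions about a typical kubeadm plus Calico setup, not something taken from this cluster's original state:

```shell
# Read a pod listing on stdin and print each expected name that does not appear in it.
# Usage: kubectl get pods -A | missing_components <name>...
missing_components() {
  local pods name
  pods="$(cat)"
  for name in "$@"; do
    printf '%s\n' "$pods" | grep -q -- "$name" || echo "$name"
  done
}

# Example: kubectl get pods -A | missing_components coredns kube-proxy calico
```

An empty result means every expected name was found; any name printed is a component that needs to be redeployed.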

┌──[[email protected]]-[~/ansible]
└─$kubectl apply -f calico.yaml

References


https://github.com/etcd-io/etcd/issues/11949

From: https://blog.51cto.com/liruilong/6054149
