
Kubernetes Cluster Disaster Recovery Setup

Posted: 2023-09-12 14:35:39
Tags: https, Kubernetes, cluster, etcd, 172.16, k8s, data, disaster recovery

etcd is a critically important service in a kubernetes cluster: it stores all of the cluster's state, such as Namespaces, Pods, Services, and routing information. If the etcd cluster suffers a disaster or loses its data, recovery of the k8s cluster itself is at risk. Backing up the etcd data is therefore the foundation of a kubernetes disaster recovery setup.

1. etcd Cluster Backup

The etcdctl command differs slightly between etcd versions, but the overall usage is similar. Here we take a point-in-time backup with "snapshot save".

A few points to note:

  • The backup only needs to run on one node of the etcd cluster.
  • The etcd v3 API is used here: starting with k8s 1.13, k8s no longer supports etcd v2, so all cluster data lives in the v3 store. Consequently only data written through the v3 API is backed up; data written through the v2 API is not.
  • This walkthrough uses a binary-deployed k8s v1.18.6 + Calico environment ("ETCDCTL_API=3 etcdctl" in the commands below is equivalent to "etcdctl").

1) Before backing up, take a look at the existing etcd data

etcd data directory:
[root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh |grep "ETCD_DATA_DIR="
export ETCD_DATA_DIR="/data/k8s/etcd/data"


etcd WAL directory:
[root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh |grep "ETCD_WAL_DIR="
export ETCD_WAL_DIR="/data/k8s/etcd/wal"


[root@k8s-master01 ~]# ls /data/k8s/etcd/data/
member
[root@k8s-master01 ~]# ls /data/k8s/etcd/data/member/
snap
[root@k8s-master01 ~]# ls /data/k8s/etcd/wal/
0000000000000000-0000000000000000.wal  0.tmp

2) Back up the etcd cluster data

Run the backup on one node of the etcd cluster, then copy the backup file to the other nodes.

First, create a backup directory on every etcd node:

# mkdir -p /data/etcd_backup_dir

Run the backup on one of the etcd nodes (here, k8s-master01):

[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints=https://172.16.60.231:2379 snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db
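A snapshot you cannot restore from is worse than none, so it pays to sanity-check each backup file. A minimal sketch (the snapshot_ok helper is illustrative, not part of etcdctl; a full integrity check would additionally run "ETCDCTL_API=3 etcdctl snapshot status <file>" on a node where etcdctl is installed):

```shell
#!/usr/bin/env bash
# Sanity-check a snapshot file before trusting it: it must exist and be
# non-empty. This does not validate the internal format; for that, run
# "ETCDCTL_API=3 etcdctl snapshot status <file>" on the node.
snapshot_ok() {
  local f="$1"
  if [ ! -s "$f" ]; then
    echo "snapshot missing or empty: $f" >&2
    return 1
  fi
  echo "snapshot looks usable: $f"
}

# Demo against a throwaway file; in practice, point it at the .db above,
# e.g. snapshot_ok "/data/etcd_backup_dir/etcd-snapshot-$(date +%Y%m%d).db"
demo=$(mktemp)
printf 'fake snapshot bytes' > "$demo"
snapshot_ok "$demo"
rm -f "$demo"
```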


Copy the backup file to the other etcd nodes:

[root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master02:/data/etcd_backup_dir/
[root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master03:/data/etcd_backup_dir/

You can put the backup command from k8s-master01 above into a script and schedule it with crontab:

[root@k8s-master01 ~]# cat /data/etcd_backup_dir/etcd_backup.sh
#!/usr/bin/bash


date  # log the run start time
CACERT="/etc/kubernetes/cert/ca.pem"
CERT="/etc/etcd/cert/etcd.pem"
KEY="/etc/etcd/cert/etcd-key.pem"
ENDPOINTS="172.16.60.231:2379"


ETCDCTL_API=3 /opt/k8s/bin/etcdctl \
--cacert="${CACERT}" --cert="${CERT}" --key="${KEY}" \
--endpoints=${ENDPOINTS} \
snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db


# Keep backups for 30 days
find /data/etcd_backup_dir/ -name "*.db" -mtime +30 -exec rm -f {} \;


# Sync the backups to the other two etcd nodes
/bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master02:/data/etcd_backup_dir/
/bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master03:/data/etcd_backup_dir/

Set up a crontab entry so the backup runs at 05:00 every day:

[root@k8s-master01 ~]# chmod 755 /data/etcd_backup_dir/etcd_backup.sh
[root@k8s-master01 ~]# crontab -l
# etcd cluster data backup
0 5 * * * /bin/bash -x /data/etcd_backup_dir/etcd_backup.sh > /dev/null 2>&1
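Cron failures are silent, so it also helps to verify that a fresh backup actually landed. A small sketch, assuming the backup directory above and a one-day freshness threshold (the backup_fresh helper is illustrative):

```shell
#!/usr/bin/env bash
# Alert if no *.db backup in a directory is newer than MAX_AGE_DAYS.
# backup_fresh <dir> [max_age_days] -> 0 when a fresh backup exists.
backup_fresh() {
  local dir="$1" max_age_days="${2:-1}"
  # find prints any .db modified within the last max_age_days days
  if [ -n "$(find "$dir" -name '*.db' -mtime -"$max_age_days" 2>/dev/null | head -n 1)" ]; then
    echo "fresh backup found in $dir"
  else
    echo "no backup newer than ${max_age_days} day(s) in $dir" >&2
    return 1
  fi
}

# Demo against a throwaway directory; in practice:
#   backup_fresh /data/etcd_backup_dir 1 || <send alert>
d=$(mktemp -d)
touch "$d/etcd-snapshot-demo.db"
backup_fresh "$d" 1
rm -rf "$d"
```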

2. etcd Cluster Restore

The etcd backup only needs to be taken on one of the etcd nodes and then copied to the others.

The restore, however, must be performed on every etcd node!

1) Simulating etcd data loss

Delete the data on all three etcd nodes (or simply remove the data directory):

# rm -rf /data/k8s/etcd/data/*

Check the k8s cluster status:

[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                           ERROR
etcd-2               Unhealthy   Get https://172.16.60.233:2379/health: dial tcp 172.16.60.233:2379: connect: connection refused
etcd-1               Unhealthy   Get https://172.16.60.232:2379/health: dial tcp 172.16.60.232:2379: connect: connection refused
etcd-0               Unhealthy   Get https://172.16.60.231:2379/health: dial tcp 172.16.60.231:2379: connect: connection refused
scheduler            Healthy     ok
controller-manager   Healthy     ok

Because the etcd service is still running on all three nodes, the cluster status returns to normal after a short while:

[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health":"true"}
etcd-2               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}


[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
https://172.16.60.231:2379 is healthy: successfully committed proposal: took = 9.918673ms
https://172.16.60.233:2379 is healthy: successfully committed proposal: took = 10.985279ms
https://172.16.60.232:2379 is healthy: successfully committed proposal: took = 13.422545ms


[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem member list --write-out=table
+------------------+---------+------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |    NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+------------+----------------------------+----------------------------+------------+
| 1d1d7edbba38c293 | started | k8s-etcd03 | https://172.16.60.233:2380 | https://172.16.60.233:2379 |      false |
| 4c0cfad24e92e45f | started | k8s-etcd02 | https://172.16.60.232:2380 | https://172.16.60.232:2379 |      false |
| 79cf4f0a8c3da54b | started | k8s-etcd01 | https://172.16.60.231:2380 | https://172.16.60.231:2379 |      false |
+------------------+---------+------------+----------------------------+----------------------------+------------+

As seen above, all three members have rejoined, but no leader has been elected yet (at this point "endpoint status" reports IS LEADER as false on every member). Restart the etcd service on all three nodes:

# systemctl restart etcd

After the restart, checking again shows that a leader has been elected and the cluster state is normal:

[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem   --cert=/etc/etcd/cert/etcd.pem   --key=/etc/etcd/cert/etcd-key.pem   --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" endpoint status
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.16.60.231:2379 | 79cf4f0a8c3da54b |   3.4.9 |  1.6 MB |      true |      false |         5 |      24658 |              24658 |        |
| https://172.16.60.232:2379 | 4c0cfad24e92e45f |   3.4.9 |  1.6 MB |     false |      false |         5 |      24658 |              24658 |        |
| https://172.16.60.233:2379 | 1d1d7edbba38c293 |   3.4.9 |  1.7 MB |     false |      false |         5 |      24658 |              24658 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
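When scripting this kind of recovery, a retry loop avoids racing the restart instead of re-running the check by hand. A sketch of the pattern; the probed command is stubbed with "true" here, and in practice it would be the etcdctl endpoint health call above:

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds, up to N attempts with a fixed delay.
# wait_for <attempts> <delay_seconds> <command...>
wait_for() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}

# Demo with a stub that always succeeds; in practice, something like:
#   wait_for 30 2 etcdctl --endpoints=... endpoint health
wait_for 3 0 true
```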

However, the k8s cluster data is actually gone: the pods and other resources in the namespaces no longer exist. This is where the etcd backup comes in: the cluster must now be recovered from the snapshot file taken above.

[root@k8s-master01 ~]# kubectl get ns
NAME              STATUS   AGE
default           Active   9m47s
kube-node-lease   Active   9m39s
kube-public       Active   9m39s
kube-system       Active   9m47s
[root@k8s-master01 ~]# kubectl get pods -n kube-system
No resources found in kube-system namespace.
[root@k8s-master01 ~]# kubectl get pods --all-namespaces
No resources found

2) Restoring the etcd data (which restores the kubernetes cluster data)

Before restoring, stop the kube-apiserver service on every master node and then the etcd service on every etcd node:

# systemctl stop kube-apiserver
# systemctl stop etcd

Important: before restoring, remove the old data and WAL working directories (/data/k8s/etcd/data and /data/k8s/etcd/wal) on every etcd node; otherwise the restore fails, with the restore command reporting that the data directory already exists.

# rm -rf /data/k8s/etcd/data && rm -rf /data/k8s/etcd/wal
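Because "snapshot restore" refuses to run when the target directories already exist, a restore script can guard for this explicitly. A minimal sketch using the paths from this environment (check_clean is an illustrative helper):

```shell
#!/usr/bin/env bash
# Refuse to proceed with a restore while old data/WAL directories remain.
# check_clean <dir>... -> 0 if none of the dirs exist, 1 otherwise.
check_clean() {
  local d rc=0
  for d in "$@"; do
    if [ -e "$d" ]; then
      echo "refusing to restore: $d still exists (remove it first)" >&2
      rc=1
    fi
  done
  return $rc
}

# In practice: check_clean /data/k8s/etcd/data /data/k8s/etcd/wal || exit 1
# Demo against throwaway paths that do not exist:
check_clean "/nonexistent/demo-data" "/nonexistent/demo-wal" && echo "clean, safe to restore"
```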

Run the restore on each etcd node:

On node 172.16.60.231
-------------------------------------------------------
ETCDCTL_API=3 etcdctl \
--name=k8s-etcd01 \
--endpoints="https://172.16.60.231:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://172.16.60.231:2380 \
--initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db




On node 172.16.60.232
-------------------------------------------------------
ETCDCTL_API=3 etcdctl \
--name=k8s-etcd02 \
--endpoints="https://172.16.60.232:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://172.16.60.232:2380 \
--initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db




On node 172.16.60.233
-------------------------------------------------------
ETCDCTL_API=3 etcdctl \
--name=k8s-etcd03 \
--endpoints="https://172.16.60.233:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://172.16.60.233:2380 \
--initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
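The three restore commands differ only in the member name and IP, so they can be generated from a single template. A sketch that prints each node's command for review rather than running it (note: "snapshot restore" operates on the local snapshot file, so the TLS and endpoint flags used above are not strictly required and are omitted here):

```shell
#!/usr/bin/env bash
# Print the per-node "snapshot restore" command from a name/IP table.
# Review the output, then run each command on the matching node.
SNAPSHOT="/data/etcd_backup_dir/etcd-snapshot-20200820.db"
CLUSTER="k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380"

restore_cmd() {
  local name="$1" ip="$2"
  printf 'ETCDCTL_API=3 etcdctl --name=%s --initial-cluster-token=etcd-cluster-0 --initial-advertise-peer-urls=https://%s:2380 --initial-cluster=%s --data-dir=/data/k8s/etcd/data --wal-dir=/data/k8s/etcd/wal snapshot restore %s\n' \
    "$name" "$ip" "$CLUSTER" "$SNAPSHOT"
}

restore_cmd k8s-etcd01 172.16.60.231
restore_cmd k8s-etcd02 172.16.60.232
restore_cmd k8s-etcd03 172.16.60.233
```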

Start the etcd service on each etcd node in turn, and verify it:

# systemctl start etcd
# systemctl status etcd

Check the etcd cluster status (as shown below, a leader has been elected successfully):

[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
https://172.16.60.232:2379 is healthy: successfully committed proposal: took = 12.837393ms
https://172.16.60.233:2379 is healthy: successfully committed proposal: took = 13.306671ms
https://172.16.60.231:2379 is healthy: successfully committed proposal: took = 13.602805ms


[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem   --cert=/etc/etcd/cert/etcd.pem   --key=/etc/etcd/cert/etcd-key.pem   --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" endpoint status
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.16.60.231:2379 | 79cf4f0a8c3da54b |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
| https://172.16.60.232:2379 | 4c0cfad24e92e45f |   3.4.9 |  9.0 MB |      true |      false |         2 |         13 |                 13 |        |
| https://172.16.60.233:2379 | 5f70664d346a6ebd |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Then start the kube-apiserver service on each master node in turn:

# systemctl start kube-apiserver
# systemctl status kube-apiserver

Check the kubernetes cluster status:

[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS      MESSAGE                                  ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-2               Unhealthy   HTTP probe failed with statuscode: 503
etcd-1               Unhealthy   HTTP probe failed with statuscode: 503
etcd-0               Unhealthy   HTTP probe failed with statuscode: 503


Because etcd has only just been restarted, refresh the status a few more times and it returns to healthy:

[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-2               Healthy   {"health":"true"}
etcd-0               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}

Check the kubernetes resources:

[root@k8s-master01 ~]# kubectl get ns
NAME              STATUS   AGE
default           Active   7d4h
kevin             Active   5d18h
kube-node-lease   Active   7d4h
kube-public       Active   7d4h
kube-system       Active   7d4h


[root@k8s-master01 ~]# kubectl get pods --all-namespaces
NAMESPACE     NAME                                       READY   STATUS              RESTARTS   AGE
default       dnsutils-ds-22q87                          0/1     ContainerCreating   171        7d3h
default       dnsutils-ds-bp8tm                          0/1     ContainerCreating   138        5d18h
default       dnsutils-ds-bzzqg                          0/1     ContainerCreating   138        5d18h
default       dnsutils-ds-jcvng                          1/1     Running             171        7d3h
default       dnsutils-ds-xrl2x                          0/1     ContainerCreating   138        5d18h
default       dnsutils-ds-zjg5l                          1/1     Running             0          7d3h
default       kevin-t-84cdd49d65-ck47f                   0/1     ContainerCreating   0          2d2h
default       nginx-ds-98rm2                             1/1     Running             2          7d3h
default       nginx-ds-bbx68                             1/1     Running             0          7d3h
default       nginx-ds-kfctv                             0/1     ContainerCreating   1          5d18h
default       nginx-ds-mdcd9                             0/1     ContainerCreating   1          5d18h
default       nginx-ds-ngqcm                             1/1     Running             0          7d3h
default       nginx-ds-tpcxs                             0/1     ContainerCreating   1          5d18h
kevin         nginx-ingress-controller-797ffb479-vrq6w   0/1     ContainerCreating   0          5d18h
kevin         test-nginx-7d4f96b486-qd4fl                0/1     ContainerCreating   0          2d1h
kevin         test-nginx-7d4f96b486-qfddd                0/1     Running             0          2d1h
kube-system   calico-kube-controllers-578894d4cd-9rp4c   1/1     Running             1          7d3h
kube-system   calico-node-d7wq8                          0/1     PodInitializing     1          7d3h

After the etcd cluster data is restored, the pod containers gradually return to the Running state. At this point the whole kubernetes cluster has been recovered from the etcd backup.

3. Summary

Backing up a Kubernetes cluster mainly means backing up the etcd cluster. When restoring, the key is to follow the right order:

Stop kube-apiserver → stop etcd → restore the data → start etcd → start kube-apiserver
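The sequence above can be captured in a small driver script so no step is skipped or reordered. A sketch with each step stubbed by echo; in a real run the stubs become the systemctl and etcdctl commands, executed on the appropriate nodes (e.g. over ssh):

```shell
#!/usr/bin/env bash
set -e  # abort the whole sequence if any step fails

# Each step is a stub; replace echo with the real command for that step.
stop_apiserver()  { echo "stop kube-apiserver"; }   # systemctl stop kube-apiserver
stop_etcd()       { echo "stop etcd"; }             # systemctl stop etcd
restore_data()    { echo "restore snapshot"; }      # etcdctl snapshot restore ...
start_etcd()      { echo "start etcd"; }            # systemctl start etcd
start_apiserver() { echo "start kube-apiserver"; }  # systemctl start kube-apiserver

stop_apiserver
stop_etcd
restore_data
start_etcd
start_apiserver
```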

In particular:

  • When backing up, take the etcd snapshot on a single node, then sync it to the other nodes.
  • When restoring, use that single node's backup file to restore every member.

From: https://blog.51cto.com/u_64214/7445258
