1. Check that the container runtime service, containerd, is running on every machine that will join the Kubernetes cluster.
$ systemctl status containerd
● containerd.service - containerd container runtime
Loaded: loaded (/usr/local/lib/systemd/system/containerd.service; enabled;>
Active: active (running) since Fri 2024-02-09 15:10:34 CST; 12min ago
Docs: https://containerd.io
Process: 906 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUC>
Main PID: 915 (containerd)
Tasks: 385
Memory: 221.1M
CPU: 24.201s
CGroup: /system.slice/containerd.service
├─ 915 /usr/local/bin/containerd
├─2593 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
├─2613 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
├─2614 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
├─2615 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
├─3230 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
├─3588 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
├─5795 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
├─5840 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
├─5873 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
├─5919 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
├─6415 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
├─6969 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -i>
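When several machines are involved, this check can be scripted. The following is only a minimal sketch, assuming systemd manages the runtime; the helper name check_unit is hypothetical and not part of any Kubernetes tooling:

```shell
#!/usr/bin/env bash
# check_unit is a hypothetical helper, not part of any Kubernetes tooling.
# It reports whether the given systemd unit is currently active.
check_unit() {
  local unit
  unit="$1"
  # `systemctl is-active --quiet` exits 0 only when the unit is active.
  if systemctl is-active --quiet "$unit"; then
    echo "$unit: running"
  else
    echo "$unit: NOT running"
    return 1
  fi
}
```

Run check_unit containerd on each machine, for example over SSH, before joining it to the cluster.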
2. initialize the control-plane node
kubeadm init --image-repository=registry.aliyuncs.com/google_containers --pod-network-cidr=10.244.0.0/16
kubeadm init first runs a series of prechecks to ensure that the machine is ready to run Kubernetes. These prechecks expose warnings and exit on errors. kubeadm init then downloads and installs the cluster control plane components. This may take several minutes. After it finishes you should see:
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a Pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
/docs/concepts/cluster-administration/addons/
You can now join any number of machines by running the following on each node
as root:
kubeadm join <control-plane-host>:<control-plane-port> --token <token> --discovery-token-ca-cert-hash sha256:<hash>
To make kubectl work for your non-root user, run these commands, which are also part of the kubeadm init output:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Make a record of the kubeadm join command that kubeadm init outputs. You need this command to join nodes to your cluster.
The token is used for mutual authentication between the control-plane node and the joining nodes. The token included here is secret. Keep it safe, because anyone with this token can add authenticated nodes to your cluster. These tokens can be listed, created, and deleted with the kubeadm token command.
After kubeadm init:
(base) maye@maye-Inspiron-5547:~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-5bbd96d687-lwh2x 0/1 ContainerCreating 0 90s
kube-system coredns-5bbd96d687-qhc5c 0/1 ContainerCreating 0 90s
kube-system etcd-maye-inspiron-5547 1/1 Running 2 105s
kube-system kube-apiserver-maye-inspiron-5547 1/1 Running 1 103s
kube-system kube-controller-manager-maye-inspiron-5547 1/1 Running 0 102s
kube-system kube-proxy-54q76 1/1 Running 0 90s
kube-system kube-scheduler-maye-inspiron-5547 1/1 Running 1 109s
Note:
- The default image repository "registry.k8s.io" is not accessible in mainland China, so specify an accessible repository with --image-repository=registry.aliyuncs.com/google_containers.
- If --pod-network-cidr=10.244.0.0/16 is not specified, the flannel pods will not become ready when the pod network addon, flannel, is deployed, and the kubelet log shows this error:
$ journalctl -fu kubelet
"loadFlannelSubnetEnv failed: open /run/flannel/subnet.env: no such file or directory"
Attention:
You MUST disable swap if the kubelet is not properly configured to use swap. For example, sudo swapoff -a will disable swapping temporarily. To make this change persistent across reboots, make sure swap is disabled in config files like /etc/fstab or systemd.swap, depending on how it was configured on your system.
Note:
The control-plane node is the machine where the control plane components run, including etcd (the cluster database) and the API Server (which the kubectl command line tool communicates with).
3. deploy a pod network addon -- flannel
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
If you use a custom pod CIDR (not 10.244.0.0/16), first download the manifest above and change its network setting to match yours.
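For a custom pod CIDR, the manifest edit can be done with sed. This is only a sketch, assuming the manifest still contains the default network 10.244.0.0/16 in its net-conf.json section; the helper name set_flannel_network and the CIDR 10.10.0.0/16 are made up for illustration:

```shell
#!/usr/bin/env bash
# set_flannel_network is a hypothetical helper: it prints a copy of the flannel
# manifest with the default pod network replaced by the given CIDR.
# The input file is not modified.
set_flannel_network() {
  local manifest cidr
  manifest="$1"
  cidr="$2"
  # The manifest stores the pod network as "Network": "10.244.0.0/16".
  sed "s#10\.244\.0\.0/16#${cidr}#g" "$manifest"
}

# Example usage (10.10.0.0/16 is only an illustrative CIDR; it must match
# the value passed to kubeadm init --pod-network-cidr):
#   curl -fsSLo kube-flannel.yml https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
#   set_flannel_network kube-flannel.yml 10.10.0.0/16 > kube-flannel-custom.yml
#   kubectl apply -f kube-flannel-custom.yml
```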
After kubeadm init and deploying flannel:
(base) maye@maye-Inspiron-5547:/run/flannel$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-flannel kube-flannel-ds-gtzmm 1/1 Running 0 81s
kube-system coredns-66f779496c-rc7nc 1/1 Running 0 7m53s
kube-system coredns-66f779496c-zlc5c 1/1 Running 0 7m52s
kube-system etcd-maye-inspiron-5547 1/1 Running 6 8m8s
kube-system kube-apiserver-maye-inspiron-5547 1/1 Running 7 8m12s
kube-system kube-controller-manager-maye-inspiron-5547 1/1 Running 0 8m9s
kube-system kube-proxy-gfp8z 1/1 Running 0 7m53s
kube-system kube-scheduler-maye-inspiron-5547 1/1 Running 8 8m8s
(base) maye@maye-Inspiron-5547:/run/flannel$
Note:
- Flannel is an overlay network provider that can be used with Kubernetes.
- Flannel can be added to any existing Kubernetes cluster though it's simplest to add flannel before any pods using the pod network have been started.
4. Control plane node isolation
By default, your cluster will not schedule Pods on the control plane nodes for security reasons. If you want to be able to schedule Pods on the control plane nodes, for example for a single machine Kubernetes cluster, run:
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
The output will look something like:
node "test-01" untainted
...
This will remove the node-role.kubernetes.io/control-plane:NoSchedule taint from any nodes that have it, including the control plane nodes, meaning that the scheduler will then be able to schedule Pods everywhere.
5. Joining your nodes
The nodes are where your workloads (containers and Pods, etc) run. To add new nodes to your cluster do the following for each machine:
SSH to the machine
Become root (e.g. sudo su -)
Install and start a container runtime if the machine does not already have one.
Run the command that was output by kubeadm init. For example:
kubeadm join --token <token> <control-plane-host>:<control-plane-port> --discovery-token-ca-cert-hash sha256:<hash>
A few seconds later, you should notice this node in the output from kubectl get nodes when run on the control-plane node.
If you do not have the token, you can get it by running the following command on the control-plane node:
kubeadm token list
The output is similar to this:
TOKEN                    TTL  EXPIRES               USAGES                  DESCRIPTION                                               EXTRA GROUPS
8ewj1p.9r9hcjoqgajrj4gi  23h  2018-06-12T02:51:28Z  authentication,signing  The default bootstrap token generated by 'kubeadm init'.  system:bootstrappers:kubeadm:default-node-token
By default, tokens expire after 24 hours. If you are joining a node to the cluster after the current token has expired, you can create a new token by running the following command on the control-plane node:
kubeadm token create
The output is similar to this:
5didvk.d09sbcov8ph2amjw
If you don't have the value of --discovery-token-ca-cert-hash, you can get it by running the following command chain on the control-plane node:
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | \
openssl dgst -sha256 -hex | sed 's/^.* //'
The output is similar to:
8cb2de97839780a412b93877f8507ad6c94f73add17d5d7058e91741c9d5ec78
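Once the token and the hash are known, the join command can be put back together. The helper below is a hypothetical sketch that only formats the command string; the endpoint in the example is an assumed value. kubeadm can also do this in one step with kubeadm token create --print-join-command.

```shell
#!/usr/bin/env bash
# build_join_command is a hypothetical helper that assembles the join command
# from the control-plane endpoint, a token, and the CA certificate hash.
build_join_command() {
  local endpoint token hash
  endpoint="$1"
  token="$2"
  hash="$3"
  echo "kubeadm join ${endpoint} --token ${token} --discovery-token-ca-cert-hash sha256:${hash}"
}

# Example, run on the control-plane node (the endpoint is an assumed value):
#   token="$(kubeadm token create)"
#   hash="$(openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt \
#           | openssl rsa -pubin -outform der 2>/dev/null \
#           | openssl dgst -sha256 -hex | sed 's/^.* //')"
#   build_join_command 192.168.0.104:6443 "$token" "$hash"
#
# kubeadm can also print a ready-made join command with a fresh token:
#   kubeadm token create --print-join-command
```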
After joining one node to the cluster:
(base) maye@maye-Inspiron-5547:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
maye-inspiron-5547 Ready control-plane 141m v1.28.4
maye-laptop Ready <none> 6m31s v1.28.4
(base) maye@maye-Inspiron-5547:~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-flannel kube-flannel-ds-gtzmm 1/1 Running 0 138m
kube-flannel kube-flannel-ds-mvk6h 1/1 Running 0 10m
kube-system coredns-66f779496c-rc7nc 1/1 Running 0 144m
kube-system coredns-66f779496c-zlc5c 1/1 Running 0 144m
kube-system etcd-maye-inspiron-5547 1/1 Running 6 144m
kube-system kube-apiserver-maye-inspiron-5547 1/1 Running 7 144m
kube-system kube-controller-manager-maye-inspiron-5547 1/1 Running 0 144m
kube-system kube-proxy-gdvhc 1/1 Running 0 10m
kube-system kube-proxy-gfp8z 1/1 Running 0 144m
kube-system kube-scheduler-maye-inspiron-5547 1/1 Running 8 144m
Note:
- As the cluster nodes are usually initialized sequentially, the CoreDNS Pods are likely to all run on the first control-plane node. To provide higher availability, please rebalance the CoreDNS Pods with kubectl -n kube-system rollout restart deployment coredns after at least one new node is joined.
6. clean up
6.1 Drain each node that is not a control-plane node, that is, evict the pods running on it:
kubectl drain <node name> --delete-emptydir-data --force --ignore-daemonsets
6.2 reset the state installed by kubeadm on the node:
kubeadm reset
The reset process does not reset or clean up iptables rules or IPVS tables. If you wish to reset iptables, you must do so manually:
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
If you want to reset the IPVS tables, you must run the following command:
ipvsadm -C
6.3 Now remove the node:
kubectl delete node <node name>
6.4 Clean up the control plane
You can use kubeadm reset on the control plane host to trigger a best-effort clean up.
If you wish to start over, run kubeadm init or kubeadm join with the appropriate arguments.
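The order of the clean-up steps for one worker node can be summarized as a small script. This is only a sketch: print_cleanup_plan is a hypothetical helper that prints the commands from steps 6.1 to 6.3 for review and runs nothing itself.

```shell
#!/usr/bin/env bash
# print_cleanup_plan is a hypothetical helper: it only prints the clean-up
# commands for one worker node, in order, so they can be reviewed before running.
print_cleanup_plan() {
  local node
  node="$1"
  cat <<EOF
# on the control-plane node
kubectl drain ${node} --delete-emptydir-data --force --ignore-daemonsets
# on ${node}, as root
kubeadm reset
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
# on the control-plane node
kubectl delete node ${node}
EOF
}

# Example:
#   print_cleanup_plan maye-laptop
```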
7. Debug a pod
7.1 Check the details of the pod with `kubectl describe pod <pod-name> -n <namespace>`. Error messages are shown under "Events:" in the output.
(base) maye@maye-Inspiron-5547:~$ kubectl describe pod kube-flannel-ds-mvk6h -n kube-flannel
...
Events:
Type Reason Age From Message
Normal Scheduled 11m default-scheduler Successfully assigned kube-flannel/kube-flannel-ds-mvk6h to maye-laptop
Normal Pulled 11m kubelet Container image "docker.io/flannel/flannel-cni-plugin:v1.2.0" already present on machine
Normal Created 11m kubelet Created container install-cni-plugin
Normal Started 11m kubelet Started container install-cni-plugin
Normal Pulled 11m kubelet Container image "docker.io/flannel/flannel:v0.23.0" already present on machine
Normal Created 11m kubelet Created container install-cni
Normal Started 10m kubelet Started container install-cni
Normal Pulled 10m kubelet Container image "docker.io/flannel/flannel:v0.23.0" already present on machine
Normal Created 10m kubelet Created container kube-flannel
Normal Started 10m kubelet Started container kube-flannel
(base) maye@maye-Inspiron-5547:~$
7.2 check log of containers running in the pod:
kubectl logs <pod-name> -n <namespace>
7.3 Check the log of the kubelet, which manages pods and containers.
$ journalctl -fu kubelet
journalctl -- see the log of processes started by systemd
-f -- follow the log and show new entries as they are appended
-u -- select which service unit's log to show
or,
$ gedit /var/log/syslog
Press Ctrl+F to search for a keyword, such as kubelet.
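When the Events table is long, it can help to show only the failing entries. A minimal sketch, assuming the first column of each event line is its type as in the output of step 7.1; the filter name warnings_only is hypothetical:

```shell
#!/usr/bin/env bash
# warnings_only is a hypothetical filter: it keeps only the lines whose first
# column is "Warning", which is how failing events are marked in the Events table.
warnings_only() {
  awk '$1 == "Warning"'
}

# Example:
#   kubectl describe pod <pod-name> -n <namespace> | warnings_only
```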
8. Error & Solution
[ERROR: running with swap on is not supported.]
[SOLUTION]
$ sudo gedit /etc/fstab
comment out the line starting with '/swapfile'.
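The same fix can be made without an editor. This is a sketch only: comment_out_swap is a hypothetical helper that prints the file with active swap entries commented out, assuming they carry "swap" as a whitespace-separated field, and leaves the original untouched so the result can be reviewed first.

```shell
#!/usr/bin/env bash
# comment_out_swap is a hypothetical helper: it prints the given fstab-style
# file with every active swap entry commented out. It does not modify the file.
comment_out_swap() {
  local fstab
  fstab="$1"
  # Match uncommented lines that have "swap" as a whitespace-delimited field.
  sed -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$fstab"
}

# Example (review the result before replacing /etc/fstab):
#   sudo swapoff -a                               # turn swap off until the next reboot
#   comment_out_swap /etc/fstab > /tmp/fstab.new  # persistent change, for review
```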
[ERROR: failed to pull image registry.k8s.io/kube-apiserver:v1.28]
[SOLUTION]
This step downloads the images Kubernetes needs. The default repository, registry.k8s.io, is not accessible in mainland China, so use the Aliyun mirror, registry.aliyuncs.com/google_containers, instead:
$ kubeadm init --image-repository=registry.aliyuncs.com/google_containers
[ERROR kubelet not running: kubelet is not running or not healthy.]
[SOLUTION]
This error occurs when the kubelet and containerd use different cgroup drivers. Set both to systemd.
Set containerd's cgroup driver:
$ sudo gedit /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"
If /etc/containerd/config.toml does not exist, create it with:
$ sudo mkdir -p /etc/containerd
$ containerd config default | sudo tee /etc/containerd/config.toml
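The generated default config ships with SystemdCgroup = false, so the switch can also be made with sed. A minimal sketch; use_systemd_cgroup is a hypothetical helper that prints the changed config instead of editing the file in place:

```shell
#!/usr/bin/env bash
# use_systemd_cgroup is a hypothetical helper: it prints the given containerd
# config with the runc cgroup driver switched to systemd. The file is not modified.
use_systemd_cgroup() {
  local config
  config="$1"
  # Rewrite "SystemdCgroup = false" to "SystemdCgroup = true", keeping indentation.
  sed -E 's/^([[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*)false/\1true/' "$config"
}

# Example:
#   use_systemd_cgroup /etc/containerd/config.toml > /tmp/config.toml
#   sudo cp /tmp/config.toml /etc/containerd/config.toml
#   sudo systemctl restart containerd    # the new setting applies after a restart
```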
Set kubelet's cgroup driver:
# kubeadm-config.yaml
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kubernetesVersion: v1.21.0  # replace with the Kubernetes version that matches your installed kubeadm
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
kubeadm init --config kubeadm-config.yaml
[ERROR Port is in use: Port 10250 is in use.]
[SOLUTION]
Clean up the cluster and then initialize it again:
$ sudo systemctl stop kubelet.service
$ kubeadm reset
$ kubeadm init --image-repository=registry.aliyuncs.com/google_containers --pod-network-cidr=10.244.0.0/16
[ERROR File already exists: /etc/kubernetes/manifests/kube-apiserver.yaml already exists]
[SOLUTION]
$ rm -rf /etc/kubernetes/manifests/*
$ rm -rf /var/lib/etcd
[ERROR pod status is stuck at ContainerCreating: failed to pull image "registry.k8s.io/pause:3.8"]
$ kubectl get pods -n kube-system
kube-proxy-s729z 0/1 ContainerCreating
$ kubectl describe pod kube-proxy-s729z -n kube-system
Event:
failed to get sandbox image "registry.k8s.io/pause:3.8": failed to pull image "registry.k8s.io/pause:3.8"
[SOLUTION]
This error occurs when a node fails to pull the image from registry.k8s.io. Point the sandbox image at a repository that is accessible in China:
$ sudo gedit /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"
[ERROR: unable to create new content in namespace xxx because it is being terminated.]
kubectl apply -f xxx.yaml
unable to create new content in namespace xxx because it is being terminated.
[SOLUTION]
This error occurs when a namespace is deleted before the resources in it have been deleted via kubectl delete -f xxx.yaml. The solution is:
step 1:
$ kubectl get namespace <terminating-namespace> -o json > /home/maye/maye.json
step 2:
$ sudo gedit /home/maye/maye.json
Delete every "finalizers" field, key and value together. The field usually appears under "metadata", under "spec", or under both. For example, delete this entry:
# in file maye.json
"finalizers": ["finalizers.knative-serving.io/namespaces"]
step 3: replace the existing resource (the terminating namespace in this example) with the one defined in the file maye.json:
$ kubectl replace --raw "/api/v1/namespaces/<terminating-namespace>/finalize" -f /home/maye/maye.json
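The manual edit of the JSON file can also be scripted. This is only a sketch and it assumes jq is installed, which the steps above do not require; strip_finalizers is a hypothetical filter name:

```shell
#!/usr/bin/env bash
# strip_finalizers is a hypothetical filter that assumes jq is installed.
# It reads a namespace object as JSON on stdin and removes the "finalizers"
# fields under both "metadata" and "spec".
strip_finalizers() {
  jq -c 'del(.metadata.finalizers, .spec.finalizers)'
}

# Example, with NS set to the terminating namespace:
#   kubectl get namespace "$NS" -o json | strip_finalizers > /home/maye/maye.json
#   kubectl replace --raw "/api/v1/namespaces/$NS/finalize" -f /home/maye/maye.json
```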
[ERROR: pods of flannel not ready: open /run/flannel/subnet.env: no such file or directory]
$ journalctl -fu kubelet
"loadFlannelSubnetEnv failed: open /run/flannel/subnet.env: no such file or directory"
[SOLUTION]
step 1: clean up the cluster:
$ kubectl drain <node name> --delete-emptydir-data --force --ignore-daemonsets
$ kubeadm reset
step 2: init the cluster again with pod-network-cidr specified:
$ kubeadm init --image-repository=registry.aliyuncs.com/google_containers --pod-network-cidr=10.244.0.0/16
step 3: deploy flannel:
(base) maye@maye-Inspiron-5547:~$ kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
[ERROR: port 10250 is in use]
root@maye-laptop:/home/maye# kubeadm join --token ercj5r.8i8sspccfgpx1z3q 192.168.0.104:6443 --discovery-token-ca-cert-hash sha256:75a8426aebf9dc7b52c6b36bf281435aa0fe0064e94947154825eaa87ca11ab0
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists
[ERROR Port-10250]: Port 10250 is in use
[ERROR FileAvailable--etc-kubernetes-pki-ca.crt]: /etc/kubernetes/pki/ca.crt already exists
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
To see the stack trace of this error execute with --v=5 or higher
[SOLUTION]
# solution for [ERROR FileAvailable--etc-kubernetes-kubelet.conf] and [ERROR FileAvailable--etc-kubernetes-pki-ca.crt]:
# remove the leftover files
root@maye-laptop:/home/maye# rm /etc/kubernetes/kubelet.conf
root@maye-laptop:/home/maye# rm /etc/kubernetes/pki/ca.crt
# solution for [ERROR Port-10250]: Port 10250 is in use
# check which process is using the port:
root@maye-laptop:/home/maye# sudo lsof -i:10250
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
kubelet 24207 root 23u IPv6 451262 0t0 TCP *:10250 (LISTEN)
# reset the node
root@maye-laptop:/home/maye# kubeadm reset
# clean up custom iptables on the node
root@maye-laptop:/home/maye# iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
# join the cluster again
root@maye-laptop:/home/maye# kubeadm join --token ercj5r.8i8sspccfgpx1z3q 192.168.0.104:6443 --discovery-token-ca-cert-hash sha256:75a8426aebf9dc7b52c6b36bf281435aa0fe0064e94947154825eaa87ca11ab0
[ERROR ErrImagePull: dial tcp: lookup registry.aliyuncs.com on 127.0.0.53:53: read udp 127.0.0.1:33252->127.0.0.53:53: i/o timeout]
kubectl get pod --all-namespaces
kube-system kube-proxy-v955q 0/1 ErrImagePull
(base) maye@maye-Inspiron-5547:~$ kubectl describe pod kube-proxy-v955q -n kube-system
Name: kube-proxy-v955q
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: kube-proxy
Node: maye-laptop/192.168.0.102
Start Time: Mon, 08 Jan 2024 17:32:09 +0800
Labels: controller-revision-hash=85b545b64f
k8s-app=kube-proxy
pod-template-generation=1
Annotations:
Status: Running
IP: 192.168.0.102
IPs:
IP: 192.168.0.102
Controlled By: DaemonSet/kube-proxy
Containers:
kube-proxy:
Container ID: containerd://1195654b063b604a9ece797d141b185fc8714c74f7d628b06cbb5a83aca6af9e
Image: registry.aliyuncs.com/google_containers/kube-proxy:v1.26.12
Image ID: registry.aliyuncs.com/google_containers/kube-proxy@sha256:cf83e7ff3ae5565370b6b0e9cfa6233b27eb6113a484a0074146b1bbb0bd54e3
Port:
Host Port:
...
Events:
Type Reason Age From Message
Normal Scheduled 3m11s default-scheduler Successfully assigned kube-system/kube-proxy-v955q to maye-laptop
Warning Failed 2m55s kubelet Failed to pull image "registry.aliyuncs.com/google_containers/kube-proxy:v1.26.12": rpc error: code = Unknown desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-proxy:v1.26.12": failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://registry.aliyuncs.com/v2/google_containers/kube-proxy/manifests/sha256:cf83e7ff3ae5565370b6b0e9cfa6233b27eb6113a484a0074146b1bbb0bd54e3": dial tcp: lookup registry.aliyuncs.com on 127.0.0.53:53: read udp 127.0.0.1:45655->127.0.0.53:53: i/o timeout
Warning Failed 99s (x2 over 2m55s) kubelet Error: ErrImagePull
Warning Failed 99s kubelet Failed to pull image "registry.aliyuncs.com/google_containers/kube-proxy:v1.26.12": rpc error: code = Unknown desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-proxy:v1.26.12": failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://registry.aliyuncs.com/v2/google_containers/kube-proxy/manifests/sha256:cf83e7ff3ae5565370b6b0e9cfa6233b27eb6113a484a0074146b1bbb0bd54e3": dial tcp: lookup registry.aliyuncs.com on 127.0.0.53:53: read udp 127.0.0.1:33252->127.0.0.53:53: i/o timeout
Normal BackOff 84s (x2 over 2m55s) kubelet Back-off pulling image "registry.aliyuncs.com/google_containers/kube-proxy:v1.26.12"
Warning Failed 84s (x2 over 2m55s) kubelet Error: ImagePullBackOff
Normal Pulling 70s (x3 over 3m8s) kubelet Pulling image "registry.aliyuncs.com/google_containers/kube-proxy:v1.26.12"
Normal Pulled 46s kubelet Successfully pulled image "registry.aliyuncs.com/google_containers/kube-proxy:v1.26.12" in 23.881696233s (23.881733738s including waiting)
Normal Created 45s kubelet Created container kube-proxy
Normal Started 45s kubelet Started container kube-proxy
[SOLUTION]
Run ping registry.aliyuncs.com on the host. If it succeeds, the failure was caused by a temporary network problem and needs no action: the kubelet backs off and retries the pull until the image is pulled successfully, so just wait a while.
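Since the failure in the events is a DNS lookup timeout, the check on the host can be repeated until the name resolves. The helper below is a hypothetical sketch of such a retry loop with a fixed delay; it is not how the kubelet implements its own back-off.

```shell
#!/usr/bin/env bash
# retry is a hypothetical helper: it runs a command up to <attempts> times,
# sleeping <delay> seconds between failed attempts, and stops at the first success.
retry() {
  local attempts delay i
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: wait until the registry name resolves again.
#   retry 5 10 getent hosts registry.aliyuncs.com
```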
From: https://www.cnblogs.com/zhenxia-jiuyou/p/18012543