K3S 系列文章-5G IoT 网关设备 POD 访问报错 DNS 'i/o timeout'分析与解决

标签：网关 timeout -- 8.8 resolv dnsmasq conf 报错 DNS

开篇

《K3s 系列文章》
《Rancher 系列文章》

问题概述

20220606 5G IoT 网关设备同时安装 K3S Server, 但是 POD 却无法访问互联网地址，查看 CoreDNS 日志提示如下：

...
[ERROR] plugin/errors: 2 update.traefik.io. A: read udp 10.42.0.3:38545->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 update.traefik.io. AAAA: read udp 10.42.0.3:38990->8.8.8.8:53: i/o timeout
...

即 DNS 查询 forward 到了 8.8.8.8 这个 DNS 服务器，并且查询超时。

从而导致需要联网启动的 POD 无法正常启动，频繁 CrashLoopBackoff, 如下图：

需要联网启动的 POD CrashLoopBackoff

但是通过 Node 直接访问，却是可以正常访问的，如下图：

通过 Node 可以正常访问

环境信息

硬件：5G IoT 网关
网络：
1. 互联网访问：5G 网卡：就是一个 usb 网卡，需要通过拨号程序启动，程序会调用系统的 dhcp/dnsmasq/resolvconf 等
2. 内网访问：wlan 网卡
软件：K3S Server v1.21.7+k3s1, dnsmasq 等

分析

网络详细配置信息

一步一步检查分析：

看 /etc/resolv.conf, 发现配置是 127.0.0.1
netstat 查看本地 53 端口确实在运行
这种情况一般都是本地启动了 DNS 服务器或缓存，查看 dnsmasq 进程是否存在，确实存在
dnsmasq 用的 resolv.conf 配置是 /run/dnsmasq/resolv.conf

$ cat /etc/resolv.conf
# Generated by resolvconf
nameserver 127.0.0.1

$ netstat -anpl|grep 53
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      -
tcp6       0      0 :::53                   :::*                    LISTEN      -
udp        0      0 0.0.0.0:53              0.0.0.0:*                           -
udp6       0      0 :::53                   :::*                                -

$ ps -ef|grep dnsmasq
dnsmasq    912     1  0 6 月 06 ?       00:00:00 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -r /run/dnsmasq/resolv.conf -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=...

$ systemctl status dnsmasq.service
● dnsmasq.service - dnsmasq - A lightweight DHCP and caching DNS server
   Loaded: loaded (/lib/systemd/system/dnsmasq.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2022-06-06 17:21:31 CST; 16h ago
 Main PID: 912 (dnsmasq)
    Tasks: 1 (limit: 4242)
   Memory: 1.1M
   CGroup: /system.slice/dnsmasq.service
           └─912 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -r /run/dnsmasq/resolv.conf -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=...

6 月 06 17:21:31 orangebox-7eb3 dnsmasq[912]: started, version 2.80 cachesize 150
6 月 06 17:21:31 orangebox-7eb3 dnsmasq[912]: compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth DNSSEC loop-detect inotify dumpfile
6 月 06 17:21:31 orangebox-7eb3 dnsmasq-dhcp[912]: DHCP, IP range 192.168.51.100 -- 192.168.51.200, lease time 3d
6 月 06 17:21:31 orangebox-7eb3 dnsmasq[912]: read /etc/hosts - 8 addresses
6 月 06 17:21:31 orangebox-7eb3 dnsmasq[912]: no servers found in /run/dnsmasq/resolv.conf, will retry
6 月 06 17:21:31 orangebox-7eb3 dnsmasq[928]: Too few arguments.
6 月 06 17:21:31 orangebox-7eb3 systemd[1]: Started dnsmasq - A lightweight DHCP and caching DNS server.
6 月 06 17:22:18 orangebox-7eb3 dnsmasq[912]: reading /run/dnsmasq/resolv.conf
6 月 06 17:22:18 orangebox-7eb3 dnsmasq[912]: using nameserver 222.66.251.8#53
6 月 06 17:22:18 orangebox-7eb3 dnsmasq[912]: using nameserver 116.236.159.8#53

$ cat /run/dnsmasq/resolv.conf
# Generated by resolvconf
nameserver 222.66.251.8
nameserver 116.236.159.8

CoreDNS 分析

这里很奇怪，就是没发现哪儿有配置 DNS 8.8.8.8, 但是日志中却显示指向了这个 DNS:

[ERROR] plugin/errors: 2 update.traefik.io. A: read udp 10.42.0.3:38545->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 update.traefik.io. AAAA: read udp 10.42.0.3:38990->8.8.8.8:53: i/o timeout

先查看一下 CoreDNS 的配置：(K3S 的 CoreDNS 是通过 manifests 启动的，位于：/var/lib/rancher/k3s/server/manifests/coredns.yaml 下）

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        hosts /etc/coredns/NodeHosts {
          ttl 60
          reload 15s
          fallthrough
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }

这里主要有 2 个配置需要关注：

forward . /etc/resolv.conf
loop

CoreDNS 问题常用分析流程

检查 DNS Pod 是否正常运行 - 结果：是的；

# kubectl -n kube-system get pods -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-7448499f4d-pbxk6   1/1     Running   1          15h

检查 DNS 服务是否存在正确的 cluster-ip - 结果：是的：

# kubectl -n kube-system get svc -l k8s-app=kube-dns
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.43.0.10   <none>        53/UDP,53/TCP,9153/TCP   15h

检查是否能解析域名：
先是内部域名 - 结果：无法解析：

$ kubectl run -it --rm --restart=Never busybox --image=busybox -- nslookup kubernetes.default
Server:         10.43.0.10
Address:        10.43.0.10:53

;; connection timed out; no servers could be reached

再试外部域名 - 结果：无法解析：

$ kubectl run -it --rm --restart=Never busybox --image=busybox -- nslookup www.baidu.com
Server:         10.43.0.10
Address:        10.43.0.10:53

;; connection timed out; no servers could be reached

检查 resolv.conf 中的 nameserver 配置 - 结果：确实是 8.8.8.8

$ kubectl run -i --restart=Never --rm test-${RANDOM} --image=ubuntu --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"dnsPolicy":"Default"}}' -- sh -c 'cat /etc/resolv.conf'
nameserver 8.8.8.8
pod "test-7517" deleted

综上：
应该是 POD 内的 /etc/resolv.conf 被配置为 nameserver 8.8.8.8 导致了此次问题。
但是整个 Node OS 级别并没有配置 nameserver 8.8.8.8, 所以怀疑是：Kubernetes、Kubelet、CoreDNS 或 CRI 层面有这样的机制：在 DNS 配置异常时，自动配置其为 nameserver 8.8.8.8。

那么，要解决问题，还是要找到 DNS 配置异常。

容器网络 DNS 服务

我这里暂时没有查到 Kubernetes、Kubelet、CoreDNS 或 CRI 的相关 DNS 的具体证据，K3S 的 CRI 是 containerd，但是我在 Docker 的官方文档看到了这样的描述：

标签：网关,timeout,--,8.8,resolv,dnsmasq,conf,报错,DNS
From： https://www.cnblogs.com/east4ming/p/17131921.html