intro
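Kubernetes is a distributed virtualization system, so networking matters a great deal in it. How pods on different nodes communicate over the network is a basic and important problem that has to be solved. The Networking and Network Policy page of the k8s documentation lists the commonly used networking options. That list is evidently sorted alphabetically (not by how often each option is used). It includes flannel, which is also one of the k8s network models seen most often in practice.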
Flannel is an overlay network provider that can be used with Kubernetes.
The flannel home page describes how flannel works and names VXLAN as one of its main backend technologies:
Flannel runs a small, single binary agent called flanneld on each host, and is responsible for allocating a subnet lease to each host out of a larger, preconfigured address space. Flannel uses either the Kubernetes API or etcd directly to store the network configuration, the allocated subnets, and any auxiliary data (such as the host's public IP). Packets are forwarded using one of several backend mechanisms including VXLAN and various cloud integrations.
I had not come across vxlan before, so the natural first step is to check the packet format of this protocol: VXLAN packet format.
- 8-byte VXLAN header: VXLAN information for the frame.
  - Flags: If the I bit is 1, the VXLAN ID is valid. If the I bit is 0, the VXLAN ID is invalid. All other bits are reserved and set to 0.
  - 24-bit VXLAN ID: Identifies the VXLAN of the frame. It is also called the virtual network identifier (VNI).
- 8-byte outer UDP header for VXLAN: The default VXLAN destination UDP port number is 4789.
- 20-byte outer IP header: Valid addresses of VTEPs or VXLAN multicast groups on the transport network. Devices in the transport network forward VXLAN packets based on the outer IP header.
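To make the header layout above concrete, here is a minimal sketch that reads the I flag and the 24-bit VNI out of an 8-byte VXLAN header. The helper name vxlan_parse_vni and the two macros are made up for illustration; the byte offsets follow the field sizes listed above and the RFC 7348 layout (1 flags byte, 3 reserved bytes, 3 VNI bytes, 1 reserved byte). It is not the kernel's parser.

```c
#include <stddef.h>
#include <stdint.h>

#define VXLAN_HDR_LEN 8     /* 8-byte VXLAN header */
#define VXLAN_FLAG_I  0x08  /* "I" bit in the first byte: the VNI is valid */

/* Hypothetical helper: returns 0 and stores the 24-bit VNI on success,
 * -1 if the buffer is too short or the I bit is not set. */
static int vxlan_parse_vni(const uint8_t *buf, size_t len, uint32_t *vni)
{
    if (len < VXLAN_HDR_LEN)
        return -1;
    if (!(buf[0] & VXLAN_FLAG_I))
        return -1;
    /* Bytes 4..6 carry the VNI in network byte order; byte 7 is reserved. */
    *vni = ((uint32_t)buf[4] << 16) | ((uint32_t)buf[5] << 8) | buf[6];
    return 0;
}
```

For the header bytes 08 00 00 00 00 00 01 00 the helper reports VNI 1, which is the `vxlan id 1` that flannel.1 uses in the verification output later in this post.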
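One thing about this packet format is rather "peculiar": the tunnel data sits directly inside the UDP datagram, and there is no type-like field that says this is a vxlan packet. Searching online shows that the protocol does not mark a vxlan packet with a specific value in the packet itself. Instead, any data sent to a specific UDP port is treated as a vxlan packet.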
The destination UDP port in the outer UDP header is specified in the VXLAN specification (Port 4789). This means it is a well-known service. So an UDP packet that arrives on Port 4789 is expected to be a VXLAN packet¹ in the same way that a TCP packet that arrives on Port 80 is expected to be a HTTP packet¹.
The draft you linked to is outdated and is missing this port number (although it mentions that the port number is to be obtained from IANA).
¹) When I talk about VXLAN/HTTP packets I mean of course the respective UDP/TCP packets with VXLAN/HTTP header/protocol inside.
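The point of the quoted answer can be sketched in a few lines: the only thing that classifies a datagram as VXLAN is the destination port in the outer UDP header. The function names below are hypothetical, and the configured port is passed in as a parameter because it is a deployment choice (4789 is the IANA value quoted above).

```c
#include <stddef.h>
#include <stdint.h>

#define VXLAN_PORT_IANA 4789  /* IANA-assigned port quoted above */

/* Read the destination port from an 8-byte UDP header
 * (source port, destination port, length, checksum; all big-endian).
 * Returns -1 if the buffer is shorter than a UDP header. */
static int udp_dest_port(const uint8_t *udp, size_t len)
{
    if (len < 8)
        return -1;
    return (udp[2] << 8) | udp[3];
}

/* A datagram is treated as VXLAN purely because of where it was sent. */
static int looks_like_vxlan(const uint8_t *udp, size_t len, uint16_t vxlan_port)
{
    return udp_dest_port(udp, len) == (int)vxlan_port;
}
```

Nothing in the payload is inspected at this stage; a receiver configured for a different port would simply not recognise the same bytes as VXLAN.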
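It looks as if this UDP port is system-level: the kernel has to know about the port, and so there must be a corresponding socket instance. As we know, sockets are normally created by user-space processes, and when a packet arrives at a UDP socket, the process listening on that socket has to be woken up.
So the question finally arises: if this socket is created by the kernel, which process should be woken up to handle a packet that arrives on it?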
UDP socket
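In the kernel, when data is received on a udp socket, the code first checks whether the socket has its encapsulation field (encap_type) set. If it does, the packet no longer goes through the regular logic of socket receive and process wake-up; the registered receive function (encap_rcv) is called instead.
Call chain: udp_rcv >> __udp4_lib_rcv >> udp_unicast_rcv_skb >> udp_queue_rcv_skb >> udp_queue_rcv_one_skb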
/* returns:
 *  -1: error
 *   0: success
 *  >0: "udp encap" protocol resubmission
 *
 * Note that in the success and error cases, the skb is assumed to
 * have either been requeued or freed.
 */
static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
{
    int drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
    struct udp_sock *up = udp_sk(sk);
    int is_udplite = IS_UDPLITE(sk);

    /*
     *  Charge it to the socket, dropping if the queue is full.
     */
    if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb)) {
        drop_reason = SKB_DROP_REASON_XFRM_POLICY;
        goto drop;
    }
    nf_reset_ct(skb);

    if (static_branch_unlikely(&udp_encap_needed_key) &&
        READ_ONCE(up->encap_type)) {
        int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);

        /*
         * This is an encapsulation socket so pass the skb to
         * the socket's udp_encap_rcv() hook. Otherwise, just
         * fall through and pass this up the UDP socket.
         * up->encap_rcv() returns the following value:
         * =0 if skb was successfully passed to the encap
         *    handler or was discarded by it.
         * >0 if skb should be passed on to UDP.
         * <0 if skb should be resubmitted as proto -N
         */

        /* if we're overly short, let UDP handle it */
        encap_rcv = READ_ONCE(up->encap_rcv);
        if (encap_rcv) {
            int ret;

            /* Verify checksum before giving to encap */
            if (udp_lib_checksum_complete(skb))
                goto csum_error;

            ret = encap_rcv(sk, skb);
            if (ret <= 0) {
                __UDP_INC_STATS(sock_net(sk),
                                UDP_MIB_INDATAGRAMS,
                                is_udplite);
                return -ret;
            }
        }

        /* FALLTHROUGH -- it's a UDP Packet */
    }
Correspondingly, the udp socket structure defines function pointers that are specific to encapsulation sockets.
struct udp_sock {
    ///...
    /*
     * For encapsulation sockets.
     */
    int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
    void (*encap_err_rcv)(struct sock *sk, struct sk_buff *skb, int err,
                          __be16 port, u32 info, u8 *payload);
    int (*encap_err_lookup)(struct sock *sk, struct sk_buff *skb);
    void (*encap_destroy)(struct sock *sk);
    ///...
};
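The dispatch described in this section can be modelled in user space with a plain function pointer. The sketch below is only an illustration under simplifying assumptions: model_sock, pkt, model_udp_deliver and the two sample hooks are invented names, and a counter stands in for "enqueue the packet and wake the reading process". It keeps the return convention from the kernel comment (a result <= 0 means the hook consumed the packet, > 0 means "pass it on to UDP").

```c
#include <stddef.h>

struct pkt { int len; };

struct model_sock {
    int encap_type;  /* non-zero: this is an encapsulation socket */
    int (*encap_rcv)(struct model_sock *sk, struct pkt *p);
    int queued;      /* packets left on the queue for a user process */
};

/* Deliver one packet: try the encapsulation hook first, and only fall
 * back to the ordinary receive queue if the hook declines the packet. */
static int model_udp_deliver(struct model_sock *sk, struct pkt *p)
{
    if (sk->encap_type && sk->encap_rcv) {
        int ret = sk->encap_rcv(sk, p);
        if (ret <= 0)
            return -ret;  /* consumed by the hook, no process involved */
    }
    sk->queued++;  /* normal path: enqueue, then wake the reader */
    return 0;
}

/* Sample hooks: one consumes every packet, one always declines. */
static int consume_all(struct model_sock *sk, struct pkt *p)
{
    (void)sk; (void)p;
    return 0;
}

static int pass_through(struct model_sock *sk, struct pkt *p)
{
    (void)sk; (void)p;
    return 1;
}
```

With consume_all installed, the queue counter never moves, which is the model's version of "no process needs to be woken up".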
vxlan
Registration
When vxlan starts up, it registers vxlan_rcv as the encap_rcv function on the socket it creates.
/* Create new listen socket if needed */
static struct vxlan_sock *vxlan_socket_create(struct net *net, bool ipv6,
                                              __be16 port, u32 flags,
                                              int ifindex)
{
    ///...
    /* Mark socket as an encapsulation socket. */
    memset(&tunnel_cfg, 0, sizeof(tunnel_cfg));
    tunnel_cfg.sk_user_data = vs;
    tunnel_cfg.encap_type = 1;
    tunnel_cfg.encap_rcv = vxlan_rcv;
    tunnel_cfg.encap_err_lookup = vxlan_err_lookup;
    tunnel_cfg.encap_destroy = NULL;
    ///...
}
Callback
The body of the corresponding vxlan_rcv function decapsulates the encapsulated packet and then calls gro_cells_receive.
/* Callback from net/ipv4/udp.c to receive packets */
static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
{
    struct vxlan_vni_node *vninode = NULL;
    struct vxlan_dev *vxlan;
    struct vxlan_sock *vs;
    struct vxlanhdr unparsed;
    struct vxlan_metadata _md;
    struct vxlan_metadata *md = &_md;
    __be16 protocol = htons(ETH_P_TEB);
    bool raw_proto = false;
    void *oiph;
    __be32 vni = 0;
    int nh;

    /* Need UDP and VXLAN header to be present */
    if (!pskb_may_pull(skb, VXLAN_HLEN))
        goto drop;

    unparsed = *vxlan_hdr(skb);
    /* VNI flag always required to be set */
    if (!(unparsed.vx_flags & VXLAN_HF_VNI)) {
        netdev_dbg(skb->dev, "invalid vxlan flags=%#x vni=%#x\n",
                   ntohl(vxlan_hdr(skb)->vx_flags),
                   ntohl(vxlan_hdr(skb)->vx_vni));
        /* Return non vxlan pkt */
        goto drop;
    }
    unparsed.vx_flags &= ~VXLAN_HF_VNI;
    unparsed.vx_vni &= ~VXLAN_VNI_MASK;

    vs = rcu_dereference_sk_user_data(sk);
    if (!vs)
        goto drop;

    vni = vxlan_vni(vxlan_hdr(skb)->vx_vni);

    vxlan = vxlan_vs_find_vni(vs, skb->dev->ifindex, vni, &vninode);
    if (!vxlan)
        goto drop;

    /* For backwards compatibility, only allow reserved fields to be
     * used by VXLAN extensions if explicitly requested.
     */
    if (vs->flags & VXLAN_F_GPE) {
        if (!vxlan_parse_gpe_proto(&unparsed, &protocol))
            goto drop;
        unparsed.vx_flags &= ~VXLAN_GPE_USED_BITS;
        raw_proto = true;
    }

    if (__iptunnel_pull_header(skb, VXLAN_HLEN, protocol, raw_proto,
                               !net_eq(vxlan->net, dev_net(vxlan->dev))))
        goto drop;

    if (vs->flags & VXLAN_F_REMCSUM_RX)
        if (unlikely(!vxlan_remcsum(&unparsed, skb, vs->flags)))
            goto drop;

    if (vxlan_collect_metadata(vs)) {
        IP_TUNNEL_DECLARE_FLAGS(flags) = { };
        struct metadata_dst *tun_dst;

        __set_bit(IP_TUNNEL_KEY_BIT, flags);
        tun_dst = udp_tun_rx_dst(skb, vxlan_get_sk_family(vs), flags,
                                 key32_to_tunnel_id(vni), sizeof(*md));

        if (!tun_dst)
            goto drop;

        md = ip_tunnel_info_opts(&tun_dst->u.tun_info);

        skb_dst_set(skb, (struct dst_entry *)tun_dst);
    } else {
        memset(md, 0, sizeof(*md));
    }

    if (vs->flags & VXLAN_F_GBP)
        vxlan_parse_gbp_hdr(&unparsed, skb, vs->flags, md);
    /* Note that GBP and GPE can never be active together. This is
     * ensured in vxlan_dev_configure.
     */

    if (unparsed.vx_flags || unparsed.vx_vni) {
        /* If there are any unprocessed flags remaining treat
         * this as a malformed packet. This behavior diverges from
         * VXLAN RFC (RFC7348) which stipulates that bits in reserved
         * in reserved fields are to be ignored. The approach here
         * maintains compatibility with previous stack code, and also
         * is more robust and provides a little more security in
         * adding extensions to VXLAN.
         */
        goto drop;
    }

    if (!raw_proto) {
        if (!vxlan_set_mac(vxlan, vs, skb, vni))
            goto drop;
    } else {
        skb_reset_mac_header(skb);
        skb->dev = vxlan->dev;
        skb->pkt_type = PACKET_HOST;
    }

    /* Save offset of outer header relative to skb->head,
     * because we are going to reset the network header to the inner header
     * and might change skb->head.
     */
    nh = skb_network_header(skb) - skb->head;

    skb_reset_network_header(skb);

    if (!pskb_inet_may_pull(skb)) {
        DEV_STATS_INC(vxlan->dev, rx_length_errors);
        DEV_STATS_INC(vxlan->dev, rx_errors);
        vxlan_vnifilter_count(vxlan, vni, vninode,
                              VXLAN_VNI_STATS_RX_ERRORS, 0);
        goto drop;
    }

    /* Get the outer header. */
    oiph = skb->head + nh;

    if (!vxlan_ecn_decapsulate(vs, oiph, skb)) {
        DEV_STATS_INC(vxlan->dev, rx_frame_errors);
        DEV_STATS_INC(vxlan->dev, rx_errors);
        vxlan_vnifilter_count(vxlan, vni, vninode,
                              VXLAN_VNI_STATS_RX_ERRORS, 0);
        goto drop;
    }

    rcu_read_lock();

    if (unlikely(!(vxlan->dev->flags & IFF_UP))) {
        rcu_read_unlock();
        dev_core_stats_rx_dropped_inc(vxlan->dev);
        vxlan_vnifilter_count(vxlan, vni, vninode,
                              VXLAN_VNI_STATS_RX_DROPS, 0);
        goto drop;
    }

    dev_sw_netstats_rx_add(vxlan->dev, skb->len);
    vxlan_vnifilter_count(vxlan, vni, vninode, VXLAN_VNI_STATS_RX, skb->len);
    gro_cells_receive(&vxlan->gro_cells, skb);

    rcu_read_unlock();

    return 0;

drop:
    /* Consume bad packet */
    kfree_skb(skb);
    return 0;
}
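One technique in vxlan_rcv is worth isolating: it keeps a copy of the header called unparsed, clears every bit it has understood, and drops the packet if anything is left over. A minimal sketch of that idea follows. It assumes both 32-bit header words have already been converted to host byte order, and HF_VNI, VNI_MASK and header_ok are illustrative names rather than the kernel's definitions (the kernel works on big-endian fields directly and also handles the GPE and GBP extensions, which are left out here).

```c
#include <stdint.h>

#define HF_VNI   0x08000000u  /* I bit within the first 32-bit word (host order) */
#define VNI_MASK 0xffffff00u  /* VNI occupies the upper 24 bits of the second word */

/* Returns 1 and stores the VNI if the header is acceptable, 0 if the
 * packet must be dropped. Every understood bit is cleared from the
 * "unparsed" copies; any bit still set afterwards is a reserved bit. */
static int header_ok(uint32_t flags_word, uint32_t vni_word, uint32_t *vni)
{
    uint32_t unparsed_flags = flags_word;
    uint32_t unparsed_vni = vni_word;

    if (!(unparsed_flags & HF_VNI))
        return 0;  /* VNI flag is always required */
    unparsed_flags &= ~HF_VNI;
    unparsed_vni &= ~VNI_MASK;

    if (unparsed_flags || unparsed_vni)
        return 0;  /* leftover reserved bits: treat as malformed */

    *vni = vni_word >> 8;
    return 1;
}
```

As the kernel comment notes, this is stricter than RFC 7348, which says reserved bits should be ignored on receipt.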
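The main logic of gro_cells_receive appends the packet to the tail of a napi_skbs queue (__skb_queue_tail(&cell->napi_skbs, skb)) and, if the queue was empty before the append (its length is now 1), schedules napi processing (napi_schedule(&cell->napi)).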
struct gro_cell {
    struct sk_buff_head napi_skbs;
    struct napi_struct napi;
};

int gro_cells_receive(struct gro_cells *gcells, struct sk_buff *skb)
{
    struct net_device *dev = skb->dev;
    struct gro_cell *cell;
    int res;

    rcu_read_lock();
    if (unlikely(!(dev->flags & IFF_UP)))
        goto drop;

    if (!gcells->cells || skb_cloned(skb) || netif_elide_gro(dev)) {
        res = netif_rx(skb);
        goto unlock;
    }

    cell = this_cpu_ptr(gcells->cells);

    if (skb_queue_len(&cell->napi_skbs) > READ_ONCE(net_hotdata.max_backlog)) {
drop:
        dev_core_stats_rx_dropped_inc(dev);
        kfree_skb(skb);
        res = NET_RX_DROP;
        goto unlock;
    }

    __skb_queue_tail(&cell->napi_skbs, skb);
    if (skb_queue_len(&cell->napi_skbs) == 1)
        napi_schedule(&cell->napi);

    res = NET_RX_SUCCESS;

unlock:
    rcu_read_unlock();
    return res;
}
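The queueing pattern in gro_cells_receive (bounded backlog, append at the tail, kick the poller only on the empty-to-non-empty transition) can be sketched without any kernel types. Everything below is a simplified model: model_cell, cell_receive, cell_poll and MAX_BACKLOG are invented for illustration, packets are plain integers, and a counter replaces napi_schedule.

```c
#include <stddef.h>

#define MAX_BACKLOG 4  /* stand-in for net_hotdata.max_backlog */

enum { RX_SUCCESS = 0, RX_DROP = 1 };

struct model_cell {
    int items[MAX_BACKLOG + 1];  /* the "> max_backlog" test admits one extra */
    size_t len;
    int schedules;               /* how many times the poller was kicked */
};

/* Append a packet, dropping it if the backlog is already over the limit. */
static int cell_receive(struct model_cell *c, int pkt)
{
    if (c->len > MAX_BACKLOG)
        return RX_DROP;
    c->items[c->len++] = pkt;
    if (c->len == 1)
        c->schedules++;  /* queue went from empty to non-empty */
    return RX_SUCCESS;
}

/* The poller drains everything that has accumulated; returns the count. */
static size_t cell_poll(struct model_cell *c)
{
    size_t n = c->len;
    c->len = 0;
    return n;
}
```

Scheduling only when the length becomes 1 means a burst of packets costs a single wake-up of the poller; later packets in the burst just join the queue that is already due to be processed.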
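The kernel's napi mechanism is not analysed any further here; if you are interested, see the kernel documentation or the linuxfoundation documentation.
All we need to know here is that once the packet enters the napi path, it follows the same flow as a packet received from a physical network device.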
flanneld
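When local data reaches the flannel device, the user-space flanneld uses the mapping it has learned from k8s between node IPs and cluster IPs to set the outer (node) IP used by the IP tunnel, which finally connects the virtual network to the real (node) network.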
///@file: udp_network_amd64.go
func (n *network) Run(ctx context.Context) {
    defer func() {
        n.tun.Close()
        n.conn.Close()
        n.ctl.Close()
        n.ctl2.Close()
    }()

    // one for each goroutine below
    wg := sync.WaitGroup{}
    defer wg.Wait()

    wg.Add(1)
    go func() {
        runCProxy(n.tun, n.conn, n.ctl2, n.tunNet.IP, n.MTU())
        wg.Done()
    }()
The next-hop address (that is, the node's IP address) is looked up in the local route information.
///@file: proxy_amd64.c
static struct sockaddr_in *find_route(in_addr_t dst) {
    size_t i;

    for( i = 0; i < routes_cnt; i++ ) {
        if( contains(routes[i].dst, dst) ) {
            // packets for same dest tend to come in bursts. swap to front make it faster for subsequent ones
            if( i != 0 ) {
                struct route_entry tmp = routes[i];
                routes[i] = routes[0];
                routes[0] = tmp;
            }

            return &routes[0].next_hop;
        }
    }

    return NULL;
}
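find_route relies on a contains() helper that is not shown in the excerpt. The sketch below is one plausible shape for such a subnet test, not flannel's actual definition: model_net, prefix_mask, net_contains and ipv4 are hypothetical names, and addresses are kept in host byte order for readability (prefix lengths are assumed to be at most 32).

```c
#include <stdint.h>

struct model_net {
    uint32_t ip;    /* network address, host byte order */
    uint32_t mask;  /* netmask, host byte order */
};

/* Build a netmask from a prefix length (0..32). */
static uint32_t prefix_mask(unsigned prefix_len)
{
    /* Shifting a 32-bit value by 32 is undefined, so handle /0 separately. */
    return prefix_len == 0 ? 0 : 0xffffffffu << (32 - prefix_len);
}

/* An address belongs to a subnet if the masked bits match. */
static int net_contains(struct model_net n, uint32_t addr)
{
    return (addr & n.mask) == (n.ip & n.mask);
}

/* Convenience constructor for dotted-quad literals. */
static uint32_t ipv4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}
```

With the subnet 10.244.1.0/24, an address such as 10.244.1.17 matches while 10.244.2.17 does not, so a lookup like find_route would return the next hop recorded for the first subnet only.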
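Verification
Listing the open udp ports on a k8s node shows that port 8472, which vxlan listens on by default under linux, has no owning process (because the socket was created by the kernel).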
tsecer@harry: sudo netstat -ulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp 0 0 0.0.0.0:8472 0.0.0.0:* -
udp 0 0 127.0.0.54:53 0.0.0.0:* 906/systemd-resolve
udp 0 0 127.0.0.53:53 0.0.0.0:* 906/systemd-resolve
udp 0 0 0.0.0.0:36993 0.0.0.0:* 907/systemd-timesyn
tsecer@harry:
- master node
tsecer@harry: ip route
default via 172.16.0.1 dev eth0
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
172.16.0.0/16 dev eth0 proto kernel scope link src 172.16.0.2
tsecer@harry: ifconfig
cni0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.244.0.1 netmask 255.255.255.0 broadcast 10.244.0.255
inet6 fe80::1ccb:adff:fed6:37e8 prefixlen 64 scopeid 0x20<link>
ether 1e:cb:ad:d6:37:e8 txqueuelen 1000 (Ethernet)
RX packets 545 bytes 47161 (47.1 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 550 bytes 78600 (78.6 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.0.2 netmask 255.255.0.0 broadcast 172.16.255.255
inet6 fe80::5815:f0ff:fe6b:7402 prefixlen 64 scopeid 0x20<link>
ether 76:52:14:85:92:e5 txqueuelen 1000 (Ethernet)
RX packets 11738 bytes 39007251 (39.0 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8660 bytes 1873516 (1.8 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.244.0.0 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::c83:dfff:fe06:e796 prefixlen 64 scopeid 0x20<link>
ether 0e:83:df:06:e7:96 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 12 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 64416 bytes 19403001 (19.4 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 64416 bytes 19403001 (19.4 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth32294f1c: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet6 fe80::286d:e5ff:fe73:d371 prefixlen 64 scopeid 0x20<link>
ether 2a:6d:e5:73:d3:71 txqueuelen 1000 (Ethernet)
RX packets 270 bytes 27128 (27.1 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 296 bytes 39663 (39.6 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth4f50c6e6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet6 fe80::601a:ebff:fe60:806b prefixlen 64 scopeid 0x20<link>
ether 62:1a:eb:60:80:6b txqueuelen 1000 (Ethernet)
RX packets 277 bytes 27747 (27.7 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 304 bytes 42785 (42.7 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
tsecer@harry: ip -d link show flannel.1
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether 0e:83:df:06:e7:96 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
vxlan id 1 local 172.16.0.2 dev eth0 srcport 0 0 dstport 8472 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
tsecer@harry: bridge fdb show
33:33:00:00:00:01 dev bond0 self permanent
33:33:00:00:00:01 dev dummy0 self permanent
33:33:00:00:00:01 dev eth0 self permanent
01:00:5e:00:00:01 dev eth0 self permanent
33:33:ff:6b:74:02 dev eth0 self permanent
e2:4c:d4:4e:7e:63 dev flannel.1 dst 172.16.0.3 self permanent
3a:ae:e8:eb:08:4d dev flannel.1 dst 172.16.0.4 self permanent
33:33:00:00:00:01 dev cni0 self permanent
01:00:5e:00:00:6a dev cni0 self permanent
33:33:00:00:00:6a dev cni0 self permanent
01:00:5e:00:00:01 dev cni0 self permanent
33:33:ff:d6:37:e8 dev cni0 self permanent
1e:cb:ad:d6:37:e8 dev cni0 vlan 1 master cni0 permanent
1e:cb:ad:d6:37:e8 dev cni0 master cni0 permanent
72:e5:28:b2:78:a7 dev veth4f50c6e6 master cni0
62:1a:eb:60:80:6b dev veth4f50c6e6 vlan 1 master cni0 permanent
62:1a:eb:60:80:6b dev veth4f50c6e6 master cni0 permanent
33:33:00:00:00:01 dev veth4f50c6e6 self permanent
01:00:5e:00:00:01 dev veth4f50c6e6 self permanent
33:33:ff:60:80:6b dev veth4f50c6e6 self permanent
22:c5:ff:88:71:85 dev veth32294f1c master cni0
2a:6d:e5:73:d3:71 dev veth32294f1c vlan 1 master cni0 permanent
2a:6d:e5:73:d3:71 dev veth32294f1c master cni0 permanent
33:33:00:00:00:01 dev veth32294f1c self permanent
01:00:5e:00:00:01 dev veth32294f1c self permanent
33:33:ff:73:d3:71 dev veth32294f1c self permanent
tsecer@harry:
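The two flannel.1 lines in the `bridge fdb show` output above map the MAC address of a remote flannel.1 device to the IP of the node that hosts it. A minimal sketch of that lookup, with the table filled in from the master node output, might look as follows. The struct and function names are invented, and in the real system the table lives in the kernel's vxlan forwarding database rather than in a static array.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct fdb_entry {
    uint8_t mac[6];       /* MAC of the remote flannel.1 device */
    const char *vtep_ip;  /* node IP used as the outer destination */
};

/* Entries copied from the master node's "bridge fdb show" output. */
static const struct fdb_entry fdb[] = {
    { {0xe2, 0x4c, 0xd4, 0x4e, 0x7e, 0x63}, "172.16.0.3" },
    { {0x3a, 0xae, 0xe8, 0xeb, 0x08, 0x4d}, "172.16.0.4" },
};

/* Return the outer destination IP for an inner destination MAC,
 * or NULL if the MAC is unknown. */
static const char *fdb_lookup(const uint8_t mac[6])
{
    for (size_t i = 0; i < sizeof(fdb) / sizeof(fdb[0]); i++)
        if (memcmp(fdb[i].mac, mac, 6) == 0)
            return fdb[i].vtep_ip;
    return NULL;
}
```

The MAC e2:4c:d4:4e:7e:63 resolves to 172.16.0.3, which matches node1's eth0 address and flannel.1 MAC in the node1 output.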
- node1
laborant@node-01:~$ PS1="tsecer@node1: "
tsecer@node1: ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.0.3 netmask 255.255.0.0 broadcast 172.16.255.255
inet6 fe80::d8d1:bcff:fe7c:ea17 prefixlen 64 scopeid 0x20<link>
ether 46:4f:70:5f:9a:23 txqueuelen 1000 (Ethernet)
RX packets 8623 bytes 38842461 (38.8 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 5142 bytes 610762 (610.7 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.244.1.0 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::e04c:d4ff:fe4e:7e63 prefixlen 64 scopeid 0x20<link>
ether e2:4c:d4:4e:7e:63 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 12 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 12 bytes 890 (890.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 12 bytes 890 (890.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
tsecer@node1: ip route
default via 172.16.0.1 dev eth0
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
172.16.0.0/16 dev eth0 proto kernel scope link src 172.16.0.3
tsecer@node1: ip -d link show flannel.1
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether e2:4c:d4:4e:7e:63 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
vxlan id 1 local 172.16.0.3 dev eth0 srcport 0 0 dstport 8472 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
tsecer@node1: bridge fdb show
33:33:00:00:00:01 dev bond0 self permanent
33:33:00:00:00:01 dev dummy0 self permanent
33:33:00:00:00:01 dev eth0 self permanent
01:00:5e:00:00:01 dev eth0 self permanent
33:33:ff:7c:ea:17 dev eth0 self permanent
3a:ae:e8:eb:08:4d dev flannel.1 dst 172.16.0.4 self permanent
0e:83:df:06:e7:96 dev flannel.1 dst 172.16.0.2 self permanent
tsecer@node1:
- node2
tsecer@node2: ip route
default via 172.16.0.1 dev eth0
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
172.16.0.0/16 dev eth0 proto kernel scope link src 172.16.0.4
tsecer@node2: ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.0.4 netmask 255.255.0.0 broadcast 172.16.255.255
inet6 fe80::828:47ff:fe7f:c3ba prefixlen 64 scopeid 0x20<link>
ether e6:6d:8d:55:fb:89 txqueuelen 1000 (Ethernet)
RX packets 8881 bytes 38925854 (38.9 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 5978 bytes 719168 (719.1 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.244.2.0 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::38ae:e8ff:feeb:84d prefixlen 64 scopeid 0x20<link>
ether 3a:ae:e8:eb:08:4d txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 12 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 12 bytes 890 (890.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 12 bytes 890 (890.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
tsecer@node2: ip -d link show flannel.1
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether 3a:ae:e8:eb:08:4d brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
vxlan id 1 local 172.16.0.4 dev eth0 srcport 0 0 dstport 8472 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
tsecer@node2: bridge fdb show
33:33:00:00:00:01 dev bond0 self permanent
33:33:00:00:00:01 dev dummy0 self permanent
33:33:00:00:00:01 dev eth0 self permanent
01:00:5e:00:00:01 dev eth0 self permanent
33:33:ff:7f:c3:ba dev eth0 self permanent
e2:4c:d4:4e:7e:63 dev flannel.1 dst 172.16.0.3 self permanent
0e:83:df:06:e7:96 dev flannel.1 dst 172.16.0.2 self permanent
tsecer@node2:
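One detail in the output above is that eth0 has an mtu of 1500 while flannel.1 and cni0 have 1450. The 50-byte difference is consistent with the header sizes from the packet format section plus the inner Ethernet header that travels inside the tunnel. The small sketch below only does that arithmetic; it assumes an IPv4 outer header without options and no VLAN tag, and the names are illustrative.

```c
enum {
    OUTER_IP_HDR  = 20,  /* 20-byte outer IP header */
    OUTER_UDP_HDR = 8,   /* 8-byte outer UDP header */
    VXLAN_HDR     = 8,   /* 8-byte VXLAN header */
    INNER_ETH_HDR = 14   /* inner Ethernet header carried in the tunnel */
};

/* Bytes added in front of the inner IP packet by VXLAN encapsulation. */
static int vxlan_overhead(void)
{
    return OUTER_IP_HDR + OUTER_UDP_HDR + VXLAN_HDR + INNER_ETH_HDR;
}

/* Largest inner IP packet that still fits in one underlay packet. */
static int overlay_mtu(int underlay_mtu)
{
    return underlay_mtu - vxlan_overhead();
}
```

For an underlay mtu of 1500 this gives 1450, the value shown for flannel.1 on all three nodes.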
outro
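Now the question raised at the start can be answered: when the kernel's UDP socket receives a packet, it does not follow the regular flow of appending the packet to the socket receive queue and waking up a process. Instead, the callback set in the socket structure (encap_rcv) is invoked directly as a function call, with no process involved.
When vxlan creates this socket, it registers vxlan_rcv as the callback. This function decapsulates the packet and then hands it to the kernel's napi framework. The napi framework parses the inner packet again and runs routing and the rest of the receive logic (the same as for a packet from an eth NIC). From then on, what the kernel network stack sees is the decapsulated packet, that is, the sender's original payload packet that has come out of the tunnel.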
From: https://www.cnblogs.com/tsecer/p/18656299