When TCP receives an out-of-order packet, it immediately sends a SACK packet. This generates network load, but also forces the receiver to send 1-MSS pathological packets, increasing the length/depth of its retransmit (RTX) queue and thus its processing time. WiFi networks suffer from this aggressive behavior, but generally speaking these SACK packets add fuel to the fire when the network is already congested. The original commit message describes the fix:
This patch adds a high resolution timer and tp->compressed_ack counter.
Instead of sending a SACK, we program this timer with a small delay,
based on RTT and capped to 1 ms :
delay = min ( 5 % of RTT, 1 ms)
If subsequent SACKs need to be sent while the timer has not yet
expired, we simply increment tp->compressed_ack.
When timer expires, a SACK is sent with the latest information.
Whenever an ACK is sent (if data is sent, or if in-order
data is received) timer is canceled.
Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent if the SACK blocks need to be shuffled, even if the timer has not yet expired.
/*
* Check if sending an ack is needed.
*/
static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
{
struct tcp_sock *tp = tcp_sk(sk);
unsigned long rtt, delay;
/* More than one full frame received... */
if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
/* ... and right edge of window advances far enough.
* (tcp_recvmsg() will send ACK otherwise).
* If application uses SO_RCVLOWAT, we want send ack now if
* we have not received enough bytes to satisfy the condition.
*/
(tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
__tcp_select_window(sk) >= tp->rcv_wnd)) ||
/* We ACK each frame or... */
tcp_in_quickack_mode(sk) ||
/* Protocol state mandates a one-time immediate ACK */
inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOW) {
send_now:
tcp_send_ack(sk);
return;
}
	/* No out-of-order packets are possible, or the out-of-order queue is empty */
if (!ofo_possible || RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
tcp_send_delayed_ack(sk);
return;
}
/* From review discussion of this patch:
Is there a particular motivation for the cap of 127? IMHO 127 ACKs is quite
a few to compress. Experience seems to show that it works well to have one
GRO ACK for ~64KBytes that triggers a single TSO skb of ~64KBytes. It might
be nice to try to match those dynamics in this SACK compression case, so it
might be nice to cap the number of compressed ACKs at something like 44?
(0xffff / 1448 - 1). That way for high-speed paths we could try to keep
the ACK clock going with ACKs for ~64KBytes that trigger a single TSO skb
of ~64KBytes, no matter whether we are sending SACKs or cumulative ACKs.
sysctl_tcp_comp_sack_nr = 44
*/
	/* For connections without SACK support, or once compressed_ack has
	 * reached sysctl_tcp_comp_sack_nr (default 44), send the ACK now.
	 */
if (!tcp_is_sack(tp) ||
tp->compressed_ack >= sock_net(sk)->ipv4.sysctl_tcp_comp_sack_nr)
goto send_now;
if (tp->compressed_ack_rcv_nxt != tp->rcv_nxt) {
tp->compressed_ack_rcv_nxt = tp->rcv_nxt;
tp->dup_ack_counter = 0;
}
/*
In commit 86de5921a3d5 ("tcp: defer SACK compression after DupThresh")
I added a TCP_FASTRETRANS_THRESH bias to tp->compressed_ack in order
to enable sack compression only after 3 dupacks.
*/
if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
tp->dup_ack_counter++;
		goto send_now;	/* the first DupThresh dupacks are sent immediately */
}
tp->compressed_ack++;
if (hrtimer_is_queued(&tp->compressed_ack_timer))
return;
	/* Compressed-ACK timer: 5% of the RTT (rtt / 20), but no more than
	 * sysctl_tcp_comp_sack_delay_ns (default 1 ms).
	 */
rtt = tp->rcv_rtt_est.rtt_us;
if (tp->srtt_us && tp->srtt_us < rtt)
rtt = tp->srtt_us;
delay = min_t(unsigned long, sock_net(sk)->ipv4.sysctl_tcp_comp_sack_delay_ns,
rtt * (NSEC_PER_USEC >> 3)/20);
sock_hold(sk);
hrtimer_start_range_ns(&tp->compressed_ack_timer, ns_to_ktime(delay),
sock_net(sk)->ipv4.sysctl_tcp_comp_sack_slack_ns,
HRTIMER_MODE_REL_PINNED_SOFT);
}
static enum hrtimer_restart tcp_compressed_ack_kick(struct hrtimer *timer)
{
struct tcp_sock *tp = container_of(timer, struct tcp_sock, compressed_ack_timer);
struct sock *sk = (struct sock *)tp;
bh_lock_sock(sk);
if (!sock_owned_by_user(sk)) {
		if (tp->compressed_ack) {
			/* The first TCP_FASTRETRANS_THRESH dupacks were already
			 * sent immediately (tracked via dup_ack_counter above),
			 * so a simple non-zero test suffices here.
			 * Since we have to send one ack finally,
			 * subtract one from tp->compressed_ack to keep
			 * LINUX_MIB_TCPACKCOMPRESSED accurate.
			 */
tp->compressed_ack--;
tcp_send_ack(sk);
}
} else {
if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED,
&sk->sk_tsq_flags))
sock_hold(sk);
}
bh_unlock_sock(sk);
sock_put(sk);
return HRTIMER_NORESTART;
}
The send path and compressed ACKs
In the transmit function __tcp_transmit_skb, if the segment carries the TCPHDR_ACK flag, tcp_event_ack_sent is called to handle the ACK-related bookkeeping.
If the compressed-ACK count exceeds TCP_FASTRETRANS_THRESH, it resets compressed_ack to TCP_FASTRETRANS_THRESH and tries to cancel the compressed_ack_timer.
tcp: defer SACK compression after DupThresh
Jean-Louis reported a TCP regression and bisected to recent SACK
compression.
After a loss episode (receiver not able to keep up and dropping
packets because its backlog is full), linux TCP stack is sending
a single SACK (DUPACK).
Sender waits a full RTO timer before recovering losses.
While RFC 6675 says in section 5, "Algorithm Details",
(2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
indicating at least three segments have arrived above the current
cumulative acknowledgment point, which is taken to indicate loss
-- go to step (4).
...
(4) Invoke fast retransmit and enter loss recovery as follows:
there are old TCP stacks not implementing this strategy, and
still counting the dupacks before starting fast retransmit.
While these stacks probably perform poorly when receivers implement
LRO/GRO, we should be a little more gentle to them.
This patch makes sure we do not enable SACK compression unless
3 dupacks have been sent since last rcv_nxt update.
Out-of-order packets and compressed ACKs
For an incoming out-of-order packet, tcp_sack_new_ofo_skb() builds the SACK blocks. If the current number of SACK blocks is zero, the new block is simply added. Otherwise, the existing block array is scanned to see whether the packet's start/end sequence numbers can extend an existing block; an extended block is rotated to the front of the array, and the array is then checked for blocks that can be coalesced.
/* Reasonable amount of sack blocks included in TCP SACK option
* The max is 4, but this becomes 3 if TCP timestamps are there.
* Given that SACK packets might be lost, be conservative and use 2.
*/
#define TCP_SACK_BLOCKS_EXPECTED 2
static void tcp_sack_new_ofo_skb(struct sock *sk, u32 seq, u32 end_seq)
{
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_sack_block *sp = &tp->selective_acks[0];
int cur_sacks = tp->rx_opt.num_sacks;
int this_sack;
if (!cur_sacks)
goto new_sack;
for (this_sack = 0; this_sack < cur_sacks; this_sack++, sp++) {
		if (tcp_sack_extend(sp, seq, end_seq)) {	/* try to extend this block */
if (this_sack >= TCP_SACK_BLOCKS_EXPECTED)
tcp_sack_compress_send_ack(sk);
			/* Rotate this_sack to the first one (move the extended
			 * block to selective_acks[0]).
			 */
			for (; this_sack > 0; this_sack--, sp--)
swap(*sp, *(sp - 1));
if (cur_sacks > 1)
				tcp_sack_maybe_coalesce(tp);	/* try to merge later blocks into selective_acks[0] */
return;
}
}
/*
Currently, tcp_sack_new_ofo_skb() sends an ack if prior
acks were 'compressed', if room has to be made in tp->selective_acks[]
But there is no guarantee all four sack ranges can be included
in SACK option. As a matter of fact, when TCP timestamps option
is used, only three SACK ranges can be included.
Lets assume only two ranges can be included, and force the ack:
- When we touch more than 2 ranges in the reordering
done if tcp_sack_extend() could be done.
- If we have at least 2 ranges when adding a new one.
*/
if (this_sack >= TCP_SACK_BLOCKS_EXPECTED)
tcp_sack_compress_send_ack(sk);
	/* No existing block could be extended. */
/* Could not find an adjacent existing SACK, build a new one,
* put it at the front, and shift everyone else down. We
* always know there is at least one SACK present already here.
*
* If the sack array is full, forget about the last one.
*/
	if (this_sack >= TCP_NUM_SACKS) {	/* array already full */
this_sack--;
tp->rx_opt.num_sacks--;
sp--;
	}
	/* The new block goes at the front. A dropped block is discarded without
	 * sending an ACK, which may cause the peer to retransmit that data.
	 */
for (; this_sack > 0; this_sack--, sp--)
*sp = *(sp - 1);
new_sack:
/* Build the new head SACK, and we're done. */
sp->start_seq = seq;
sp->end_seq = end_seq;
tp->rx_opt.num_sacks++;
}
See: https://lore.kernel.org/all/20200430.132433.1100258513284854034.davem@davemloft.net/T/#ma539ba1d99ae941d2ae788a3f3617ff69f7d8d0e
From: https://www.cnblogs.com/codestack/p/18229529