- Institute: University of Illinois at Urbana-Champaign
- Authors: Jiahui Yu, Thomas Huang
- GitHub: https://github.com/JiahuiYu/slimmable_networks
Introduction
The original slimmable networks switch width only among a predefined width set.
=> Motivation: can a single neural network run at arbitrary width? The authors argue that a wider network should perform no worse than its slim sub-network, and that the residual error conceptually satisfies the bounded inequality:
0 = |δ_n| ≤ |δ_{k+1}| ≤ |δ_k| ≤ |δ_{k0}|, for k ∈ [k0, n)
where y_k denotes the feature aggregated over the first k channels (δ_k = y_n − y_k) and k0 is a fixed hyper-parameter.
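A quick numerical sanity check of this intuition (an illustrative sketch, not code from the paper): with random weights and inputs, the average residual magnitude shrinks as more channels are aggregated.

```python
import numpy as np

# Illustrative demo: for a random single-layer neuron y_k = sum_{i<=k} w_i*x_i,
# the residual |delta_k| = |y_n - y_k| shrinks on average as k grows.
rng = np.random.default_rng(0)
n, trials = 64, 10_000
w = rng.normal(size=(trials, n))
x = rng.normal(size=(trials, n))
y_partial = np.cumsum(w * x, axis=1)           # y_k for k = 1..n
delta = np.abs(y_partial[:, -1:] - y_partial)  # |delta_k| per trial
for k in (15, 31, 47, 63):
    print(f"k={k + 1:2d}  mean |delta_k| = {delta[:, k].mean():.3f}")
```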
Challenges:
First, how to deal with neural networks with batch normalization? (BN design)
Second, how to train US-Nets efficiently? (training)
Third, compared with training individual networks, what else can we explore in US-Nets to improve overall performance?
Problems arise in the BN layers:
First, accumulating independent BN statistics of all sub-networks in a US-Net during training is computationally intensive and inefficient.
Second, if in each iteration we only update some sampled sub-networks, then these BN statistics are insufficiently accumulated thus inaccurate, leading to much worse accuracy in our experiments.
To address these problems, the authors' contributions are:
(1) trained a single network executable at arbitrary width;
(2) proposed two training techniques (the sandwich rule and inplace distillation);
(3) ran experiments and ablation studies on image classification, super-resolution, and reinforcement learning;
(4) further studied several parameters of the network: the width lower bound k0, the width divisor d, the number of sampled widths per iteration, and the size of the subset used for post-statistics of BN;
(5) further proposed that each layer can use an individual width ratio;
(6) laid groundwork for follow-up work (one-shot architecture search).
Related Work
Slimmable Networks.
Knowledge Distillation: transfers the learned knowledge from a pretrained network to a new one by training it with predicted features, soft targets, or both.
Method
Feature aggregation in a single output neuron: y = Σ_{i=1}^{n} w_i·x_i (1), where n is the number of input channels. The residual error δ between the fully aggregated feature y_n and the partially aggregated feature y_k = Σ_{i=1}^{k} w_i·x_i, k ∈ [k0, n] (2) is:
δ_k = y_n − y_k = Σ_{i=k+1}^{n} w_i·x_i (3)
Equation (3) shows that a slimmable network can run at any width in the interval [k0, n] (US-Nets), and conceptually the bounded inequality applies to any neural network, regardless of the type of BN layer.
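In code, running at arbitrary width amounts to slicing the full-width weight tensor at runtime. A minimal PyTorch sketch (simplified; the actual layers in JiahuiYu/slimmable_networks handle groups, width bookkeeping, and BN differently):

```python
import torch.nn as nn
import torch.nn.functional as F

class USConv2d(nn.Conv2d):
    """Sketch of a universally slimmable conv: allocate full-width weights
    once, use a runtime slice of them. `width_mult` is set externally."""
    def __init__(self, in_ch, out_ch, kernel_size, **kw):
        super().__init__(in_ch, out_ch, kernel_size, **kw)
        self.width_mult = 1.0

    def forward(self, x):
        in_ch = x.size(1)                       # follow the incoming width
        out_ch = max(1, int(self.out_channels * self.width_mult))
        weight = self.weight[:out_ch, :in_ch]   # slice output/input channels
        bias = self.bias[:out_ch] if self.bias is not None else None
        return F.conv2d(x, weight, bias, self.stride, self.padding, self.dilation)
```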
During training, the BN layer normalizes features as:
ŷ = γ · (y − μ_B) / √(σ²_B + ε) + β (4)
where ε is a small value to avoid division by zero, γ and β are the learned scale and bias, and μ_B, σ²_B are the mean and variance of the current mini-batch. The global feature mean and variance are updated with a moving average:
μ_t = m·μ_{t−1} + (1 − m)·μ_{B(t)},  σ²_t = m·σ²_{t−1} + (1 − m)·σ²_{B(t)} (5)
where m is the momentum and t indexes training iterations. At inference, the global statistics from the final iteration T are used:
ŷ = γ∗ · (y − μ_T) / √(σ²_T + ε) + β∗ (6)
where γ∗, β∗ are the optimized parameters. Equation (6) can further be folded into a single linear transform:
ŷ = a·y + b,  with a = γ∗ / √(σ²_T + ε) and b = β∗ − a·μ_T
Besides the moving average of Equation (5), the statistics can also be computed as exact averages over all iterations:
μ_T = (1/T) Σ_t μ_{B(t)},  σ²_T = (1/T) Σ_t σ²_{B(t)} (7)
In practice, the authors simply compute the BN statistics for each width after training (all that derivation just for this? treat it as a refresher on BN theory), since a randomly sampled subset of the training set already yields accurate estimates of these statistics.
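A sketch of that post-training calibration step. `set_width` is a hypothetical helper that switches every slimmable layer (the repo's actual mechanism differs in detail):

```python
import torch

def set_width(model, width_mult):
    """Hypothetical switch: every slimmable layer reads `width_mult`."""
    for m in model.modules():
        if hasattr(m, "width_mult"):
            m.width_mult = width_mult

def calibrate_bn(model, loader, width_mult, num_batches=50, device="cuda"):
    """Post-statistics of BN: recompute running mean/var for one width
    after training, from a small random subset of the training data."""
    set_width(model, width_mult)
    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            m.reset_running_stats()  # drop stats accumulated at other widths
            m.momentum = None        # None => exact cumulative moving average
    model.train()                    # BN updates running stats in train mode
    with torch.no_grad():            # weights stay frozen
        for i, (images, _) in enumerate(loader):
            if i >= num_batches:
                break
            model(images.to(device))
    model.eval()
```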
For training, the authors propose the sandwich rule and inplace distillation (see the combined sketch under "Training procedure" below).
The sandwich rule: in each training iteration, randomly sample n−2 widths from the width multiplier range [0.25, 1.0]×, then add the smallest and largest widths, giving n sub-networks in total. The performance of any sampled sub-network also lies between that of 0.25× and 1.0×.
The sandwich rule shows better convergence and better overall performance. Its advantages:
1. Training the largest and smallest sub-networks and tracking their validation errors yields upper and lower bounds on the performance across all widths.
2. Training the largest sub-network is necessary for inplace distillation.
Inplace Distillation: the labels predicted by the largest sub-network during training are used as training labels for the other sub-networks, while the largest sub-network itself is trained with the ground truth.
Image classification: the soft probabilities predicted at the largest width, with cross-entropy as the objective function.
Image super-resolution: the predicted high-resolution patches are used as labels, with either ℓ1 or ℓ2 loss as the training objective.
Reinforcement learning: the policy predicted by the model at the largest width is used to generate roll-outs.
The authors also tried combining the predicted labels with the ground-truth labels as training labels for the sub-networks, but it performed worse.
Training procedure:
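A minimal sketch of one training iteration combining the sandwich rule with inplace distillation, under the same assumptions as above (`set_width` is the hypothetical helper defined earlier; hyper-parameters are illustrative):

```python
import random
import torch.nn.functional as F

def train_step(model, images, targets, optimizer,
               n_widths=4, min_w=0.25, max_w=1.0):
    """One iteration of the sandwich rule + inplace distillation (sketch)."""
    optimizer.zero_grad()

    # Largest sub-network: trained with ground truth; its detached
    # predictions become soft labels for every other width.
    set_width(model, max_w)
    logits_max = model(images)
    F.cross_entropy(logits_max, targets).backward()
    soft_labels = logits_max.detach().softmax(dim=1)

    # Smallest width plus n-2 randomly sampled widths, each trained by
    # inplace distillation (soft cross-entropy against the soft labels).
    widths = [min_w] + [random.uniform(min_w, max_w) for _ in range(n_widths - 2)]
    for w in widths:
        set_width(model, w)
        log_p = F.log_softmax(model(images), dim=1)
        (-(soft_labels * log_p).sum(dim=1)).mean().backward()

    optimizer.step()  # gradients from all sampled widths accumulate first
```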
Experiments
ImageNet Classification:
Image Super-Resolution:
The authors attribute the weaker results to using hyper-parameters that are optimal for individually trained models rather than for US-Nets.
Deep Reinforcement Learning
Ablation Study
The Sandwich Rule:
The sandwich rule has better performance on average, with good accuracy at both the smallest and the largest width.
Training the smallest sub-network matters more than training the largest.
Inplace Distillation:
Post-Statistics of Batch Normalization:
Width Lower Bound k0:
Width Divisor d: following MobileNets, the channel number at width multiplier r is floored approximately as n′ = ⌊n·r/d⌋·d, keeping channel counts multiples of d.
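For illustration, a commonly used variant of the MobileNet channel-rounding helper (shown only to illustrate the role of d; not necessarily the paper's exact code):

```python
def make_divisible(v, divisor=8, min_value=None):
    """Round a channel count to a multiple of `divisor`, MobileNet-style."""
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:  # never round down by more than 10%
        new_v += divisor
    return new_v

print(make_divisible(64 * 0.3))  # 19.2 channels -> 24
```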
Number of Sampled Widths Per Iteration n:
Comments: this paper fills the fixed-width gap left by the ICLR'19 slimmable networks, mainly demonstrating that a single network can execute at arbitrary width. The sandwich rule and inplace distillation offer a training recipe for follow-up work.