Abstract:
ViT is a pioneering architecture in computer vision and has shown strong performance across many domains, but its heavy computational and memory requirements limit its deployment. This survey studies and evaluates four model-compression approaches applied to ViT: quantization, low-rank decomposition, knowledge distillation, and pruning. We systematically analyze and compare how well these methods optimize ViT for resource-constrained environments. Our compression experiments show that these methods strike a good balance between model size reduction and accuracy.
Related Work:
- Quantization: Quantization has achieved milestone results in improving neural-network efficiency. It converts a network to a low bit-width representation to reduce compute and memory requirements with little impact on model accuracy. A key step is determining the clipping range of the weights: Krishnamoorthi [15] suggests deriving this range from the weights of all filters in a convolutional layer, while Shen et al. [16] use group-wise quantization. To reduce the accuracy loss, Quantization-Aware Training (QAT) was proposed: the quantized model runs standard forward and backward passes in floating point, and after each gradient update the model is quantized again, so the effect of quantization is preserved during training.
- Low-rank decomposition: Chen et al. [17] emphasize that ViT attention matrices are inherently low-rank [18], which opens the door to reducing their complexity [19]. Methods such as [20,21,22,23] exploit low-rank decomposition, and combining low-rank decomposition with sparse attention yields even better results [17].
- Knowledge distillation: Touvron et al. [27] introduce a distillation token. This token is analogous to the class token but is dedicated to capturing the teacher network's predictions; it takes part in distillation through self-attention and thereby improves the distillation process. This approach has achieved notable gains over convolutional distillation.
- Pruning: The parameter-redistribution strategy proposed by Yang et al. [30] can be integrated into the pruning process to further optimize model performance.
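The QAT procedure described above can be sketched with a "fake quantization" step: weights are clipped and rounded to the integer grid in the forward pass but kept in floating point so gradients can flow normally. A minimal NumPy illustration, assuming simple symmetric per-tensor quantization (the function and variable names are ours, not from the cited methods):

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """One QAT forward step: clip to the observed range, round to the
    integer grid, then dequantize back to float so the standard float
    forward/backward pass can proceed."""
    qmax = 2 ** (num_bits - 1) - 1        # 127 for int8
    clip = np.max(np.abs(w))              # simple per-tensor clipping range
    scale = clip / qmax if clip > 0 else 1.0
    w_int = np.clip(np.round(w / scale), -qmax, qmax)
    return w_int * scale                  # "fake"-quantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
w_q = fake_quantize(w)
# Rounding error is bounded by half a quantization step (scale / 2).
print(np.max(np.abs(w - w_q)) <= np.max(np.abs(w)) / 254 + 1e-6)  # → True
```

In a real QAT loop the gradient is passed through the rounding operation unchanged (the straight-through estimator), which this sketch does not show.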
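The low-rank property of attention matrices can be checked directly: the error of the best rank-r approximation (truncated SVD) of a softmax attention matrix shrinks as r grows. A small NumPy sketch with synthetic queries and keys (not the Nyström method itself, which avoids forming the full matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 16
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))

# Full softmax attention matrix (n x n).
logits = Q @ K.T / np.sqrt(d)
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

def rank_r_error(A, r):
    """Relative Frobenius error of the best rank-r approximation."""
    U, s, Vt = np.linalg.svd(A)
    A_r = (U[:, :r] * s[:r]) @ Vt[:r]
    return np.linalg.norm(A - A_r) / np.linalg.norm(A)

# Higher rank gives a better (or equal) approximation.
errs = [rank_r_error(A, r) for r in (4, 8, 16)]
print(errs[0] >= errs[1] >= errs[2])  # → True
```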
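The distillation-token objective of [27] pairs a class-token loss on the ground-truth label with a distillation-token loss on the teacher's hard (argmax) prediction. A simplified NumPy sketch of this hard-label distillation loss; the real method trains both tokens through self-attention, which is omitted here:

```python
import numpy as np

def cross_entropy(logits, target):
    """Standard softmax cross-entropy for a single example."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Hard-label distillation in the spirit of DeiT [27]: the class token
    learns from the ground-truth label, the distillation token from the
    teacher's predicted label; the two losses are averaged."""
    teacher_label = int(np.argmax(teacher_logits))
    return 0.5 * cross_entropy(cls_logits, label) \
         + 0.5 * cross_entropy(dist_logits, teacher_label)

loss = hard_distillation_loss(
    cls_logits=np.array([2.0, 0.5, -1.0]),
    dist_logits=np.array([1.5, 0.2, -0.5]),
    teacher_logits=np.array([3.0, 0.1, -2.0]),
    label=0,
)
print(loss > 0)  # → True
```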
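Importance-score pruning in its simplest form zeroes out the weights with the smallest magnitudes, which is the "simple importance score" evaluated in the experiments below. A NumPy sketch of magnitude pruning at rate r (the helper name is ours):

```python
import numpy as np

def magnitude_prune(w, rate):
    """Zero out the fraction `rate` of weights with the smallest magnitude."""
    k = int(np.ceil(rate * w.size))
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 32))
w_pruned = magnitude_prune(w, rate=0.1)      # r = 0.1, as in the tables
print(float(np.mean(w_pruned == 0)))         # fraction of zeroed weights, ≈ rate
```

The pruned methods in the tables additionally remove whole dimensions so that the stored model actually shrinks; simply zeroing weights, as here, reduces size only after sparse storage.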
Experiments:
We compare the compression effects of the four methods on the ViT model. The main metrics are model size and inference speed.
Hardware: Tesla V100-SXM2 16GB GPU
Datasets: CIFAR-10, CIFAR-100
Comparison of compression methods: Evaluating the impact on model size, we find that quantization and knowledge distillation reduce model size the most with the least accuracy loss. In particular, quantization, and especially dynamic quantization, performs best; the dynamic quantization here uses PyTorch's quantization API.
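The roughly 4x size reduction from dynamic quantization (327 MB → 84 MB in the tables below) comes from storing linear-layer weights as int8 instead of float32; PyTorch's dynamic quantization applies this idea to `nn.Linear` weights at inference time. A NumPy sketch of per-tensor symmetric int8 weight quantization:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(768, 768)).astype(np.float32)  # one ViT-sized weight matrix

# Per-tensor symmetric int8 quantization: one float scale + int8 weights.
scale = float(np.max(np.abs(w))) / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# int8 storage is 4x smaller than float32 (plus a negligible scale factor).
ratio = w.nbytes / w_int8.nbytes
print(ratio)  # → 4.0
```

The residual model size above 327/4 MB comes from parts of the network (embeddings, norms, activations) that stay in floating point.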
Weight pruning, especially with a simple importance score, does not achieve a good balance between model size and accuracy and hurts accuracy substantially: even a pruning rate of 0.1 (removing 10% of the parameters) causes a severe accuracy drop on both datasets.
For inference speed, knowledge distillation is especially effective. In particular, the DeiT base model achieves nearly a 2x speedup with almost no loss in accuracy.
Speed values are iterations per second.
CIFAR-10

| Model | Method | Accuracy | GPU speed | CPU speed | Size (MB) |
|---|---|---|---|---|---|
| Vanilla ViT [14] | - | 98.94 | 4.48 | 0.050 | 327 |
| Dynamic Quantization | PTQ | 98.73 | - | 0.062 | 84 |
| FQ-ViT [33] | PTQ | 97.31 | - | - | - |
| DiffQ with LSQ [34] | QAT | 93.37 | 2.10 | - | 41 |
| DiffQ with diffq [35] | QAT | 60.29 | 12.20 | - | 2 |
| DeiT base [27] | Knowledge Distillation | 98.47 | 7.04 | 0.096 | 327 |
| DeiT tiny [27] | Knowledge Distillation | 95.43 | 16.78 | - | 21 |
| ViT-Pruning (r=0.1) [28] | Pruning | 88.36 | 4.86 | - | 301 |
| ViT-Pruning (r=0.2) [28] | Pruning | 80.56 | 5.54 | - | 254 |
| ViT-Nystr (m=24) [21] | Low-rank Approximation | 65.91 | 4.67 | - | 327 |
| ViT-Nystr (m=32) [21] | Low-rank Approximation | 75.94 | 4.57 | - | 327 |
| ViT-Nystr (m=64) [21] | Low-rank Approximation | 91.70 | 4.38 | - | 327 |
| DeiT base + Dynamic Quantization | Knowledge Distillation + PTQ | 96.75 | - | 0.117 | 84 |
CIFAR-100

| Model | Method | Accuracy | GPU speed | CPU speed | Size (MB) |
|---|---|---|---|---|---|
| Vanilla ViT [14] | - | 92.87 | 4.34 | 0.093 | 327 |
| Dynamic Quantization | PTQ | 90.87 | - | 0.122 | 84 |
| FQ-ViT [33] | PTQ | 84.87 | - | - | - |
| DiffQ with LSQ [34] | QAT | 76.08 | 2.10 | - | 41 |
| DiffQ with diffq [35] | QAT | 41.02 | 12.00 | - | 2 |
| DeiT base [27] | Knowledge Distillation | 87.35 | 6.97 | 0.149 | 327 |
| DeiT tiny [27] | Knowledge Distillation | 75.90 | 16.16 | - | 21 |
| ViT-Pruning (r=0.1) [28] | Pruning | 74.46 | 4.69 | - | 302 |
| ViT-Pruning (r=0.2) [28] | Pruning | 64.27 | 5.19 | - | 272 |
| ViT-Nystr (m=24) [21] | Low-rank Approximation | 38.51 | 4.70 | - | 327 |
| ViT-Nystr (m=32) [21] | Low-rank Approximation | 50.31 | 4.65 | - | 327 |
| ViT-Nystr (m=64) [21] | Low-rank Approximation | 74.01 | 4.46 | - | 327 |
| DeiT base + Dynamic Quantization | Knowledge Distillation + PTQ | 82.61 | - | 0.196 | 84 |
[15] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A
whitepaper,” arXiv preprint arXiv:1806.08342, 2018.
[16] S. Shen et al., “Q-BERT: Hessian based ultra low precision quantization of BERT,” in
Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 8815–8821.
[17] B. Chen, T. Dao, E. Winsor, Z. Song, A. Rudra, and C. Ré, “Scatterbrain: Unifying
Sparse and Low-rank Attention Approximation.” arXiv, Oct. 28, 2021.
[18] F. Chen, N. Chen, H. Mao, and H. Hu, “The Application of Bipartite Matching in
Assignment Problem,” arXiv preprint arXiv:1902.00256, 2019.
[19] F. Chen, N. Chen, H. Mao, and H. Hu, “An efficient sorting algorithm-Ultimate
Heapsort (UHS),” 2019.
[20] J. Lu et al., “Soft: Softmax-free transformer with linear complexity,” Advances in
Neural Information Processing Systems, vol. 34, pp. 21297–21309, 2021.
[21] Y. Xiong et al., “Nyströmformer: A Nyström-based Algorithm for Approximating
Self-Attention,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[22] K. Choromanski et al., “Rethinking Attention with Performers.” arXiv, Nov. 19, 2022.
doi: 10.48550/arXiv.2009.14794.
[23] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-Attention with
Linear Complexity.” arXiv, Jun. 14, 2020.
[24] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network.”
arXiv, Mar. 09, 2015.
[25] L. Yuan et al., “Tokens-to-token vit: Training vision transformers from scratch on
imagenet,” in Proceedings of the IEEE/CVF international conference on computer
vision, 2021, pp. 558–567.
[26] L. Wei, A. Xiao, L. Xie, X. Zhang, X. Chen, and Q. Tian, “Circumventing Outliers of
AutoAugment with Knowledge Distillation,” in Computer Vision – ECCV 2020, vol.
12348, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., in Lecture Notes in
Computer Science, vol. 12348. , Cham: Springer International Publishing, 2020, pp.
608–625. doi: 10.1007/978-3-030-58580-8_36.
[27] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou, “Training
data-efficient image transformers & distillation through attention,” in Proceedings of
the 38th International Conference on Machine Learning, PMLR, Jul. 2021, pp. 10347–
10357.
[28] M. Zhu, Y. Tang, and K. Han, “Vision Transformer Pruning.” arXiv, Aug. 14, 2021.
doi: 10.48550/arXiv.2104.08500.
[29] S. Yu et al., “Unified Visual Transformer Compression.” arXiv, Mar. 15, 2022. doi:
10.48550/arXiv.2203.08243.
[30] H. Yang, H. Yin, P. Molchanov, H. Li, and J. Kautz, “Nvit: Vision transformer
compression and parameter redistribution,” 2021.
[31] H. Yu and J. Wu, “A unified pruning framework for vision transformers,” Sci. China
Inf. Sci., vol. 66, no. 7, p. 179101, Jul. 2023, doi: 10.1007/s11432-022-3646-6.
[32] Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization
for vision transformer,” Advances in Neural Information Processing Systems, vol. 34,
pp. 28092–28103, 2021.
[33] Y. Lin, T. Zhang, P. Sun, Z. Li, and S. Zhou, “FQ-ViT: Fully quantized vision
transformer without retraining,” arXiv preprint arXiv:2111.13824, 2021.
[34] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, “Learned
Step Size Quantization.” arXiv, May 06, 2020. doi: 10.48550/arXiv.1902.08153.
[35] A. Défossez, Y. Adi, and G. Synnaeve, “Differentiable Model Compression via Pseudo
Quantization Noise.” arXiv, Oct. 17, 2022.
[36] Y. Tang et al., “Patch slimming for efficient vision transformers,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12165–
12174.
[37] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, “Dynamicvit: Efficient vision
transformers with dynamic token sparsification,” Advances in neural information processing systems, vol. 34, pp. 13937–13949, 2021.
[38] Y. Zhao, W. Dai, Z. Wang, and A. E. Ragab, “Application of computer simulation to
model transient vibration responses of GPLs reinforced doubly curved concrete panel
under instantaneous heating,” Materials Today Communications, vol. 38, p. 107949,
Mar. 2024, doi: 10.1016/j.mtcomm.2023.107949.
[39] W. Dai, M. Fatahizadeh, H. G. Touchaei, H. Moayedi, and L. K. Foong, “Application
of six neural network-based solutions on bearing capacity of shallow footing on double-
layer soils,” Steel and Composite Structures, vol. 49, no. 2, pp. 231–244, 2023, doi:
10.12989/scs.2023.49.2.231.
[40] W. Dai, “Safety Evaluation of Traffic System with Historical Data Based on Markov
Process and Deep-Reinforcement Learning,” Journal of Computational Methods in
Engineering Applications, pp. 1–14, Oct. 2021.
[41] W. Dai, “Design of Traffic Improvement Plan for Line 1 Baijiahu Station of Nanjing
Metro,” Innovations in Applied Engineering and Technology, Dec. 2023, doi:
10.58195/iaet.v2i1.133.
[42] W. Dai, “Evaluation and Improvement of Carrying Capacity of a Traffic System,”
Innovations in Applied Engineering and Technology, pp. 1–9, Nov. 2022, doi:
10.58195/iaet.v1i1.001.
[43] H. Wang, Y. Zhou, E. Perez, and F. Roemer, “Jointly Learning Selection Matrices For
Transmitters, Receivers And Fourier Coefficients In Multichannel Imaging.” arXiv,
Feb. 29, 2024. Accessed: Mar. 23, 2024.
[44] L. Zhou, Z. Luo, and X. Pan, “Machine learning-based system reliability analysis with
Gaussian Process Regression.” arXiv, Mar. 17, 2024.
[45] M. Li, Y. Zhou, G. Jiang, T. Deng, Y. Wang, and H. Wang, “DDN-SLAM: Real-time
Dense Dynamic Neural Implicit SLAM.” arXiv, Mar. 08, 2024. Accessed: Mar. 23,
2024.
[46] Y. Zhou et al., “Semantic Wireframe Detection,” 2023, Accessed: Mar. 23, 2024.
[47] G. Tao et al., “Surf4 (Surfeit Locus Protein 4) Deficiency Reduces Intestinal Lipid
Absorption and Secretion and Decreases Metabolism in Mice,” ATVB, vol. 43, no. 4,
pp. 562–580, Apr. 2023, doi: 10.1161/ATVBAHA.123.318980.
[48] Y. Shen, H.-M. Gu, S. Qin, and D.-W. Zhang, “Surf4, cargo trafficking, lipid
metabolism, and therapeutic implications,” Journal of Molecular Cell Biology, vol. 14,
no. 9, p. mjac063, 2022.
[49] M. Wang et al., “Identification of amino acid residues in the MT-loop of MT1-MMP
critical for its ability to cleave low-density lipoprotein receptor,” Frontiers in
Cardiovascular Medicine, vol. 9, p. 917238, 2022.
[50] Y. Shen, H. Gu, L. Zhai, B. Wang, S. Qin, and D. Zhang, “The role of hepatic Surf4 in
lipoprotein metabolism and the development of atherosclerosis in apoE-/- mice,”
Biochimica et Biophysica Acta (BBA)-Molecular and Cell Biology of Lipids, vol. 1867,
no. 10, p. 159196, 2022.
[51] B. Wang et al., “Atherosclerosis-associated hepatic secretion of VLDL but not PCSK9
is dependent on cargo receptor protein Surf4,” Journal of Lipid Research, vol. 62, 2021,
Accessed: Mar. 17, 2024.
From: https://www.cnblogs.com/gishuanhuan/p/18165685