
A Survey of ViT Model Compression (Comprehensive Survey of Model Compression and Speed up for Vision Transformers)


Abstract:

ViT is a landmark architecture in computer vision and shows strong performance across many domains, but its heavy compute and memory requirements limit its use. The survey studies and evaluates four model-compression approaches applied to ViT: quantization, low-rank factorization, knowledge distillation, and pruning. It systematically analyzes and compares how effective these methods are at optimizing ViT for resource-constrained environments. The compression experiments show that these methods strike a good balance between model-size reduction and accuracy.

Related work:

  • Quantization: Quantization has produced milestone gains in neural-network efficiency. It converts a network to a low-bit representation to reduce compute and memory requirements with essentially no loss in accuracy. The key issue is determining the clipping range of the weights: Krishnamoorthi [15] suggests estimating this range from the weights of all filters in a convolutional layer, while Shen et al. [16] use group-wise quantization. To further reduce the accuracy loss, quantization-aware training (QAT) was proposed: the quantized model runs standard forward and backward passes in floating point, and after the gradients are computed the model is quantized again, so the effect of quantization is preserved during training (a minimal sketch follows this list).
  • Low-rank factorization: Chen et al. [17] emphasize that ViT's attention matrices are inherently low-rank [18], which opens the door to reducing complexity [19]. Methods such as [20, 21, 22, 23] exploit low-rank factorization, and combining low-rank factorization with sparse attention yields even better results [17] (see the Nyström-style sketch after this list).
  • Knowledge distillation: Touvron et al. [27] introduce a distillation token. Similar to the class token, it focuses on capturing the teacher network's predictions and takes part in distillation through self-attention, improving the distillation process. This attention-based distillation achieves considerable gains compared with convolution-based distillation.
  • Pruning: The dimension-redistribution strategy proposed by Yang et al. [30] is integrated into the pruning process to further optimize model performance.
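To make the QAT recipe in the quantization item above concrete, here is a minimal PyTorch sketch of fake quantization with a straight-through estimator. The 8-bit symmetric scheme, the max-magnitude clipping range, and the layer sizes are illustrative assumptions of this sketch, not the setup of any cited paper.

```python
import torch
import torch.nn.functional as F

class FakeQuantize(torch.autograd.Function):
    """Quantize weights in the forward pass; pass gradients straight through in the backward pass."""

    @staticmethod
    def forward(ctx, w, num_bits):
        # Symmetric uniform quantization; clipping range taken from the tensor's max magnitude (an assumption).
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: gradients flow as if quantization were the identity.
        return grad_output, None


class QATLinear(torch.nn.Linear):
    """Linear layer that trains with fake-quantized weights (QAT-style)."""

    def forward(self, x):
        w_q = FakeQuantize.apply(self.weight, 8)  # weights are re-quantized at every step
        return F.linear(x, w_q, self.bias)


if __name__ == "__main__":
    layer = QATLinear(768, 768)        # hypothetical ViT-sized projection
    x = torch.randn(4, 197, 768)       # batch of patch-token sequences
    layer(x).sum().backward()          # full-precision gradients reach layer.weight
    print(layer.weight.grad.shape)
```

The point is simply that the forward pass always sees quantized weights while the optimizer updates the underlying float weights, which is the behaviour described in the quantization item.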

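For the low-rank item, the sketch below approximates softmax attention with a small set of landmark tokens in the spirit of Nyströmformer [21]. It is deliberately simplified: the segment-mean landmark selection, the exact pseudo-inverse (the paper approximates it iteratively), and the requirement that the sequence length be divisible by m are assumptions of this sketch.

```python
import torch

def nystrom_attention(q, k, v, m=32):
    """Approximate softmax(QK^T/sqrt(d)) V with m landmark tokens (Nystrom-style sketch)."""
    b, n, d = q.shape
    scale = d ** -0.5
    # Landmarks: mean of q/k over n//m consecutive segments (simplified landmark selection).
    q_l = q.reshape(b, m, n // m, d).mean(dim=2)
    k_l = k.reshape(b, m, n // m, d).mean(dim=2)
    # Three small softmax kernels instead of one n x n attention matrix.
    f = torch.softmax(q @ k_l.transpose(-1, -2) * scale, dim=-1)    # (b, n, m)
    a = torch.softmax(q_l @ k_l.transpose(-1, -2) * scale, dim=-1)  # (b, m, m)
    g = torch.softmax(q_l @ k.transpose(-1, -2) * scale, dim=-1)    # (b, m, n)
    # O(n*m) memory instead of O(n^2).
    return f @ torch.linalg.pinv(a) @ (g @ v)

if __name__ == "__main__":
    q = k = v = torch.randn(2, 256, 64)              # sequence length divisible by m in this sketch
    print(nystrom_attention(q, k, v, m=32).shape)    # torch.Size([2, 256, 64])
```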
Experiments:

The experiments compare the compression effect of the four methods on ViT models; the main metrics observed are model size and inference speed.

Hardware: Tesla V100-SXM2 16GB GPU

Datasets: CIFAR-10, CIFAR-100

Comparison of compression methods: Evaluating the impact of each method on model size, quantization and pruning reduce model size the most with the least accuracy loss. In particular, quantization, and especially dynamic quantization, gives the best result; the dynamic quantization here uses PyTorch's quantization API.
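A minimal sketch of that PyTorch dynamic-quantization call is shown below. The timm model name and num_classes are placeholders for illustration, not the survey's actual fine-tuned checkpoint.

```python
import os
import torch
import timm  # assumed available here, just to obtain a ViT backbone

# Hypothetical stand-in for the fine-tuned ViT checkpoint used in the survey.
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=10)
model.eval()

# Post-training dynamic quantization: Linear weights stored as int8,
# activations quantized on the fly at inference time (CPU backend).
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m: torch.nn.Module) -> float:
    """Approximate on-disk size of a model's state dict in MB."""
    torch.save(m.state_dict(), "tmp.pt")
    size = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return size

print(f"fp32: {size_mb(model):.0f} MB -> int8: {size_mb(quantized):.0f} MB")
```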

Weight pruning, especially with a simple importance score, does not do well in terms of either model size or accuracy and hurts accuracy considerably: a pruning ratio of 0.1 (removing 10% of the parameters) already causes a severe accuracy drop on both datasets.
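For reference, this kind of simple importance-score pruning can be sketched with PyTorch's pruning utilities, using L1 weight magnitude as the score and a 10% ratio. It is an unstructured-pruning illustration under assumed module choices, not the structured dimension pruning of [28].

```python
import torch
import torch.nn.utils.prune as prune

# Toy stand-in for a ViT block's MLP; in practice this loops over the model's Linear layers.
mlp = torch.nn.Sequential(
    torch.nn.Linear(768, 3072), torch.nn.GELU(), torch.nn.Linear(3072, 768)
)

# Global unstructured pruning: remove the 10% of weights with the smallest |w|
# (importance score = L1 magnitude).
params = [(m, "weight") for m in mlp.modules() if isinstance(m, torch.nn.Linear)]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.1)

# Make the pruning permanent (folds the mask into the weight tensor).
for module, name in params:
    prune.remove(module, name)

total = sum(p.numel() for p in mlp.parameters())
zeros = sum((p == 0).sum().item() for p in mlp.parameters())
print(f"sparsity: {zeros / total:.1%}")
```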

For inference speed, knowledge distillation is especially effective: the DeiT base model achieves close to a 2x speedup with almost no loss in accuracy.
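The DeiT numbers come from attention-based distillation with a distillation token [27]; below is a simplified sketch of the hard-label distillation objective, in which the distillation token's head is supervised by the teacher's predicted class. Tensor shapes and the 0.5/0.5 weighting are illustrative defaults, not the released training code.

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    """DeiT-style hard distillation: the class token learns from ground-truth labels,
    the distillation token learns from the teacher's hard predictions."""
    teacher_labels = teacher_logits.argmax(dim=-1)            # teacher's predicted class
    loss_cls = F.cross_entropy(cls_logits, targets)           # supervised by ground truth
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # supervised by the teacher
    return 0.5 * loss_cls + 0.5 * loss_dist

if __name__ == "__main__":
    cls_logits = torch.randn(8, 10)      # student class-token head (e.g. CIFAR-10)
    dist_logits = torch.randn(8, 10)     # student distillation-token head
    teacher_logits = torch.randn(8, 10)  # frozen teacher (a convnet in DeiT)
    targets = torch.randint(0, 10, (8,))
    print(deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets))
```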

The speed values in the tables below are iterations per second.
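For clarity on how such a number could be measured, here is a rough throughput sketch; the batch size, input resolution, and warm-up count are assumptions, not the survey's benchmarking script.

```python
import time
import torch

@torch.no_grad()
def iterations_per_second(model, device="cuda", iters=50, batch=32):
    """Rough throughput in forward passes per second (assumed settings)."""
    model = model.to(device).eval()
    x = torch.randn(batch, 3, 224, 224, device=device)
    for _ in range(5):                 # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```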

CIFAR-10

| Model | Method | Accuracy (%) | GPU speed (it/s) | CPU speed (it/s) | Size (MB) |
|---|---|---|---|---|---|
| Vanilla ViT [14] | – | 98.94 | 4.48 | 0.050 | 327 |
| Dynamic Quantization | PTQ | 98.73 | – | 0.062 | 84 |
| FQ-ViT [33] | PTQ | 97.31 | – | – | – |
| DIFFQ with LSQ [35, 34] | QAT | 93.37 | 2.10 | – | 41 |
| DIFFQ with diffq [35] | QAT | 60.29 | 12.20 | – | 2 |
| DeiT base [27] | Knowledge Distillation | 98.47 | 7.04 | 0.096 | 327 |
| DeiT tiny [27] | Knowledge Distillation | 95.43 | 16.78 | – | 21 |
| ViT-Pruning (r=0.1) [28] | Pruning | 88.36 | 4.86 | – | 301 |
| ViT-Pruning (r=0.2) [28] | Pruning | 80.56 | 5.54 | – | 254 |
| ViT-Nystr (m=24) [21] | Low-rank Approximation | 65.91 | 4.67 | – | 327 |
| ViT-Nystr (m=32) [21] | Low-rank Approximation | 75.94 | 4.57 | – | 327 |
| ViT-Nystr (m=64) [21] | Low-rank Approximation | 91.70 | 4.38 | – | 327 |
| DeiT base + Dynamic Quantization | Knowledge Distillation + PTQ | 96.75 | – | 0.117 | 84 |

CIFAR-100

| Model | Method | Accuracy (%) | GPU speed (it/s) | CPU speed (it/s) | Size (MB) |
|---|---|---|---|---|---|
| Vanilla ViT [14] | – | 92.87 | 4.34 | 0.093 | 327 |
| Dynamic Quantization | PTQ | 90.87 | – | 0.122 | 84 |
| FQ-ViT [33] | PTQ | 84.87 | – | – | – |
| DIFFQ with LSQ [35, 34] | QAT | 76.08 | 2.10 | – | 41 |
| DIFFQ with diffq [35] | QAT | 41.02 | 12.00 | – | 2 |
| DeiT base [27] | Knowledge Distillation | 87.35 | 6.97 | 0.149 | 327 |
| DeiT tiny [27] | Knowledge Distillation | 75.90 | 16.16 | – | 21 |
| ViT-Pruning (r=0.1) [28] | Pruning | 74.46 | 4.69 | – | 302 |
| ViT-Pruning (r=0.2) [28] | Pruning | 64.27 | 5.19 | – | 272 |
| ViT-Nystr (m=24) [21] | Low-rank Approximation | 38.51 | 4.7 | – | 327 |
| ViT-Nystr (m=32) [21] | Low-rank Approximation | 50.31 | 4.65 | – | 327 |
| ViT-Nystr (m=64) [21] | Low-rank Approximation | 74.01 | 4.46 | – | 327 |
| DeiT base + Dynamic Quantization | Knowledge Distillation + PTQ | 82.61 | – | 0.196 | 84 |

References:
[15] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A
whitepaper,” arXiv preprint arXiv:1806.08342, 2018.
[16] S. Shen et al., “Q-bert: Hessian based ultra low precision quantization of bert,” in
Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 8815–8821.
[17] B. Chen, T. Dao, E. Winsor, Z. Song, A. Rudra, and C. Ré, “Scatterbrain: Unifying
Sparse and Low-rank Attention Approximation.” arXiv, Oct. 28, 2021.
[18] F. Chen, N. Chen, H. Mao, and H. Hu, “The Application of Bipartite Matching in
Assignment Problem,” arXiv preprint arXiv:1902.00256, 2019.

[19] F. Chen, N. Chen, H. Mao, and H. Hu, “An efficient sorting algorithm-Ultimate
Heapsort (UHS). 2019.”
[20] J. Lu et al., “Soft: Softmax-free transformer with linear complexity,” Advances in
Neural Information Processing Systems, vol. 34, pp. 21297–21309, 2021.
[21] Y. Xiong et al., “Nyströmformer: A Nyström-based Algorithm for Approximating
Self-Attention,” AAAI, 2021.
[22] K. Choromanski et al., “Rethinking Attention with Performers.” arXiv, Nov. 19, 2022.
doi: 10.48550/arXiv.2009.14794.
[23] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-Attention with
Linear Complexity.” arXiv, Jun. 14, 2020.
[24] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network.”
arXiv, Mar. 09, 2015.
[25] L. Yuan et al., “Tokens-to-token vit: Training vision transformers from scratch on
imagenet,” in Proceedings of the IEEE/CVF international conference on computer
vision, 2021, pp. 558–567.
[26] L. Wei, A. Xiao, L. Xie, X. Zhang, X. Chen, and Q. Tian, “Circumventing Outliers of
AutoAugment with Knowledge Distillation,” in Computer Vision – ECCV 2020, vol.
12348, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., in Lecture Notes in
Computer Science, vol. 12348. , Cham: Springer International Publishing, 2020, pp.
608–625. doi: 10.1007/978-3-030-58580-8_36.
[27] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou, “Training
data-efficient image transformers & distillation through attention,” in Proceedings of
the 38th International Conference on Machine Learning, PMLR, Jul. 2021, pp. 10347–
10357.
[28] M. Zhu, Y. Tang, and K. Han, “Vision Transformer Pruning.” arXiv, Aug. 14, 2021.
doi: 10.48550/arXiv.2104.08500.
[29] S. Yu et al., “Unified Visual Transformer Compression.” arXiv, Mar. 15, 2022. doi:
10.48550/arXiv.2203.08243.
[30] H. Yang, H. Yin, P. Molchanov, H. Li, and J. Kautz, “Nvit: Vision transformer
compression and parameter redistribution,” 2021.
[31] H. Yu and J. Wu, “A unified pruning framework for vision transformers,” Sci. China
Inf. Sci., vol. 66, no. 7, p. 179101, Jul. 2023, doi: 10.1007/s11432-022-3646-6.
[32] Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization
for vision transformer,” Advances in Neural Information Processing Systems, vol. 34,
pp. 28092–28103, 2021.
[33] Y. Lin, T. Zhang, P. Sun, Z. Li, and S. Zhou, “Fq-vit: Fully quantized vision
transformer without retraining,” arXiv preprint arXiv:2111.13824, 2021.
[34] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, “Learned
Step Size Quantization.” arXiv, May 06, 2020. doi: 10.48550/arXiv.1902.08153.
[35] A. Défossez, Y. Adi, and G. Synnaeve, “Differentiable Model Compression via Pseudo
Quantization Noise.” arXiv, Oct. 17, 2022.
[36] Y. Tang et al., “Patch slimming for efficient vision transformers,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12165–
12174.
[37] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, “Dynamicvit: Efficient vision
transformers with dynamic token sparsification,” Advances in neural information processing systems, vol. 34, pp. 13937–13949, 2021.

[38] Y. Zhao, W. Dai, Z. Wang, and A. E. Ragab, “Application of computer simulation to
model transient vibration responses of GPLs reinforced doubly curved concrete panel
under instantaneous heating,” Materials Today Communications, vol. 38, p. 107949,
Mar. 2024, doi: 10.1016/j.mtcomm.2023.107949.
[39] W. Dai, M. Fatahizadeh, H. G. Touchaei, H. Moayedi, and L. K. Foong, “Application
of six neural network-based solutions on bearing capacity of shallow footing on double-
layer soils,” Steel and Composite Structures, vol. 49, no. 2, pp. 231–244, 2023, doi:
10.12989/scs.2023.49.2.231.
[40] W. Dai, “Safety Evaluation of Traffic System with Historical Data Based on Markov
Process and Deep-Reinforcement Learning,” Journal of Computational Methods in
Engineering Applications, pp. 1–14, Oct. 2021.
[41] W. Dai, “Design of Traffic Improvement Plan for Line 1 Baijiahu Station of Nanjing
Metro,” Innovations in Applied Engineering and Technology, Dec. 2023, doi:
10.58195/iaet.v2i1.133.
[42] W. Dai, “Evaluation and Improvement of Carrying Capacity of a Traffic System,”
Innovations in Applied Engineering and Technology, pp. 1–9, Nov. 2022, doi:
10.58195/iaet.v1i1.001.
[43] H. Wang, Y. Zhou, E. Perez, and F. Roemer, “Jointly Learning Selection Matrices For
Transmitters, Receivers And Fourier Coefficients In Multichannel Imaging.” arXiv,
Feb. 29, 2024. Accessed: Mar. 23, 2024.
[44] L. Zhou, Z. Luo, and X. Pan, “Machine learning-based system reliability analysis with
Gaussian Process Regression.” arXiv, Mar. 17, 2024.
[45] M. Li, Y. Zhou, G. Jiang, T. Deng, Y. Wang, and H. Wang, “DDN-SLAM: Real-time
Dense Dynamic Neural Implicit SLAM.” arXiv, Mar. 08, 2024. Accessed: Mar. 23,
2024.
[46] Y. Zhou et al., “Semantic Wireframe Detection,” 2023, Accessed: Mar. 23, 2024.
[47] G. Tao et al., “Surf4 (Surfeit Locus Protein 4) Deficiency Reduces Intestinal Lipid
Absorption and Secretion and Decreases Metabolism in Mice,” ATVB, vol. 43, no. 4,
pp. 562–580, Apr. 2023, doi: 10.1161/ATVBAHA.123.318980.
[48] Y. Shen, H.-M. Gu, S. Qin, and D.-W. Zhang, “Surf4, cargo trafficking, lipid
metabolism, and therapeutic implications,” Journal of Molecular Cell Biology, vol. 14,
no. 9, p. mjac063, 2022.
[49] M. Wang et al., “Identification of amino acid residues in the MT-loop of MT1-MMP
critical for its ability to cleave low-density lipoprotein receptor,” Frontiers in
Cardiovascular Medicine, vol. 9, p. 917238, 2022.
[50] Y. Shen, H. Gu, L. Zhai, B. Wang, S. Qin, and D. Zhang, “The role of hepatic Surf4 in
lipoprotein metabolism and the development of atherosclerosis in apoE-/- mice,”
Biochimica et Biophysica Acta (BBA)-Molecular and Cell Biology of Lipids, vol. 1867,
no. 10, p. 159196, 2022.
[51] B. Wang et al., “Atherosclerosis-associated hepatic secretion of VLDL but not PCSK9
is dependent on cargo receptor protein Surf4,” Journal of Lipid Research, vol. 62, 2021,
Accessed: Mar. 17, 2024.

From: https://www.cnblogs.com/gishuanhuan/p/18165685

    本文作者:slience_me文章目录TransformersinTimeSeriesASurvey综述总结1Introduction2Transformer的组成PreliminariesoftheTransformer2.1VanillaTransformer2.2输入编码和位置编码InputEncodingandPositionalEncoding绝对位置编码AbsolutePosit......