
Quantization: fp16, bf16, int8, fp4, nf4


1 GPU Memory Usage

1.1 How to Compute

How to compute GPU Memory Usage?

Training a model with num_param parameters in full precision (fp32) with AdamW needs roughly:

Model weights: 4 Bytes * num_param
Optimizer states: 4 Bytes * 2 * num_param (AdamW keeps two moment estimates per parameter)
Gradients: 4 Bytes * num_param
Forward activations: depend on batch size, sequence length and hidden size; they are kept for back-propagation
Sum: about 16 Bytes * num_param of weight/optimizer/gradient states, plus the activation memory
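A minimal sketch of this estimate in plain Python (the 7B parameter count in the example is just an illustrative assumption):

```python
def training_memory_gb(num_param: int) -> dict:
    """Rough fp32 + AdamW memory estimate, excluding forward activations."""
    GB = 1024 ** 3
    weights   = 4 * num_param        # fp32 weights
    gradients = 4 * num_param        # fp32 gradients
    optimizer = 4 * 2 * num_param    # AdamW: first and second moment
    total = weights + gradients + optimizer  # ~16 Bytes per parameter
    return {
        "weights_gb": weights / GB,
        "gradients_gb": gradients / GB,
        "optimizer_gb": optimizer / GB,
        "total_gb": total / GB,
    }

# Example: a 7B-parameter model needs roughly 104 GB before counting activations.
print(training_memory_gb(7_000_000_000))
```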

1.2 How to Reduce

Strategy 1: Trade training time for memory with the training options below (a configuration sketch follows the table).

| Optimization Strategy | Optimization Object | Description | Training Time |
| --- | --- | --- | --- |
| Baseline | - | - | - |
| + Gradient Accumulation | Forward propagation values | split a large batch into several micro-batches and accumulate the gradients | roughly unchanged |
| + Gradient Checkpointing (`TrainingArguments(gradient_checkpointing=True)`) | Forward propagation values | does not save the intermediate activations; recomputes them during back-propagation | more time -> less memory |
| + Adafactor Optimizer | Optimizer states | replaces AdamW's two moment estimates with factored statistics | - |
| + Freeze Model | Forward propagation values / Gradients | only part of the layers is trained | less |
| + Data Length | Forward propagation values | shorter sequences produce fewer activations | less |
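A minimal sketch of how these options map onto HuggingFace `TrainingArguments` (the batch size and accumulation steps are illustrative assumptions):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,   # small micro-batch per step...
    gradient_accumulation_steps=16,  # ...accumulated to an effective batch of 16
    gradient_checkpointing=True,     # recompute activations in the backward pass
    optim="adafactor",               # Adafactor instead of AdamW optimizer states
)
```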

Strategy 2: Reduce the number of trainable parameters with PEFT methods (Prompt Tuning, LoRA, ...); see the sketch below.
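A minimal LoRA sketch with the `peft` library (the checkpoint name and LoRA hyperparameters are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights is trainable
```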
Strategy 3: Reduce the number of bytes each parameter occupies.
The default precision is single precision (fp32), which uses 32 bits (4 Bytes) to represent one number.

| Name | Abbreviation | Bytes | Bits |
| --- | --- | --- | --- |
| Single-precision floating-point format | fp32 | 4 Bytes | 32 bits |
| Half-precision floating-point format | fp16 | 2 Bytes | 16 bits |
| Brain floating-point format (BFloat16) | bf16 | 2 Bytes | 16 bits |
| 8-bit integer | int8 | 1 Byte | 8 bits |
| 4-bit floating point | fp4 | 0.5 Bytes | 4 bits |
| 4-bit NormalFloat | nf4 | 0.5 Bytes | 4 bits |
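To make the difference between the half-precision formats concrete, a small PyTorch sketch (assumes torch is available):

```python
import torch

# Compare storage size, largest representable value and precision of the float formats.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    size = torch.tensor([], dtype=dtype).element_size()
    print(f"{str(dtype):15s} {size} Bytes  max={info.max:.3e}  eps={info.eps:.3e}")

# fp16 has more mantissa bits but a small range (max ~65504);
# bf16 keeps fp32's 8 exponent bits, so it has fp32's range with less precision.
```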

2 Precision

02 - Half precision & LLaMA 2
03 - Half precision & ChatGLM 3
04 - 8 Bit
05 - 4 Bit & QLoRA
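The numbered items above refer to hands-on notebooks; as a placeholder, here is a hedged sketch of how those precisions are selected when loading a model with `transformers` and `bitsandbytes` (the ChatGLM 3 checkpoint name and the 4-bit settings are illustrative assumptions; load only one variant at a time):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "THUDM/chatglm3-6b"  # illustrative checkpoint; any causal LM works the same way

# Half precision (fp16 / bf16): 2 Bytes per parameter.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.half, trust_remote_code=True
)

# 8-bit quantization via bitsandbytes: 1 Byte per parameter.
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True,
)

# 4-bit NF4 quantization (the configuration used by QLoRA): 0.5 Bytes per parameter.
model_nf4 = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",            # "fp4" is the other 4-bit option
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    ),
    trust_remote_code=True,
)
```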

Reference

手把手带你实战HuggingFace Transformers - 实战篇 (a hands-on HuggingFace Transformers tutorial series, practical part)

From: https://www.cnblogs.com/forhheart/p/18171303
