What is pre-training?
Self-supervised learning on a large corpus of unlabeled data.
Pre-trained Model Architecture | Pre-training Task | Task Type | Examples
---|---|---|---
Encoder-only (AutoEncoder) | Masked Language Model | NLU | BERT family
Decoder-only (AutoRegressive) | Causal Language Model or Prefix Language Model | NLG | GPT, Llama, Bloom
Encoder-Decoder (Seq2Seq) | Sequence-to-Sequence Model | Conditional NLG | T5, BART
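A minimal sketch contrasting the two main objectives; the token IDs are made up for illustration, not from a real tokenizer:

```python
import torch

# A toy tokenized sentence (IDs are illustrative only).
tokens = torch.tensor([5, 12, 7, 3, 9])

# Masked Language Model (encoder-style, e.g. BERT): hide a token and
# predict it from the full bidirectional context.
MASK_ID = 0
mlm_input = tokens.clone()
mlm_input[2] = MASK_ID            # [5, 12, <mask>, 3, 9]
mlm_target = tokens[2]            # the model must recover token 7

# Causal Language Model (decoder-style, e.g. GPT, Llama): predict the next
# token from the left context only.
clm_input = tokens[:-1]           # [5, 12, 7, 3]
clm_target = tokens[1:]           # [12, 7, 3, 9]
```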
Layer Normalization
- Post-LN
- Pre-LN
- Sandwich-LN
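Post-LN and Pre-LN differ only in where the normalization sits relative to the residual connection; Sandwich-LN additionally normalizes the sub-layer output inside the residual branch. A minimal PyTorch sketch of the first two, where `sublayer` is a hypothetical stand-in for attention or the FFN:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, sublayer, pre_ln=True):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer
        self.pre_ln = pre_ln

    def forward(self, x):
        if self.pre_ln:
            # Pre-LN (GPT-3; Llama uses the same placement with RMSNorm):
            # normalize the sub-layer input, leave the residual path untouched,
            # which tends to stabilize training of deep stacks.
            return x + self.sublayer(self.norm(x))
        # Post-LN (original Transformer, BERT): normalize after the residual add.
        return self.norm(x + self.sublayer(x))

# Usage with a stand-in sub-layer (a real block would use attention or an FFN):
block = Block(512, nn.Linear(512, 512), pre_ln=True)
out = block(torch.randn(2, 10, 512))
```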
Model | Normalization
---|---
GPT3 | Pre Layer Norm |
Llama | Pre RMS Norm |
Baichuan | Pre RMS Norm |
ChatGLM-6B | Post Deep Norm |
ChatGLM2-6B | Post RMS Norm |
Bloom | Pre Layer Norm |
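Several of the models above replace standard LayerNorm with RMSNorm, which drops the mean-centering and the bias term and normalizes by the root-mean-square of the features only. A sketch of Llama-style RMSNorm; the `eps` value is a common default, not taken from any specific model config:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        # Learned per-feature scale; note there is no bias, unlike LayerNorm.
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the RMS over the last dimension (no mean subtraction).
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```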
Attention
- Bidirectional attention: Encoder
- Unidirectional (one-way, causal) attention: Decoder

In causal attention, the attention score matrix is masked to be lower-triangular, so each token attends only to itself and earlier positions, as in the sketch below.
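A minimal PyTorch illustration of the causal mask that produces this lower-triangular structure:

```python
import torch

seq_len = 4
# Position i may attend only to positions <= i, so the allowed-attention
# matrix is lower-triangular.
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])

scores = torch.randn(seq_len, seq_len)
# Disallowed positions are set to -inf before softmax, so they receive
# exactly zero attention weight.
masked_scores = scores.masked_fill(~mask, float("-inf"))
attn = torch.softmax(masked_scores, dim=-1)
```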