
Paper Analysis -- A Survey of Large Language Models


 

What is a language model? A generative model that completes text continuation or fill-in-the-blank.

Technically, language modeling (LM) is one of the major approaches to advancing language intelligence of machines.
In general, LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens.
The research of LM has received extensive attention in the literature, which can be divided into four major development stages: 

Statistical language models (SLM): statistical LMs, e.g., n-grams.
SLMs [4–7] are developed based on statistical learning methods that rose in the 1990s.
The basic idea is to build the word prediction model based on the Markov assumption, e.g., predicting the next word based on the most recent context.
The SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models.
SLMs have been widely applied to enhance task performance in information retrieval (IR) [8, 9] and natural language processing (NLP) [10–12].  
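
As a minimal illustration of the Markov assumption behind n-gram models, the sketch below estimates bigram probabilities P(w_t | w_{t-1}) from a toy corpus by maximum likelihood; the corpus and helper names are made up for illustration.

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate P(w_t | w_{t-1}) by maximum likelihood from a toy corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return lambda prev, word: bigrams[(prev, word)] / unigrams[prev]

prob = train_bigram(["the cat sat", "the cat ran"])
print(prob("the", "cat"))  # 1.0: "cat" always follows "the" in this toy corpus
print(prob("cat", "sat"))  # 0.5
```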

Neural language models (NLM): neural-network LMs, e.g., RNNs and word2vec.
NLMs [15–17] characterize the probability of word sequences by neural networks, e.g., recurrent neural networks (RNNs).
Further, word2vec [19, 20] was proposed to build a simplified shallow neural network for learning distributed word representations, which were demonstrated to be very effective across a variety of NLP tasks.
These studies have initiated the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP. 

Pre-trained language models (PLM): pre-trained LMs such as ELMo, BERT, and GPT-2, which require task-specific fine-tuning.
As an early attempt, ELMo [21] was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks.
Further, based on the highly parallelizable Transformer architecture [22] with self-attention mechanisms, BERT [23] was proposed by pre-training bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora.
These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks. This study has inspired a large number of follow-up works, which established the “pre-training and fine-tuning” learning paradigm.
Following this paradigm, a great number of studies on PLMs have been developed, introducing either different architectures [24, 25] (e.g., GPT-2 [26] and BART [24]) or improved pre-training strategies [27–29]. In this paradigm, it often requires fine-tuning the PLM for adapting to different downstream tasks. 

Large language models (LLM): much larger PLMs such as GPT-3 and PaLM, which exhibit emergent abilities.
Researchers find that scaling PLM (e.g., scaling model size or data size) often leads to an improved model capacity on downstream tasks (i.e., following the scaling law [30]).
A number of studies have explored the performance limit by training an ever larger PLM (e.g., the 175B-parameter GPT-3 and the 540B-parameter PaLM).
Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., 330M-parameter BERT and 1.5B-parameter GPT-2) and show surprising abilities (called emergent abilities [31]) in solving a series of complex tasks.
For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do well.
Thus, the research community coins the term “large language models (LLM)” for these large-sized PLMs [32–35].
A remarkable application of LLMs is ChatGPT that adapts the LLMs from the GPT series for dialogue, which presents an amazing conversation ability with humans. 

 

Differences between PLMs and LLMs: emergent abilities; prompting as the main mode of interaction; training LLMs requires extensive engineering experience.

First, LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs. These abilities are key to the performance of language models on complex tasks, making AI algorithms unprecedentedly powerful and effective.
Second, LLMs would revolutionize the way that humans develop and use AI algorithms. Unlike small PLMs, the major approach to accessing LLMs is through the prompting interface (e.g., GPT-4 API). Humans have to understand how LLMs work and format their tasks in a way that LLMs can follow.
Third, the development of LLMs no longer draws a clear distinction between research and engineering. The training of LLMs requires extensive practical experiences in large-scale data processing and distributed parallel training. To develop capable LLMs, researchers have to solve complicated engineering issues, working with engineers or being engineers. 

 

The biggest impact of LLMs this time is that they have prompted everyone to rethink the possibility of AGI. Their generalization ability beyond NLP has led some to believe that GPT-4 may already be an early form of AGI.

Nowadays, LLMs are posing a significant impact on the AI community, and the advent of ChatGPT and GPT-4 leads to the rethinking of the possibilities of artificial general intelligence (AGI).
OpenAI has published a technical article entitled “Planning for AGI and beyond”, which discusses the short-term and long-term plans to approach AGI [40], and a more recent paper has argued that GPT-4 might be considered as an early version of an AGI system [41].  

 

OVERVIEW 

Background for LLMs 

Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters, which are trained on massive text data [32], such as GPT-3 [55], PaLM [56], Galactica [35], and LLaMA [57]. LLMs exhibit strong capacities to understand natural language and solve complex tasks (via text generation). To have a quick understanding of how LLMs work, this part introduces the basic background for LLMs, including scaling laws, emergent abilities and key techniques. 

 

LLMs share the same architecture as smaller models; only the model size and data size grow dramatically.
Hence the KM scaling law and the Chinchilla scaling law are introduced to quantify the relationship between scale and capability.

Scaling Laws for LLMs. Currently, LLMs are mainly built upon the Transformer architecture [22], where multi-head attention layers are stacked in a very deep neural network.
Existing LLMs adopt similar Transformer architectures and pre-training objectives (e.g., language modeling) as small language models.
However, LLMs largely scale up the model size, data size, and total compute (by orders of magnitude).
Extensive research has shown that scaling can largely improve the model capacity of LLMs [26, 55, 56].
Thus, it is useful to establish a quantitative approach to characterizing the scaling effect. Next, we introduce two representative scaling laws for Transformer language models [30, 34]. 

KM scaling law. In 2020, Kaplan et al. [30] (the OpenAI team) first proposed to model the power-law relationship of model performance with respect to three major factors, namely model size (N), dataset size (D), and the amount of training compute (C), for neural language models. 
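
Each term of the KM scaling law takes a power-law form such as L(N) = (N_c / N)^{α_N}. The sketch below plugs in the approximate constants reported in [30] (quoted from memory, so treat the exact values as indicative rather than authoritative).

```python
def km_loss_from_model_size(n_params,
                            n_c=8.8e13,    # approximate critical model size from [30]
                            alpha_n=0.076):
    """Test loss predicted from model size alone: L(N) = (N_c / N) ** alpha_N."""
    return (n_c / n_params) ** alpha_n

def km_loss_from_data_size(n_tokens, d_c=5.4e13, alpha_d=0.095):
    """Test loss predicted from dataset size alone: L(D) = (D_c / D) ** alpha_D."""
    return (d_c / n_tokens) ** alpha_d

# Doubling the model size lowers the predicted loss by a constant factor 2 ** -alpha_N.
print(km_loss_from_model_size(1.5e9))    # ~GPT-2 scale
print(km_loss_from_model_size(175e9))    # ~GPT-3 scale
```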

 

The definition of emergent abilities, followed by several prominent examples:
ICL: completing a task from examples given in the context, with no additional training; its effectiveness depends on the specific downstream task.
Instruction tuning: fine-tuning on a mixture of multi-task datasets; the model can still solve unseen new tasks, showing strong generalization.
CoT: step-by-step reasoning ability.

Emergent Abilities of LLMs. In the literature [31], emergent abilities of LLMs are formally defined as “the abilities that are not present in small models but arise in large models”, which is one of the most prominent features that distinguish LLMs from previous PLMs. It further introduces a notable characteristic when emergent abilities occur [31]: performance rises significantly above random when the scale reaches a certain level. By analogy, such an emergent pattern has close connections with the phenomenon of phase transition in physics [31, 58]. In principle, emergent abilities can be defined in relation to some complex tasks [31, 59], while we are more concerned with general abilities that can be applied to solve a variety of tasks.
Here, we briefly introduce three typical emergent abilities for LLMs and representative models that possess such an ability.

In-context learning. The in-context learning (ICL) ability is formally introduced by GPT-3 [55]: assuming that the language model has been provided with a natural language instruction and/or several task demonstrations, it can generate the expected output for the test instances by completing the word sequence of input text, without requiring additional training or gradient update.
Among the GPT-series models, the 175B GPT-3 model exhibited a strong ICL ability in general, but not the GPT-1 and GPT-2 models. However, such an ability also depends on the specific downstream task.
For example, the ICL ability can emerge on arithmetic tasks (e.g., 3-digit addition and subtraction) for the 13B GPT-3, whereas even the 175B GPT-3 cannot work well on the Persian QA task.
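
For intuition, a few-shot ICL prompt for the 3-digit addition task mentioned above could be assembled as plain text; the demonstrations and template below are hypothetical.

```python
demonstrations = [
    ("123 + 456 =", "579"),
    ("208 + 311 =", "519"),
]
test_input = "342 + 177 ="

# In-context learning: demonstrations are concatenated into the prompt,
# and the model is asked to continue the text; no gradient update is involved.
prompt = "\n".join(f"{q} {a}" for q, a in demonstrations) + f"\n{test_input}"
print(prompt)
```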

Instruction following.  By fine-tuning with a mixture of multi-task datasets formatted via natural language descriptions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 61, 62]. With instruction tuning, LLMs are enabled to follow the task instructions for new tasks without using explicit examples, thus having an improved generalization ability. According to the experiments in [62], instruction-tuned LaMDA-PT [63] started to significantly outperform the untuned one on unseen tasks when the model size reached 68B, but not for 8B or smaller model sizes. A recent study [64] found that a model size of 62B is at least required for PaLM to perform well on various tasks in four evaluation benchmarks (i.e., MMLU, BBH, TyDiQA and MGSM), though a much smaller size might suffice for some specific tasks (e.g., MMLU). 

Step-by-step reasoning.  For small language models, it is usually difficult to solve complex tasks that involve multiple reasoning steps, e.g., mathematical word problems.
However, with the chain-of-thought (CoT) prompting strategy [33], LLMs can solve such tasks through prompts that include intermediate reasoning steps for deriving the final answer.
This ability is speculated to be potentially obtained by training on code [33, 47]. An empirical study [33] has shown that CoT prompting can bring performance gains (on arithmetic reasoning benchmarks) when applied to PaLM and LaMDA variants with a model size larger than 60B, while its advantage over standard prompting becomes more evident when the model size exceeds 100B. Besides, the performance improvement from CoT prompting also seems to vary across tasks, e.g., GSM8K > MAWPS > SWAMP for PaLM [33]. 
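
To make the contrast concrete, a standard demonstration gives only the final answer, while a CoT demonstration spells out the intermediate steps; the toy example below is written in the style of [33] and is illustrative only.

```python
standard_demo = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: 11"
)

cot_demo = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11."
)

# CoT prompting simply prepends demonstrations that spell out the
# intermediate reasoning steps before the final answer.
```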

 

What are the key techniques behind LLMs?

Scaling: parameter counts up to 175B or 540B.
Training: how to train on massive corpora efficiently and at low cost; training an LLM is still very expensive today.
Ability eliciting: using instruction tuning, ICL, or CoT to draw out the latent abilities of LLMs.
Alignment tuning: typified by InstructGPT; using human feedback and reinforcement learning to align LLMs with human values.
Tools: using various plugins to complement the capabilities of LLMs.

Key Techniques for LLMs. 

Scaling. As discussed in previous parts, there exists an evident scaling effect in Transformer language models: larger model/data sizes and more training compute typically lead to an improved model capacity [30, 34]. As two representative models, GPT-3 and PaLM explored the scaling limits by increasing the model size to 175B and 540B, respectively. 

Training.  Due to the huge model size, it is very challenging to successfully train a capable LLM.
Distributed training algorithms are needed to learn the network parameters of LLMs, in which various parallel strategies are often jointly utilized.
To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed [65] and Megatron-LM [66–68]. Besides, optimization tricks are also important for training stability and model performance, e.g., restart to overcome training loss spike [56] and mixed precision training [69].
More recently, GPT-4 [46] proposes to develop special infrastructure and optimization methods that reliably predict the performance of large models with much smaller models. 

Ability eliciting. After being pre-trained on large-scale corpora, LLMs are endowed with potential abilities as general-purpose task solvers.
However, these abilities might not be explicitly exhibited when LLMs perform some specific tasks. As a technical approach, it is useful to design suitable task instructions or specific in-context learning strategies to elicit such abilities. For instance, chain-of-thought prompting has been shown to be useful for solving complex reasoning tasks by including intermediate reasoning steps. Besides, we can further perform instruction tuning on LLMs with task descriptions expressed in natural language, improving the generalizability of LLMs on unseen tasks. However, these techniques mainly correspond to the emergent abilities of LLMs, which may not show the same effect on small language models. 

Alignment tuning. Since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), they are likely to generate toxic, biased, or even harmful content for humans. It is necessary to align LLMs with human values, e.g., helpful, honest, and harmless. For this purpose, InstructGPT [61] designs an effective tuning approach that enables LLMs to follow the expected instructions, which utilizes the technique of reinforcement learning with human feedback [61, 70]. 

Tools manipulation.  In essence, LLMs are trained as text generators over massive plain text corpora, thus performing less well on the tasks that are not best expressed in the form of text (e.g., numerical computation). Besides, their capacities are also limited to the pre-training data, e.g., the inability to capture up-to-date information. To tackle these issues, a recently proposed technique is to employ external tools to compensate for the deficiencies of LLMs [71, 72]. For example, LLMs can utilize the calculator for accurate computation [71] and employ search engines to retrieve unknown information [72]. More recently, ChatGPT has enabled the mechanism of using external plugins (existing or newly created apps), which serve, by analogy, as the “eyes and ears” of LLMs. Such a mechanism can broadly expand the scope of capacities for LLMs, as sketched below. 
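
A toy sketch of the tool-use idea: the model emits a structured call in its output text, and an external tool (here a calculator) fills in the exact result; the [CALC: ...] call format is invented purely for illustration.

```python
import re

def calculator(expression):
    # External tool: exact arithmetic that the LLM itself may get wrong.
    return str(eval(expression, {"__builtins__": {}}))

def run_with_tools(model_output):
    """Replace every [CALC: expr] span emitted by the model with the tool's result."""
    return re.sub(r"\[CALC:\s*([^\]]+)\]",
                  lambda m: calculator(m.group(1)), model_output)

print(run_with_tools("The total cost is [CALC: 137 * 23] dollars."))
# -> "The total cost is 3151 dollars."
```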

 

Technical Evolution of GPT-series Models 

OpenAI had the idea of building intelligent systems with language models very early on, back when it was still using RNNs; nowadays RNNs already sound like ancient weaponry.
Early Explorations. According to one interview with Ilya Sutskever (a co-founder and chief scientist of OpenAI), the idea of approaching intelligent systems with language models was already explored in the early days of OpenAI, while it was attempted with recurrent neural networks (RNN) [104]. With the advent of the Transformer, OpenAI developed two initial GPT models, namely GPT-1 [105] and GPT-2 [26], which can be considered as the foundation of the more powerful models that followed, i.e., GPT-3 and GPT-4. 

GPT-1: after Google introduced the Transformer model in 2017, OpenAI released GPT-1 in 2018.

GPT-1. In 2017, the Transformer model [22] was introduced by Google, and the OpenAI team quickly adapted their language modeling work to this new neural network architecture.
They released the first GPT model in 2018, i.e., GPT-1 [105], and coined the abbreviation term GPT as the model name, standing for Generative Pre-Training.
GPT-1 was developed based on a generative, decoder-only Transformer architecture, and adopted a hybrid approach of unsupervised pretraining and supervised fine-tuning.
GPT-1 has set up the core architecture for the GPT-series models and established the underlying principle to model natural language text, i.e., predicting the next word. 

GPT-2: the same architecture as GPT-1, but scaled to 1.5B parameters and trained on a large corpus of webpages.
Its biggest difference is the idea of a general-purpose model: rather than fine-tuning, it tries to solve all kinds of tasks directly by text continuation.

GPT-2. Following a similar architecture of GPT-1, GPT-2 [26] increased the parameter scale to 1.5B, which was trained with a large webpage dataset WebText.
As claimed in the paper of GPT-2, it sought to perform tasks via unsupervised language modeling, without explicit fine-tuning using labeled data.
To motivate the approach, they introduced a probabilistic form for multi-task solving, i.e., p(output|input,task) (similar approaches have been adopted in [106]), which predicts the output conditioned on the input and task information. To model this conditional probability, language text can be naturally employed as a unified way to format input, output and task information.
In this way, the process of solving a task can be cast as a word prediction problem for generating the solution text.
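
As an illustration of p(output|input, task), the task description, input, and output can all be serialized into one text stream that the model simply continues; the template below is a hypothetical example, not the exact format used by GPT-2.

```python
def format_as_text(task, source):
    # Task information, input, and output all live in the same text stream,
    # so solving the task reduces to predicting the next words.
    return f"{task}: {source}\nAnswer:"

prompt = format_as_text("translate English to French", "sea otter")
# The model is expected to continue the text with the solution, e.g. "loutre de mer".
```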

GPT-3: released in 2020, with 175B parameters; it introduced ICL in few-shot or zero-shot form.
The arrival of GPT-3 marks the true transition from the PLM era to the LLM era.

GPT-3. GPT-3 [55] was released in 2020, which scaled the model parameters to an ever larger size of 175B. In the GPT-3’s paper, it formally introduced the concept of in-context learning (ICL), which utilizes LLMs in a few-shot or zero-shot way. ICL can teach (or instruct) LLMs to understand the tasks in the form of natural language text.
With ICL, the pre-training and utilization of LLMs converge to the same language modeling paradigm: pre-training predicts the following text sequence conditioned on the context, while ICL predicts the correct task solution, which can be also formatted as a text sequence, given the task description and demonstrations.
GPT-3 not only demonstrates excellent performance in a variety of NLP tasks but also on a number of specially designed tasks that require the abilities of reasoning or domain adaptation.
Overall, GPT-3 can be viewed as a remarkable landmark in the journey evolving from PLMs to LLMs. It has empirically proved that scaling the neural networks to a significant size can lead to a huge increase in model capacity. 

GPT-3.5: two main improvements over GPT-3.
Training on code data, which not only gives the LLM the ability to generate code but also substantially improves its reasoning and CoT abilities.
Reinforcement learning from human feedback, typified by InstructGPT.

Training on code data. A major limitation of the original GPT-3 model (pre-trained on plain text) lies in the lack of reasoning ability on complex tasks, e.g., completing the code and solving math problems.
To enhance this ability, Codex [89] was introduced by OpenAI in July 2021, which was a GPT model fine-tuned on a large corpus of GitHub code. The paper demonstrated that Codex can solve very difficult programming problems and also leads to a significant performance improvement in solving math problems [109].
Further, a contrastive approach [110] to training text and code embedding was reported in January 2022, which was shown to improve a series of related tasks (i.e., linear-probe classification, text search and code search).
Actually, the GPT-3.5 models are developed based on a code-based GPT model (i.e., code-davinci-002), which indicates that training on code data is a very useful practice to improve the model capacity of GPT models, especially the reasoning ability. Besides, there is also speculation that training on code data can greatly increase the chain-of-thought prompting abilities of LLMs [47], while it is still worth further investigation with more thorough verification. 

Human alignment.  The related research on human alignment at OpenAI dates back to 2017 (or earlier): a blog article entitled “learning from human preferences” was posted on the OpenAI blog describing a work that applied reinforcement learning (RL) to learn from the preference comparisons annotated by humans [70] (similar to the reward training step in the aligning algorithm of InstructGPT in Figure 6). Shortly after the release of this RL paper [70], the Proximal Policy Optimization (PPO) paper [111] was published in July 2017, which has since become the foundational RL algorithm for learning from human preferences [61]. Later, in January 2020, GPT-2 was fine-tuned using the aforementioned RL algorithms [70, 111], which leveraged human preferences to improve the capacities of GPT-2 on NLP tasks. In the same year, another work [112] trained a summarization model to optimize human preferences in a similar way. Based on this prior work, InstructGPT [61] was proposed in January 2022 to improve the GPT-3 model for human alignment, which formally established a three-stage reinforcement learning from human feedback (RLHF) algorithm. Note that the wording “instruction tuning” is seldom used in OpenAI's papers and documentation; it is instead described as supervised fine-tuning on human demonstrations (i.e., the first step of the RLHF algorithm [61]). In addition to improving the instruction-following capacity, the RLHF algorithm is particularly useful to mitigate the issues of generating harmful or toxic content, which is key to the safe deployment of LLMs in practice. OpenAI describes their approach to alignment research in a technical article [113], which summarizes three promising directions: “training AI systems to use human feedback, to assist human evaluation and to do alignment research”. These enhancement techniques lead to improved GPT-3 models with stronger capacities, which are called GPT-3.5 models by OpenAI (see the discussion about the OpenAI API in Section 3.1). 

ChatGPT is essentially the same as InstructGPT, with additional optimization for dialogue.

ChatGPT. In November 2022, OpenAI released the conversation model ChatGPT, based on the GPT models (GPT-3.5 and GPT-4).
As the official blog article introduced [114], ChatGPT was trained in a similar way as InstructGPT (called “a sibling model to InstructGPT” in the original post), while specially optimized for dialogue.
They reported a difference between the training of ChatGPT and InstructGPT in the data collection setup: human-generated conversations (playing both the roles of user and AI) are combined with the InstructGPT dataset in a dialogue format for training ChatGPT. ChatGPT exhibited superior capacities in communicating with humans: possessing a vast store of knowledge, skill at reasoning on mathematical problems, tracing the context accurately in multi-turn dialogues, and aligning well with human values for safe use. Later on, the plugin mechanism has been supported in ChatGPT, which further extends the capacities of ChatGPT with existing tools or apps. So far, it seems to be the most powerful chatbot in AI history. The launch of ChatGPT will have a significant impact on future AI research, shedding light on the exploration of human-like AI systems. 

GPT-4: from text to multimodal input; stronger capabilities; safer.

GPT-4. As another remarkable progress, GPT-4 [46] was released in March 2023, which extended the text input to multimodal signals.
Overall, GPT-4 has stronger capacities in solving complex tasks than GPT-3.5, showing a large performance improvement on many evaluation tasks.
A recent study [41] investigated the capacities of GPT-4 by conducting qualitative tests with human-generated problems, spanning a diverse range of difficult tasks, and showed that GPT-4 achieves superior performance compared with prior GPT models such as ChatGPT.
Furthermore, GPT-4 responds more safely to malicious or provocative queries, due to a six-month iterative alignment (with an additional safety reward signal in the RLHF training).
In the technical report, OpenAI has emphasized how to safely develop GPT-4 and applied a number of intervention strategies to mitigate the possible issues of LLMs, such as hallucinations, privacy and overreliance. For example, they introduced the mechanism called red teaming [115] to reduce harmful or toxic content generation. As another important aspect, GPT-4 has been developed on a well-established deep learning infrastructure with improved optimization methods. They introduced a new mechanism called predictable scaling that can accurately predict the final performance with a small proportion of compute during model training. 

RESOURCES OF LLMS 

 

Publicly Available Model Checkpoints or APIs 

Models at the tens-of-billions scale

Models with Tens of Billions of Parameters.

Most of the models in this category have a parameter scale ranging from 10B to 20B, except LLaMA [57] (containing 65B parameters in the largest version) and NLLB [82] (containing 54.5B parameters in the largest version). Other models within this range include mT5 [74], PanGu-α [75], T0 [28], GPT-NeoX-20B [78], CodeGen [77], UL2 [80], Flan-T5 [64], and mT0 [84].
Among them, Flan-T5 (11B version) can serve as a premier model for research on instruction tuning, since it explores the instruction tuning from three aspects [64]: increasing the number of tasks, scaling the model size, and fine-tuning with chain-of-thought prompting data.
Besides, CodeGen (11B version), as an autoregressive language model designed for generating code, can be considered as a good candidate for exploring the code generation ability.
It also introduces a new benchmark MTPB [77] specially for multi-turn program synthesis, which is composed of 115 expert-generated problems. To solve these problems, LLMs need to acquire sufficient programming knowledge (e.g., math, array operations, and algorithms).
As for multilingual tasks, mT0 (13B version) might be a good candidate model, which has been fine-tuned on multilingual tasks with multilingual prompts.
Furthermore, PanGu-α [75] shows good performance in Chinese downstream tasks in zero-shot or few-shot settings, which is developed based on the deep learning framework MindSpore [117]. Note that PanGu-α [75] holds multiple versions of models (up to 200B parameters), while the largest public version has 13B parameters.
As a more recent release, LLaMA (65B version) [57], which contains approximately five times as many parameters as other models, has exhibited superior performance in tasks related to instruction following. Due to the openness and effectiveness, LLaMA has attracted significant attention from the research community, and many efforts [118–121] have been devoted to fine-tuning or continually pre-training its different model versions for implementing new models or tools.
Typically, pre-training models at this scale require hundreds or even thousands of GPUs or TPUs. For instance, GPT-NeoX-20B uses 12 supermicro servers, each equipped with 8 NVIDIA A100-SXM4-40GB GPUs, while LLaMA utilizes 2,048 A100-80G GPUs as reported in their original publications. To accurately estimate the computation resources needed, it is suggested to use the metrics measuring the number of involved computations such as FLOPS (i.e., FLoating point number Operations Per Second) [30]. 
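
As a rough rule of thumb (common in the scaling-law literature, not a formula from this survey), training a decoder-only Transformer takes about 6·N·D floating point operations for N parameters and D training tokens; the sketch below applies this estimate.

```python
def training_flops(n_params, n_tokens):
    """Rough training compute: ~6 FLOPs per parameter per token (forward + backward)."""
    return 6 * n_params * n_tokens

# e.g. a 175B-parameter model trained on 300B tokens:
c = training_flops(175e9, 300e9)
print(f"{c:.2e} FLOPs")                            # ~3.15e+23
print(f"{c / 1e15 / 86400:.0f} petaFLOP/s-days")   # ~3.6e3 PF-days at 1 PF/s sustained
```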

LLMs at the hundreds-of-billions scale; training them requires thousands of GPUs.

Models with Hundreds of Billions of Parameters.

For models in this category, only a handful of models have been publicly released. For example, OPT [81], OPT-IML [85], BLOOM [69], and BLOOMZ [84] have nearly the same number of parameters as GPT-3 (175B version), while GLM [83] and Galactica [35] have 130B and 120B parameters, respectively.
Among them, OPT (175B version) has been specially motivated for open sharing, which aims to enable researchers to carry out reproducible research at scale.
For research in cross-lingual generalization, BLOOM (176B version) and BLOOMZ (176B version) can be used as base models, due to the competence in multilingual language modeling tasks.
Among these models, OPT-IML has been tuned with instructions, which makes it a good candidate for studying the effect of instruction tuning.
Models of this scale typically require thousands of GPUs or TPUs to train. For instance, OPT (175B version) used 992 A100-80GB GPUs, while GLM (130B version) used a cluster of 96 NVIDIA DGX-A100 (8x40G) GPU nodes. 

OpenAI's APIs

Public API of LLMs.

Instead of directly using the model copies, APIs provide a more convenient way for common users to use LLMs, without the need of running the model locally.
As a representative interface for using LLMs, the APIs for the GPT-series models [46, 55, 61, 89] have been widely used in both academia and industry.
OpenAI has provided seven major interfaces to the models in GPT-3 series: ada, babbage, curie, davinci (the most powerful version in GPT-3 series), text-ada-001, text-babbage-001, and text-curie-001.
Among them, the first four interfaces can be further fine-tuned on the host server of OpenAI.
In particular, babbage, curie, and davinci correspond to the GPT-3 (1B), GPT-3 (6.7B), and GPT-3 (175B) models, respectively [55].

Besides, there are also two APIs related to Codex [89], called code-cushman-001 (a powerful and multilingual version of the Codex (12B) [89]) and code-davinci-002.
Further, GPT-3.5 series include one base model code-davinci-002 and three enhanced versions, namely text-davinci-002, text-davinci-003, and gpt-3.5-turbo-0301.
It is worth noting that gpt-3.5-turbo-0301 is the interface to invoke ChatGPT.
More recently, OpenAI has also released the corresponding APIs for GPT-4, including gpt-4, gpt-4-0314, gpt-4-32k, and gpt-4-32k-0314.
Overall, the choice of API interfaces depends on the specific application scenarios and response requirements. The detailed usage can be found on their project websites.
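
As an illustration, the sketch below invokes two of the interfaces listed above using the pre-1.0 openai Python SDK; the exact client calls may differ in newer SDK versions, so treat this as a usage sketch rather than canonical documentation.

```python
import openai  # pre-1.0 SDK interface assumed

openai.api_key = "YOUR_API_KEY"

# Completion-style interface (e.g. text-davinci-003 from the GPT-3.5 series).
completion = openai.Completion.create(
    model="text-davinci-003",
    prompt="Explain in one sentence what a scaling law is.",
    max_tokens=64,
)
print(completion["choices"][0]["text"])

# Chat-style interface (gpt-3.5-turbo / gpt-4 are invoked through the chat endpoint).
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain in one sentence what a scaling law is."}],
)
print(chat["choices"][0]["message"]["content"])
```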

Commonly Used Corpora 

- GPT-3 (175B) [55] was trained on a mixed dataset of 300B tokens, including CommonCrawl [132], WebText2 [55], Books1 [55], Books2 [55], and Wikipedia [128].

- PaLM (540B) [56] uses a pre-training dataset of 780B tokens, which is sourced from social media conversations, filtered webpages, books, Github, multilingual Wikipedia, and news.

- LLaMA [57] extracts training data from various sources, including CommonCrawl, C4 [73], Github, Wikipedia, books, ArXiv, and StackExchange. The training data size for LLaMA (6B) and LLaMA (13B) is 1.0T tokens, while 1.4T tokens are used for LLaMA (32B) and LLaMA (65B). 

Library Resource 

- Transformers [135] is an open-source Python library for building models using the Transformer architecture, which is developed and maintained by Hugging Face. It has a simple and user-friendly API, making it easy to use and customize various pre-trained models. It is a powerful library with a large and active community of users and developers who regularly update and improve the models and algorithms.

- DeepSpeed [65] is a deep learning optimization library (compatible with PyTorch) developed by Microsoft, which has been used to train a number of LLMs, such as MT- NLG [97] and BLOOM [69]. It provides the support of various optimization techniques for distributed training, such as memory optimization (ZeRO technique, gradient checkpointing), and pipeline parallelism.

- Megatron-LM [66–68] is a deep learning library developed by NVIDIA for training large-scale language models. It also provides rich optimization techniques for distributed training, including model and data parallelism, mixed-precision training, and FlashAttention. These optimization techniques can largely improve the training efficiency and speed, enabling efficient distributed training across GPUs.

- JAX [136] is a Python library for high-performance machine learning algorithms developed by Google, allowing users to easily perform computations on arrays with hardware acceleration (e.g., GPU or TPU). It enables efficient computation on various devices and also supports several featured functions, such as automatic differentiation and just-in-time compilation.

- Colossal-AI [137] is a deep learning library developed by HPC-AI Tech for training large-scale AI models. It is implemented based on PyTorch and supports a rich collection of parallel training strategies. Furthermore, it can also optimize heterogeneous memory management with methods proposed by PatrickStar [138]. Recently, a ChatGPT-like model called ColossalChat [121] has been publicly released with two versions (7B and 13B), which are developed using Colossal-AI based on LLaMA [57].

- BMTrain [139] is an efficient library developed by OpenBMB for training models with large-scale parameters in a distributed manner, which emphasizes code simplicity, low resource, and high availability. BMTrain has already incorporated several common LLMs (e.g., Flan-T5 [64] and GLM [83]) into its ModelCenter, where developers can use these models directly.

- FastMoE [140] is a specialized training library for MoE (i.e., mixture-of-experts) models. It is developed based on PyTorch, prioritizing both efficiency and user-friendliness in its design. FastMoE simplifies the process of transferring Transformer models to MoE models and supports both data parallelism and model parallelism during training.

Besides the above library resources, existing deep learning frameworks (e.g., PyTorch [141], TensorFlow [142], MXNet [143], PaddlePaddle [144], MindSpore [117] and OneFlow [145]) have also provided the support for parallel algorithms, which are commonly used for training large- scale models. 

 

PRE-TRAINING 

Data Collection 

Data Source 

The source of pre-training corpus can be broadly categorized into two types: general data and specialized data.

General data, such as webpages, books, and conversational text, is utilized by most LLMs [55, 56, 81] due to its large, diverse, and accessible nature, which can enhance the language modeling and generalization abilities of LLMs. In light of the impressive generalization capabilities exhibited by LLMs, there are also studies that extend their pre-training corpus to more specialized datasets, such as multilingual data, scientific data, and code, endowing LLMs with specific task-solving capabilities [35, 56, 77]. In what follows, we describe these two types of pre-training data sources and their effects on LLMs.  

Data Preprocessing 

Architecture 

In general, the mainstream architectures of existing LLMs can be roughly categorized into three major types, namely encoder-decoder, causal decoder, and prefix decoder.

The traditional Transformer architecture has both an encoder and a decoder; among PLMs, T5 and BART use it, but very few LLMs adopt this architecture.

Encoder-decoder Architecture. The vanilla Transformer model is built on the encoder-decoder architecture [22], which consists of two stacks of Transformer blocks as the encoder and decoder, respectively. The encoder adopts stacked multi-head self-attention layers to encode the input sequence for generating its latent representations, while the decoder performs cross-attention on these representations and autoregressively generates the target sequence.
Encoder-decoder PLMs (e.g., T5 [73] and BART [24]) have shown effectiveness on a variety of NLP tasks. So far, there are only a small number of LLMs that are built based on the encoder-decoder architecture, e.g., Flan-T5.

The mainstream architecture: the decoder uses a unidirectional attention mask and only attends to past context; the representative family is the GPT series, built as a stack of decoder layers.

Causal Decoder Architecture.  The causal decoder architecture incorporates a unidirectional attention mask, to guarantee that each input token can only attend to the past tokens and itself.
The input and output tokens are processed in the same fashion through the decoder.
As representative language models of this architecture, the GPT-series models [26, 55, 105] are developed based on the causal-decoder architecture. In particular, GPT-3 [55] has successfully demonstrated the effectiveness of this architecture, also showing an amazing in-context learning capability of LLMs. Interestingly, GPT-1 [105] and GPT-2 [26] do not exhibit such superior abilities as those in GPT-3, and it seems that scaling plays an important role in increasing the model capacity of this model architecture.
So far, causal decoders have been widely adopted as the architecture of various existing LLMs, such as OPT [81], BLOOM [69], and Gopher [59]. Note that both the causal decoder and the prefix decoder discussed next belong to decoder-only architectures. However, when mentioning the “decoder-only architecture”, existing literature mainly refers to the causal decoder architecture, unless specified otherwise. 

The difference lies mainly in the attention pattern: a causal decoder is unidirectional throughout, while a prefix decoder is bidirectional over the input (prefix) and unidirectional over the output.

Prefix Decoder Architecture.  The prefix decoder architecture (a.k.a., non-causal decoder [169]) revises the masking mechanism of causal decoders, to enable performing bidirectional attention over the prefix tokens [170] and unidirectional attention only on generated tokens. In this way, like the encoder-decoder architecture, the prefix decoders can bidirectionally encode the prefix sequence and autoregressively predict the output tokens one by one, where the same parameters are shared during encoding and decoding. Instead of pre-training from scratch, a practical suggestion is to continually train causal decoders and then convert them into prefix decoders for accelerating convergence [29], e.g., U-PaLM [102] is derived from PaLM [56]. Existing representative LLMs based on prefix decoders include GLM-130B [83] and U-PaLM [102]. 
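
The distinction between the two decoder-only variants lies entirely in the attention mask. Below is a minimal sketch (a hypothetical helper function, not from any specific codebase) that builds both masks:

```python
import torch

def decoder_mask(seq_len, prefix_len=0):
    """Return an attention mask (True = may attend).
    prefix_len = 0 gives a causal decoder; prefix_len > 0 gives a prefix decoder,
    where the first prefix_len tokens attend to each other bidirectionally."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal part
    mask[:prefix_len, :prefix_len] = True  # bidirectional attention over the prefix
    return mask

print(decoder_mask(4))                 # lower-triangular mask (causal decoder)
print(decoder_mask(4, prefix_len=2))   # first 2 tokens fully visible to each other (prefix decoder)
```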

 

Since the launch of Transformer [22], various improvements have been proposed to enhance its training stability, performance, and computational efficiency.
In this part, we will discuss the corresponding configurations for four major parts of the Transformer, including normalization, position embeddings, activation functions, and attention and bias.

Normalization. Training instability is a challenging issue for pre-training LLMs.
To alleviate this problem, layer normalization (Layer Norm, LN) [173] is widely employed in Transformer architectures.
The position of LN is vital to the performance of LLMs. While the initial Transformer [22] uses post-LN, most LLMs employ pre-LN for more stable training, despite a certain loss in performance [182].

Activation Functions.  To obtain good performance, activation functions also need to be properly set in feed-forward networks.
In existing LLMs, GeLU activations [185] are widely used.
Besides, in the latest LLMs (e.g., PaLM and LaMDA), variants of GLU activation [179, 186] have also been utilized, especially the SwiGLU and GeGLU variants, which often achieve better performance in practice [183]. 
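
For reference, the SwiGLU variant gates one linear projection of the input with the Swish (SiLU) of another; a minimal sketch of such a feed-forward layer follows (layer names and sizes are illustrative, not tied to any particular model).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """FFN(x) = W2 · (Swish(W x) ⊙ (V x)), the SwiGLU variant of the Transformer FFN."""
    def __init__(self, d_model=512, d_hidden=1365):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)   # gated branch
        self.v = nn.Linear(d_model, d_hidden, bias=False)   # linear branch
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w(x)) * self.v(x))

y = SwiGLUFeedForward()(torch.randn(2, 16, 512))  # (batch, seq, d_model) -> same shape
```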

Position Embeddings.  Since the self-attention modules in Transformer are permutation equivariant, position embeddings are employed to inject absolute or relative position information for modeling sequences.

Model Training 

 

ADAPTATION TUNING OF LLMS 

In this section, we introduce two major approaches to adapting pre-trained LLMs, namely instruction tuning and alignment tuning.
The former approach mainly aims to enhance (or unlock) the abilities of LLMs, while the latter approach aims to align the behaviors of LLMs with human values or preferences.

From here on, the survey discusses techniques for fine-tuning a pre-trained model.

Instruction Tuning

In essence, instruction tuning is the approach to fine-tuning pre-trained LLMs on a collection of formatted instances in the form of natural language [62], which is highly related to supervised fine-tuning [61] and multi-task prompted training [28]. In order to perform instruction tuning, we first need to collect or construct instruction-formatted instances. Then, we employ these formatted instances to fine-tune LLMs in a supervised learning way (e.g., training with the sequence-to-sequence loss). After instruction tuning, LLMs can demonstrate superior abilities to generalize to unseen tasks [28, 62, 64], even in a multilingual setting [84].

A recent survey [214] presents a systematic overview of the research on instruction tuning. In comparison to that, we mainly focus on the effect of instruction tuning on LLMs and provide detailed guidelines or strategies for instance collection and tuning. Besides, we also discuss the use of instruction tuning for satisfying the real needs of users, which has been widely applied in existing LLMs, e.g., InstructGPT [61] and GPT-4 [46]. 

How to construct instructions: Table 6 lists off-the-shelf datasets, and two construction methods are described next.

One converts existing labeled NLP datasets into a task format;
the other collects real human needs, e.g., from the OpenAI API, QA websites, and chat rooms.

Formatted Instance Construction

Generally, an instruction-formatted instance consists of a task description (called an instruction),  an input-output pair, and a small number of demonstrations (optional).
As important public resources, existing studies have released a large number of labeled data formatted in natural language (see the list of available resources in Table 6).
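
For concreteness, a hypothetical instruction-formatted instance following the three parts just described might look like this (fields and wording are illustrative, not taken from any released dataset):

```python
instance = {
    "instruction": "Classify the sentiment of the sentence as positive or negative.",
    "demonstrations": [  # optional few-shot exemplars
        {"input": "The soundtrack was breathtaking.", "output": "positive"},
    ],
    "input": "The plot dragged on and the ending felt rushed.",
    "output": "negative",
}
# During instruction tuning, such instances are linearized into a prompt/response
# pair and the model is fine-tuned with the usual sequence-to-sequence loss.
```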


Next, we introduce two major methods for constructing formatted instances (see an illustration in Figure 5) and then discuss several key factors for instance construction. 

The Effect of Instruction Tuning 

Performance Improvement.  Despite being tuned on a moderate number of instances, instruction tuning has become an important way to improve or unlock the abilities of LLMs [64].
Recent studies have experimented with language models in multiple scales (ranging from 77M to 540B), showing that the models of different scales can all benefit from instruction tuning [64, 217], yielding improved performance as the parameter scale increases [84]. 

Task Generalization.  Instruction tuning encourages the model to understand natural language instructions for task completion. It endows LLMs with the ability (often considered as an emergent ability) to follow human instructions [31] to perform specific tasks without demonstrations, even on unseen tasks [64]. 

 

Alignment Tuning 

How to make LLMs controllable: this mainly relies on reinforcement learning, i.e., RLHF.

It consists of three components: an LM to be aligned, a reward model that learns from feedback, and an RL algorithm.

Background. LLMs have shown remarkable capabilities in a wide range of NLP tasks [55, 56, 62, 81]. However, these models may sometimes exhibit unintended behaviors, e.g., fabricating false information, pursuing inaccurate objectives, and producing harmful, misleading, and biased expressions [61, 222].  

To avert these unexpected behaviors, human alignment has been proposed to make LLMs act in line with human expectations [61, 100]. However, unlike the original pre-training and adaptation tuning (e.g., instruction tuning), such an alignment requires considering very different criteria (e.g., helpfulness, honesty, and harmlessness). 

Reinforcement Learning from Human Feedback 

To align LLMs with human values, reinforcement learning from human feedback (RLHF) [70, 226] has been proposed to fine-tune LLMs with the collected human feedback data, which is useful to improve the alignment criteria (e.g., helpfulness, honesty, and harmlessness).
RLHF employs reinforcement learning (RL) algorithms (e.g.,  Proximal Policy Optimization (PPO) [111]) to adapt LLMs to human feedback by learning a reward model.
Such an approach incorporates humans in the training loop for developing well-aligned LLMs, as exemplified by InstructGPT [61]. 

RLHF System.  The RLHF system mainly comprises three key components: a pre-trained LM to be aligned, a reward model learning from human feedback, and a RL algorithm training the LM.  

The RLHF procedure consists of the following steps.

Step 1: supervised learning, e.g., instruction tuning.

Step 2: reward model training; human annotators rank the LM's outputs, and the RM learns to rank candidate results.

Step 3: the LM is fine-tuned with reinforcement learning; the key elements of the RL formulation are given below.

Supervised fine-tuning.  To make the LM initially perform desired behaviors, it usually needs to collect a supervised dataset containing input prompts (instructions) and desired outputs for fine-tuning the LM. These prompts and outputs can be written by human labelers for some specific tasks while ensuring the diversity of tasks. For example, InstructGPT [61] asks human labelers to compose prompts (e.g., “List five ideas for how to regain enthusiasm for my career”) and desired outputs for several generative tasks such as open QA, brainstorming, chatting, and rewriting. Note that the first step is optional in specific settings or scenarios. 

Reward model training.  The second step is to train the RM using human feedback data.
Specifically, we employ the LM to generate a certain number of output texts using sampled prompts (from either the supervised dataset or the human-generated prompt) as input. We then invite human labelers to annotate the preference for these pairs. The annotation process can be conducted in multiple forms, and a common approach is to annotate by ranking the generated candidate texts, which can reduce the inconsistency among annotators. Then, the RM is trained to predict the human-preferred output. In InstructGPT, labelers rank model-generated outputs from best to worst, and the RM (i.e., 6B GPT-3) is trained to predict the ranking. 
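
Ranking feedback is usually decomposed into pairwise comparisons, and the RM is commonly trained with a loss of the form -log σ(r(x, y_win) - r(x, y_lose)). A sketch of this loss, assuming a hypothetical reward_model callable that scores a (prompt, response) pair:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_model, prompt, better_response, worse_response):
    """-log sigmoid(r(x, y_win) - r(x, y_lose)): push the RM to score the
    human-preferred response higher than the rejected one."""
    r_win = reward_model(prompt, better_response)
    r_lose = reward_model(prompt, worse_response)
    return -F.logsigmoid(r_win - r_lose).mean()
```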

RL fine-tuning.  At this step, aligning (i.e., fine-tuning) the LM is formalized as an RL problem.
In this setting, the pre-trained LM acts as the policy that takes a prompt as input and returns an output text; its action space is the vocabulary, the state is the currently generated token sequence, and the reward is provided by the RM. To avoid deviating significantly from the initial (before tuning) LM, a penalty term is commonly incorporated into the reward function. For example, InstructGPT optimizes the LM against the RM using the PPO algorithm. For each input prompt, InstructGPT calculates the KL divergence between the generated results from the current LM and the initial LM as the penalty. It is noted that the second and final steps can be iterated in multiple turns for better aligning LLMs. 
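
In this setup, the reward being maximized is commonly the RM score minus a KL-style penalty toward the initial model. A sketch of this reward shaping, assuming log-probabilities have already been computed (the beta value is an illustrative placeholder):

```python
def shaped_reward(rm_score, logprob_current, logprob_initial, beta=0.02):
    """Reward used in RL fine-tuning: RM score minus a KL-style penalty that keeps
    the tuned policy close to the initial (pre-RL) model. beta is a tunable weight."""
    kl_penalty = logprob_current - logprob_initial  # log-ratio, summed over the sequence
    return rm_score - beta * kl_penalty
```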

 

Efficient Tuning 

In this section, we will discuss how to conduct efficient tuning on LLMs. We first review several representative parameter-efficient fine-tuning methods for Transformer language models, and then summarize existing work on parameter-efficient fine-tuned LLMs. 

(To be continued.)

 

 

 
