InstructGPT: Notes on "Training language models to follow instructions with human feedback"

Date: 2023-12-27 20:56:55

Tags: output, Training, feedback, training, model, summary, generation, InstructGPT

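Background

GPT-3 is impressive across NLP tasks and at text generation, but it still produces biased, untruthful, or harmful output that can have a negative social impact, and it often does not phrase things the way people prefer. Against this background, OpenAI frames the problem as one of "alignment": the model's output should match the user's actual intent and conform to human preferences. InstructGPT is the work that sets out to make model output better aligned with user intent.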

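Technical approach

Supervised fine-tuning (SFT) + reinforcement learning from human feedback (RLHF)

SFT (Supervised Fine-Tuning)

RLHF (Reinforcement Learning from Human Feedback)

RLHF has three main steps (described here with summarization as the example task; InstructGPT applies the same recipe to general prompts and responses):

1. Collect human feedback: use the initial model to generate several different summaries for each sample, then have human labelers rank those summaries by quality, yielding a set of ranked summary samples.

2. Train the reward model: using the samples from step 1, train a model that takes an article and one candidate summary as input and outputs a score for that summary.

3. Train the policy model: generate a summary for an article with the initial policy model, score it with the reward model, then use that score as the reward signal to optimize the policy model with PPO.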
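The hand-off from step 1 to step 2 can be sketched as follows: a single human ranking of K outputs is expanded into every (preferred, rejected) pair, which gives K(K-1)/2 training comparisons for the reward model. This is only a minimal illustration; the function name `ranking_to_pairs` and the placeholder summary strings are assumptions made for the example, not part of the original post.

```python
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Expand one ranking (best first) into (preferred, rejected) pairs.

    itertools.combinations keeps the input order, so the first element of
    every pair is always ranked higher than the second.
    """
    return list(combinations(ranked_outputs, 2))


# Hypothetical ranking of three summaries for one article, best first.
pairs = ranking_to_pairs(["summary_a", "summary_b", "summary_c"])
for preferred, rejected in pairs:
    print(f"{preferred} > {rejected}")
```

With K = 3 ranked outputs this yields 3 pairs; with K = 4 it yields 6, and so on.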

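Reward model

The reward model has the same architecture as GPT-3, except that the final layer is replaced with a projection layer that outputs a scalar score. Its loss function, as given in the InstructGPT paper, is shown below and is similar in spirit to learning to rank:

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]$$

Here $y_w$ is ranked ahead of $y_l$ for the same input $x$, $r_\theta(x, y)$ is the score the reward model assigns, $\sigma$ is the sigmoid function, and $K$ is the number of outputs ranked per input. Minimizing this loss amounts to maximizing the score difference of correctly ordered pairs.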
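To make the pairwise objective concrete, here is a minimal sketch in plain Python, assuming the per-pair loss takes the logistic form -log(sigmoid(score_w - score_l)), where score_w is the reward model's score for the preferred output and score_l the score for the rejected one. The function names and the example scores are hypothetical; a real reward model would compute the scores with a network and backpropagate through this loss.

```python
import math


def pairwise_ranking_loss(score_w, score_l):
    """Return -log(sigmoid(score_w - score_l)) for one (preferred, rejected) pair."""
    diff = score_w - score_l
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch on the sign of d so
    # that exp() is never called with a large positive argument.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def mean_pairwise_loss(score_pairs):
    """Average the pairwise loss over a list of (score_w, score_l) tuples."""
    return sum(pairwise_ranking_loss(w, l) for w, l in score_pairs) / len(score_pairs)


# Hypothetical scores: the first pair is ordered correctly, the second is not.
print(pairwise_ranking_loss(2.0, 0.0))
print(pairwise_ranking_loss(0.0, 2.0))
print(mean_pairwise_loss([(2.0, 0.0), (0.0, 2.0)]))
```

The loss shrinks toward zero as the preferred output's score pulls ahead of the rejected one, and grows roughly linearly when the order is wrong, which is why minimizing it widens the margin for correctly ordered pairs.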

From: https://www.cnblogs.com/xumaomao/p/17931400.html

