
【Coursera GenAI with LLM】 Week 3 Reinforcement Learning from Human Feedback Class Notes


Helpful? Honest? Harmless? Make sure the AI's responses meet these three criteria (HHH).

If not, we can use RLHF to reduce the toxicity of the LLM.

Reinforcement learning (RL): a type of machine learning in which an agent learns to make decisions related to a specific goal by taking actions in an environment, with the objective of maximizing some notion of cumulative reward. RLHF can also help build personalized LLMs.
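In the RLHF setting, the agent is the LLM (the policy), an action is generating a completion for a prompt, and the scalar reward comes from a reward model. A minimal sketch of that loop, where `policy.generate`, `reward_model`, and `update_policy` are hypothetical placeholders rather than real library calls:

```python
# Minimal sketch of the RL loop specialized to RLHF.
# `policy.generate`, `reward_model`, and `update_policy` are hypothetical
# placeholders, not calls from any real library.

def rlhf_loop(policy, reward_model, prompts, update_policy, num_iterations=10):
    for _ in range(num_iterations):
        for prompt in prompts:                                 # state: the prompt/context
            completion = policy.generate(prompt)               # action: a generated completion
            reward = reward_model(prompt, completion)          # scalar reward from the reward model
            update_policy(policy, prompt, completion, reward)  # RL step (e.g. a PPO update)
    return policy
```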

RLHF cycle (iterate until the reward score is high):

  1. Select an instruct model and define your model alignment criterion (e.g. helpfulness)
  2. Obtain human feedback through a labeler workforce that ranks the completions
  3. Convert the rankings into pairwise training data for the reward model
  4. Train the reward model to predict the preferred completion from {y_j, y_k} for prompt x (see the loss sketch after this list)
  5. Use the reward model as a binary classifier to automatically provide a reward value for each prompt-completion pair
    The lower the reward score, the worse the performance
    softmax(logits) = probabilities
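A minimal sketch of the pairwise reward-model loss used in steps 3-5, assuming PyTorch and that `r_j`, `r_k` are the reward model's scalar logits for the preferred and rejected completions of the same prompt:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_j: torch.Tensor, r_k: torch.Tensor) -> torch.Tensor:
    """Pairwise loss for training a reward model.

    r_j: scalar logits for the human-preferred completions y_j
    r_k: scalar logits for the rejected completions y_k
    Minimizing -log(sigmoid(r_j - r_k)) pushes the model to score the
    preferred completion higher, i.e. softmax over the two logits puts
    more probability on y_j.
    """
    return -F.logsigmoid(r_j - r_k).mean()

# Toy usage with hypothetical reward logits
r_j = torch.tensor([2.1, 0.3])   # preferred completions
r_k = torch.tensor([0.5, -0.2])  # rejected completions
loss = reward_model_loss(r_j, r_k)
```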

RL Algorithm

  • The RL algorithm updates the weights of the LLM based on the reward assigned to the completions generated by the current version of the LLM
  • Examples: Q-learning, PPO (Proximal Policy Optimization, the most popular method)
  • PPO optimizes the LLM to be more aligned with human preferences (a minimal sketch of its clipped objective follows this list)
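A minimal sketch of PPO's clipped surrogate objective in PyTorch, assuming per-token log-probabilities from the current and old policy and an advantage estimate derived from the reward:

```python
import torch

def ppo_clipped_loss(logprobs_new: torch.Tensor,
                     logprobs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective used by PPO.

    The ratio of new-to-old policy probabilities is clipped to
    [1 - eps, 1 + eps] so that a single update cannot move the policy
    too far from the one that generated the completions.
    """
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the objective, so we minimize its negative
    return -torch.min(unclipped, clipped).mean()
```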

Reward hacking: the model achieves a high reward score, but its outputs don't actually align with the criterion, so the quality is not improved

  • To avoid this, we can use the initial instruct model (aka the reference model): during training, we pass the prompt dataset to both the reference model and the RL-updated LLM

  • Then we calculate a KL divergence shift penalty (KL divergence is a statistical measure of how different two probability distributions are) between the two models' output distributions

  • Add the penalty to the reward, then run the PPO update (PEFT can be used so that only adapter weights are updated), and feed the new completions back to the reward model (a sketch of the penalized reward follows this list)
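A minimal sketch of the KL-shift penalty, assuming per-token log-probabilities over the same completion from the RL-updated policy and the frozen reference model; the penalty is subtracted from the reward-model score before the PPO update:

```python
import torch

def penalized_reward(reward_score: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_reference: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Subtract a KL-divergence penalty from the reward-model score.

    The per-token term log p_policy - log p_reference (a sampled estimate of
    the KL divergence) measures how far the RL-updated LLM has drifted from
    the reference (initial instruct) model; penalizing it discourages
    reward hacking.
    """
    kl_per_token = logprobs_policy - logprobs_reference
    kl_penalty = kl_coef * kl_per_token.sum()
    return reward_score - kl_penalty
```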

Constitutional AI

  • First proposed in 2022 by researchers at Anthropic
  • A method for training models using a set of rules and principles that govern the model's behavior

Red teaming: deliberately prompt the model to generate harmful responses; the model then critiques and revises those responses according to the principles, and the revised (harmless) responses are used for further fine-tuning
