
Enhancing Diffusion Models with Reinforcement Learning

TL;DR

Today we're going to tell you all about DRLX - our library for Diffusion Reinforcement Learning! Released a few weeks ago, DRLX is a library for scalable distributed training of diffusion models with reinforcement learning. It currently implements the recent DDPO paper, with more RL implementations coming soon. We share here preliminary results training Stable Diffusion v2.1 with PickScore, observing better prompt adherence and image quality. Contributions and manpower for helping with additional features and experiments are always greatly appreciated, so if you're interested, please join us in the #drlx channel of our Discord server.

Introduction

Over the last year, Large Language Models (LLMs) have gained popularity in part due to instruction tuning, preference learning, and the training of chat models. This is often done with Reinforcement Learning from Human Feedback (RLHF), popularized by OpenAI with their work on InstructGPT and ChatGPT. RLHF is typically implemented by training a reward model on a dataset of human ratings/preferences over LM generations, and then using RL to optimize the reward yielded by the LM's generations. While some industry labs have established workflows and infrastructure internally, RLHF remains inaccessible to many other companies, institutions, and labs. CarperAI has been focused on democratizing RLHF through the development of open-source libraries like trlX, as well as by sharing research insights regarding the training of LLMs with RLHF.

A natural question to ask regarding RLHF is whether it can be applied to other model frameworks. Concurrent with the rise of LLMs have been advances in AI image generation, specifically with diffusion models. While text-to-image diffusion models have produced impressive results, the generations often fail to fully capture the prompts or satisfy human preferences. This raises the question of whether RLHF can be used to train diffusion models to better capture human preferences.

In May, a paper from the Levine Lab at UC Berkeley demonstrated how proximal policy optimization (PPO), the popular RL algorithm used in most implementations of RLHF, can be applied to train diffusion models. Their algorithm, called Denoising Diffusion Policy Optimization (DDPO), showed promising results with a variety of simple reward models.

This paper has led to significant interest in diffusion RLHF. Once again in line with our mission to democratize access to RLHF, we have released DRLX (Diffusion Reinforcement Learning X), a library for modular RLHF training of diffusion models. While it currently only implements DDPO, we are adding implementations of other RL algorithms.

Reward Models

A preliminary ingredient for RLHF is a reward model. At first, it may not be clear why we would want to predict "human preference" as a single number. The loss function for a reward model in the style of InstructGPT is as follows:
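With $r_\theta$ the reward model, $x$ a prompt, and $y_w$, $y_l$ the preferred and dispreferred generations in a comparison, the pairwise loss from InstructGPT is (up to notation):

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]
$$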

Put in simple terms, the output of a reward model trained on pairwise comparisons can be thought of as log odds. Specifically, the difference in outputs between two generations (images in our case), can be thought of as the log-odds that the first would be chosen over the second by a human. An individual score can be thought of as the log-odds an image has of being chosen by a human over any other particular image, so an image with a large reward output is more likely to be preferred by a human over other images with lower rewards.
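In equation form, the implied preference probability under this model is

$$
P(i \succ j) = \sigma\big(r_i - r_j\big),
$$

so, for example, a reward gap of $r_i - r_j = 2$ corresponds to $\sigma(2) \approx 0.88$, i.e. roughly an 88% chance that image $i$ is preferred over image $j$.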

DDPO algorithm

Before going over how DDPO works, let's recall how standard RLHF of language models works. The goal of RL algorithms is to optimize some reward. For RLHF, this is typically the output of a reward model that produces a preference score for the output of a generative model. The reward model is trained on tens of thousands of human preferences over LM generations. The language model's generation process is treated as a Markov Decision Process (MDP) where the state is the sequence of previous tokens, the policy is the language model, and the action is the predicted next token. With this setup, policy gradient methods like REINFORCE and PPO can be applied directly: they let us optimize the model using an estimator of the gradient of the expected reward with respect to the model parameters. Note that unlike many standard RL setups, the reward is only given at the end of a trajectory, once the full LM generation is available.
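In MDP notation (this is the standard formalization used for RLHF of LMs):

$$
s_t = (x,\, y_{1:t-1}), \qquad a_t = y_t, \qquad \pi_\theta(a_t \mid s_t) = p_\theta(y_t \mid x,\, y_{1:t-1}), \qquad R = r(x, y) \text{ only at the final step.}
$$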

DDPO also formulates the diffusion model's sampling process as an MDP, which allows policy gradient methods to be applied to RLHF tuning of diffusion models. The DDPO paper describes both the use of REINFORCE (referred to as DDPO_SF in the paper) and PPO-style importance sampling (referred to as DDPO_IS in the paper). Given the significant performance improvement of DDPO_IS over DDPO_SF (Figure 4 in the DDPO paper), we only implement the DDPO_IS algorithm in DRLX.
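Following the DDPO paper, the denoising process is cast as an MDP roughly as follows, where $c$ is the prompt and $x_t$ the noisy sample at denoising step $t$:

$$
s_t = (c,\, t,\, x_t), \qquad a_t = x_{t-1}, \qquad \pi_\theta(a_t \mid s_t) = p_\theta(x_{t-1} \mid x_t,\, c), \qquad R(s_t, a_t) = \begin{cases} r(x_0, c) & \text{at the final denoising step} \\ 0 & \text{otherwise.} \end{cases}
$$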

The loss function for DDPO_IS is given as follows:
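With the expectation taken over denoising trajectories sampled from the previous model $p_{\theta_\text{old}}$ (up to notation, following the DDPO paper):

$$
\mathcal{L}_{\text{DDPO}_{IS}}(\theta) = -\,\mathbb{E}\left[\sum_{t=0}^{T} \frac{p_\theta(x_{t-1} \mid x_t,\, c)}{p_{\theta_\text{old}}(x_{t-1} \mid x_t,\, c)}\; r(x_0,\, c)\right]
$$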

where $p_\theta$ is the diffusion model and the reward $r(\cdot)$ is given by the reward model, to which we pass the image and possibly a contextualizing prompt as input.

Additionally, the ratio in the loss function is clipped to prevent the current diffusion model from diverging too far from the model that generated the samples, which could otherwise lead to a bad model (this clipping is one of the key properties of the PPO algorithm).
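A minimal PyTorch-style sketch of this clipped per-timestep objective (names and shapes are illustrative, not DRLX's exact implementation):

import torch

def clipped_ddpo_loss(logprobs, old_logprobs, rewards, clip_eps=0.2):
    """Clipped surrogate loss over denoising timesteps (illustrative sketch).

    logprobs, old_logprobs: (batch, T) log p(x_{t-1} | x_t, c) under the current / sampling model
    rewards: (batch,) reward-model score for the final image x_0
    """
    ratio = torch.exp(logprobs - old_logprobs)            # importance weight per timestep
    adv = rewards.unsqueeze(-1)                           # reward broadcast over timesteps
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.minimum(unclipped, clipped).mean()      # PPO-style pessimistic bound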

For a deeper dive into the DDPO algorithm and how it is implemented, especially if you are new to RL, please check out this blog post for more information. 

Note that there are a couple of differences between DDPO and the standard application of PPO to RLHF:

  1. DDPO does not use a value function baseline, while RLHF of LMs typically implements one as a separate head on the LM and uses generalized advantage estimation.
  2. RLHF as implemented by OpenAI adds a KL penalty (against the initial model) to the reward, whereas DDPO does not.

These differences could provide directions for future work.

Results

LAION Aesthetic Score

Similar to the original paper, we train a Stable Diffusion model using DDPO with the LAION aesthetic score as the reward model. Unlike the original paper, though, we use Stable Diffusion 2.1 (the base version, to work with 512x512 images).

The LAION aesthetic classifier was trained on human ratings collected for images generated in the SimulacraBot Discord server with latent diffusion models that served as precursors to Stable Diffusion. It remains one of the most commonly used aesthetic scorers, despite significant issues with the dataset and model. Nevertheless, it serves as a useful baseline for our RLHF experiments here. 

Using DRLX to train this model is trivial. First we need a prompt distribution. Here we use random ImageNet animal classes as prompts, just like the original DDPO paper.

import random
import requests
from pathlib import Path

from drlx.pipeline import PromptPipeline


class ImagenetAnimalPrompts(PromptPipeline):
    """
    Pipeline of prompts consisting of animals from ImageNet, as used in the original DDPO paper.
    """
    def __init__(self, num=10000, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Download the ImageNet synset mapping and parse it into {synset_id: class_name}
        r = requests.get("https://raw.githubusercontent.com/formigone/tf-imagenet/master/LOC_synset_mapping.txt")
        with open("LOC_synset_mapping.txt", "wb") as f:
            f.write(r.content)
        self.synsets = {k: v for k, v in [o.split(',')[0].split(' ', maxsplit=1) for o in Path('LOC_synset_mapping.txt').read_text().splitlines()]}
        self.imagenet_classes = list(self.synsets.values())
        self.num = num

    def __getitem__(self, index):
        # The animal classes sit at the start of the ImageNet class list; sample one at random
        animal = random.choice(self.imagenet_classes[:397])
        return f'{animal}'

    def __len__(self):
        'Denotes the total number of samples'
        return self.num

You can see we simply load the ImageNet class names (downloaded from a text file online), randomly choose one of the animal classes at the start of the list, and return it as our prompt!

The next thing to set up is our reward function. We use the standard inference code for the aesthetic classifier, which you can see here.
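For reference, the inference path amounts to embedding the image with a frozen CLIP ViT-L/14 and regressing a scalar score with a small MLP. Below is a minimal sketch, with layer sizes matching the public LAION aesthetic predictor; the class and function names are illustrative rather than DRLX's exact API:

import torch
import torch.nn as nn
import clip  # OpenAI CLIP package


class AestheticMLP(nn.Module):
    """MLP head of the LAION aesthetic predictor (input: 768-dim CLIP ViT-L/14 image embedding)."""
    def __init__(self, input_size=768):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 1024), nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.Dropout(0.1),
            nn.Linear(64, 16),
            nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.layers(x)


def aesthetic_score(pil_images, clip_model, preprocess, mlp, device="cpu"):
    """Score a list of PIL images: CLIP-embed, L2-normalize, then regress a scalar aesthetic score."""
    batch = torch.stack([preprocess(img) for img in pil_images]).to(device)
    with torch.no_grad():
        feats = clip_model.encode_image(batch).float()
        feats = feats / feats.norm(dim=-1, keepdim=True)   # normalized embedding
        return mlp(feats).squeeze(-1)                       # one score per image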

Then we can define a RewardModel class for the aesthetic predictor:

from typing import Iterable

import numpy as np
import clip

# RewardModel (the base class), MLP, load_aesthetic_model_weights, and
# aesthetic_scoring are defined in DRLX's drlx.reward_modelling module.


class Aesthetics(RewardModel):
    """
    Reward model that rewards images with higher aesthetic score. Uses CLIP and an MLP (not put on any device by default)


    :param device: Device to load model on
    :type device: torch.device
    """
    def __init__(self, device=None):
        super().__init__()
        # MLP head operating on 768-dim CLIP ViT-L/14 image embeddings
        self.model = MLP(768)
        self.model.load_state_dict(load_aesthetic_model_weights())
        self.clip_model, self.preprocess = clip.load("ViT-L/14", device=device if device is not None else 'cpu')

        if device is not None:
            self.model.to(device)

    def forward(self, images: np.ndarray, prompts: Iterable[str]):
        # The aesthetic score is prompt-independent, so the prompts are ignored here
        return aesthetic_scoring(
            images,
            self.preprocess,
            self.clip_model,
            self.model
        )

As you can see, this is a very simple setup. We then need to set up a config file (you can see the defaults in configs.py):

method:
  name : "DDPO"


model:
  model_path: "stabilityai/stable-diffusion-2-1-base"
  model_arch_type: "LDMUnet"
  attention_slicing: True
  xformers_memory_efficient: True
  gradient_checkpointing: True


sampler:
  num_inference_steps: 50


optimizer:
  name: "adamw"
  kwargs:
    lr: 1.0e-5
    weight_decay: 1.0e-4
    betas: [0.9, 0.999]


scheduler:
  name: "linear" # Name of learning rate scheduler
  kwargs:
    start_factor: 1.0
    end_factor: 1.0
 
logging:
  run_name: 'ddpo_sd_imagenet'
  wandb_entity: 'tmabraham'
  wandb_project: 'DRLX'


train:
  num_epochs: 200
  num_samples_per_epoch: 256
  batch_size: 4
  sample_batch_size: 32
  grad_clip: 1.0
  checkpoint_interval: 50
  tf32: True
  suppress_log_keywords: "diffusers.pipelines,transformers"
  save_samples: False

Note the use of various performance optimizations already implemented in PyTorch or HuggingFace Diffusers. This includes tf32 support on Ampere GPUs, attention slicing, memory-efficient attention, and gradient checkpointing.
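These map onto standard PyTorch/Diffusers calls along the lines of the following (shown on a plain StableDiffusionPipeline for illustration; DRLX enables them from the config rather than through this exact code):

import torch
from diffusers import StableDiffusionPipeline

# TF32 matmuls on Ampere GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
pipe.enable_attention_slicing()                     # attention slicing
pipe.enable_xformers_memory_efficient_attention()   # memory-efficient attention via xformers
pipe.unet.enable_gradient_checkpointing()           # gradient checkpointing on the UNet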

Also note that while SD 2.1 is a v-prediction model, we do not have to specify this manually and the HuggingFace Diffusers schedulers handle this automatically for us.

With all of that out of the way, training with DRLX is extremely simple:

import torch
from drlx.trainer.ddpo_trainer import DDPOTrainer
from drlx.configs import DRLXConfig
from drlx.reward_modelling.aesthetics import Aesthetics
from drlx.pipeline.imagenet_animal_prompts import ImagenetAnimalPrompts


config = DRLXConfig.load_yaml("configs/ddpo_sd_imagenet.yml")
pipe = ImagenetAnimalPrompts(num=config.train.num_samples_per_epoch)
trainer = DDPOTrainer(config)
trainer.train(pipe, Aesthetics())

As you can see, our DDPOTrainer class handles everything! It just needs a prompt pipeline, a reward function, and a config. After training for 200 epochs with 256 samples per epoch, we get the following reward curve:

A consistently increasing mean reward score is exactly what we want. Here is a comparison of a few samples before and after training on aesthetic score.

Notice how the model settles on a specific reddish-orange color palette for the images, as well as a narrow depth-of-field effect that blurs the background. The subject tends to be centered, though not always. These effects of RLHF training are more apparent when observing even more samples:

I’d argue that these images certainly are more aesthetically pleasing than those generated by the base model, but still feel boring in some aspects, and lack diversity. 

It is interesting to observe how a sample changes over the course of training (consistent seed):

For example, these visualizations demonstrate that the unique color palette only starts to appear towards the end of training.
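Comparisons like this can be produced by loading each training checkpoint into a Diffusers pipeline and reusing the same seed; a minimal sketch, assuming the checkpoints have been exported in Diffusers pipeline format (the paths and prompt below are illustrative):

import torch
from diffusers import StableDiffusionPipeline

# Hypothetical checkpoint directories saved during training
checkpoints = ["output/epoch_50", "output/epoch_100", "output/epoch_200"]
prompt = "hedgehog"

for ckpt in checkpoints:
    pipe = StableDiffusionPipeline.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda")
    generator = torch.Generator("cuda").manual_seed(42)   # identical seed for every checkpoint
    image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]
    image.save(ckpt.replace("/", "_") + ".png")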

RL Training with PickScore

One particular issue with using the LAION aesthetic classifier as a reward model is that it does not take the prompt into consideration. This means that optimizing the LAION aesthetic score could give us a model that always outputs the same high-scoring image regardless of the prompt. In fact, in earlier experiments with Stable Diffusion v1.4, we did observe this behavior, with a noisy adversarial pattern receiving scores of ~7.1:

This behavior is sometimes referred to as reward hacking in the RL literature. To address this challenge, it would be ideal to ensure that the generated images are well aligned with the provided prompt, and to optimize for this as well. Recently, researchers have been developing preference scores for AI-generated images that also focus on prompt alignment. PickScore is one such preference model, developed in collaboration with Stability AI researchers. It is a CLIP-ViT-H model finetuned on the Pick-a-Pic dataset of user preferences over Stable Diffusion XL preview beta images. Unlike the LAION aesthetic predictor, it utilizes the CLIP text encoder as well, which enables the PickScore model to better capture prompt alignment. For this reason, we tested RL training of Stable Diffusion 2.1 with PickScore as the reward model on the Pick-a-Pic dataset.
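For reference, scoring an image with PickScore outside of DRLX looks roughly like the following, a minimal sketch following the usage example published with the PickScore release (hub IDs per that release):

import torch
from transformers import AutoModel, AutoProcessor

device = "cuda"
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval().to(device)


def pickscore(prompt, pil_images):
    """Return a PickScore for each image given the prompt (higher = more preferred)."""
    image_inputs = processor(images=pil_images, return_tensors="pt").to(device)
    text_inputs = processor(text=prompt, padding=True, truncation=True, max_length=77, return_tensors="pt").to(device)
    with torch.no_grad():
        image_embs = model.get_image_features(**image_inputs)
        image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
        text_embs = model.get_text_features(**text_inputs)
        text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
        # Scaled cosine similarity between the prompt and each image
        return (model.logit_scale.exp() * text_embs @ image_embs.T)[0]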

Our training code is set up in a similar manner to the LAION aesthetic classifier-based training:

import torch
from drlx.trainer.ddpo_trainer import DDPOTrainer
from drlx.configs import DRLXConfig
from drlx.pipeline.pickapic_prompts import PickAPicPrompts, PickAPicReplacementPrompts
from drlx.reward_modelling.pickscore import PickScoreModel
from drlx.reward_modelling.aesthetics import Aesthetics


pipe = PickAPicPrompts()
config = DRLXConfig.load_yaml("configs/ddpo_sd_pickapic.yml")
trainer = DDPOTrainer(config)
trainer.train(pipe, PickScoreModel())

We train the model with a guidance scale of 7.5, for 450 epochs, with 256 samples per epoch and a batch size of 16 (divided over four 80GB A100 GPUs, so a batch size of 4 per GPU). Again, we use performance optimizations similar to those used before.

We once again observe a consistently increasing mean reward score. Below you can see a few examples of generated images from different checkpoints during training:

While not all examples show a demonstrable improvement, we do see in some instances both improved prompt alignment and better aesthetic quality. For example, in the “A panda bear as a mad scientist” example, we see that at the end of training the panda arguably looks more “mad” and is handling glassware as if it were a scientist. The “jedi duck” is also quite a good example, where there is a much clearer depiction of the duck holding a lightsaber. On the other hand, the final model's generation for the prompt “a golden retriever jumping over a box” fails to actually show a golden retriever jumping over a box, but it is still quite an aesthetically pleasing image.

Once again, we can observe how a sample changes over the course of training (consistent seed):

We have released the weights of our DDPO-trained SD 2.1 with PickScore on Pick-A-Pic prompts to HuggingFace over here.

Discussion and Conclusion

Using RLHF to improve image generation models is a burgeoning field and there is lots of work left to do. On the generative model side, there are many different RL algorithms to try (DPO, offline RL, ReST, etc.) and tweaks that could be made to the RL setup. On the reward model side, simply using CLIP is quite simplistic and would be considered insufficient for an instruct language model, so perhaps something new is in order for image reward models. There is also a need to translate advances from instruct language generation to diffusion for images and audio, with new algorithms for the former task coming out every few weeks. If you'd like to join us on the journey of bringing RLHF up to speed for image generation, feel free to say hello in the #drlx channel, and check the GitHub issues for tasks to help out with!
