
AIGC-DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors-ECCV2024


Paper: https://arxiv.org/pdf/2310.12190
Code: https://github.com/Doubiiu/DynamiCrafter?tab=readme-ov-file

MOTIVATION

  • Traditional image animation techniques mainly focus on animating natural scenes with stochastic dynamics (e.g., clouds and fluid) or domain-specific motions (e.g., human hair or body motions), which limits their applicability to more general visual content.
  • VideoComposer [77] and I2VGen-XL are inadequate for image animation due to their less comprehensive image injection mechanisms, which result in either abrupt temporal changes or low visual conformity to the input image.


CONTRIBUTION

  • We introduce an innovative approach for animating open-domain images by leveraging video diffusion prior, significantly outperforming contemporary competitors.
  • We conduct a comprehensive analysis on the conditional space of text-to-video diffusion models and propose a dual-stream image injection paradigm to achieve the challenging goal of image animation.
  • We pioneer the study of text-based motion control for open-domain image animation and demonstrate the proof of concept through preliminary experiments.

RELATED WORKS

Image Animation

  • Early physics-simulation-based approaches focus on simulating the motion of specific objects, resulting in low generalizability due to the independent modeling of each object category.
  • To produce more realistic motion, reference-based methods transfer motion or appearance information from reference signals, such as videos, to the synthesis process. However, the need for additional guidance limits their practical application.
  • A stream of GAN-based works can generate frames by perturbing initial latents or performing a random walk in the latent vector space, but the generated motion is not plausible, since the animated frames are merely a visualization of the possible appearance space without temporal awareness.
  • (Learned) Motion-prior-based methods animate still images through explicit or implicit image-based rendering with estimated motion fields or geometry priors. Similarly, video prediction methods predict future video frames starting from single images by learning spatio-temporal priors from video data.
  • DRAWBACKS:
    • They primarily focus on animating motions in curated domains, particularly stochastic and oscillating motion.
    • The animated objects are limited to specific categories, e.g., fluid, natural scenes, human hair, portraits, and bodies.

Video Diffusion Models

  • The first video diffusion model (VDM) was proposed to model low-resolution videos using a space-time factorized U-Net in pixel space.
  • Imagen Video presents effective cascaded DMs with v-prediction for generating high-definition videos.

METHODS


Preliminary: Video Diffusion Models

  • Video encoding

    • Given a video $x$ of shape $L \times 3 \times H \times W$, where:
      • $L$ is the video length, i.e., the number of frames;
      • $3$ is the number of color channels (RGB);
      • $H$ and $W$ are the frame height and width.
    • The video is first encoded frame by frame into a latent representation $z = E(x)$ of shape $L \times C \times h \times w$, where:
      • $C$ is the number of latent channels;
      • $h$ and $w$ are the height and width of the latent representation.
  • Forward diffusion process

    • In latent space, the forward diffusion process $z_t = p(z_0, t)$ is performed. It gradually adds noise to the data sample $z_0$ to obtain $z_t$, simulating the transition from the original data to Gaussian noise.
  • Backward denoising process

    • Correspondingly, the backward denoising process $z_{t-1} = p_\theta(z_t, c, t)$ recovers the less noisy $z_{t-1}$ from the noisy input $z_t$ via the denoising network $\epsilon_\theta(z_t, c, t)$. This process is learnable and parameterized by $\theta$.
    • Here $c$ denotes optional denoising conditions, such as text prompts and other semantic information, which guide the video generation process.
  • Video decoding: once denoising is complete, the latent representation $z$ is mapped back to pixel space through the decoder $D(z)$ to produce the final video $\hat{x}$.
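
As a minimal illustration of this preliminary (not the authors' code), the sketch below encodes a video frame by frame with a placeholder per-frame VAE encoder and applies a standard DDPM-style forward noising; the `encoder` module and the `alphas_cumprod` schedule are assumptions for illustration.

```python
import torch

def encode_video(encoder, x):
    """Encode a video of shape (L, 3, H, W) frame by frame into latents (L, C, h, w).
    `encoder` is a placeholder per-frame VAE encoder (e.g., an SD-style autoencoder)."""
    with torch.no_grad():
        z0 = torch.stack([encoder(frame.unsqueeze(0)).squeeze(0) for frame in x])
    return z0

def forward_diffuse(z0, t, alphas_cumprod):
    """Forward process z_t = p(z_0, t): gradually corrupt z_0 with Gaussian noise.
    `alphas_cumprod` is a precomputed noise schedule of shape (T,)."""
    noise = torch.randn_like(z0)
    a_bar = alphas_cumprod[t]
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise
    return zt, noise
```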

Image Dynamics from Video Diffusion Priors

Request for T2V Models

  • To animate a still image with T2V generative priors, the visual information should be injected into the video generation process in a comprehensive manner.
  • The image should be digested by the T2V model for context understanding, which is important for dynamics synthesis.

Text-aligned context representation.

To guide video generation with the image context, we propose to project the image into a text-aligned embedding space:

  • the pre-trained CLIP text encoder constructs the text embedding;
  • its image encoder counterpart extracts image features from the input image.

Although the global semantic token $f_{cls}$ from the CLIP image encoder is well aligned with image captions, it mainly represents the visual content at the semantic level and fails to capture the image's full extent.

  • To extract more complete information, we use the full visual tokens $F_{vis} = \{f_i\}_{i=1}^K$ from the last layer of the CLIP image ViT, which have demonstrated high fidelity in conditional image generation works.

  • To promote alignment with the text embedding (i.e., to obtain a context representation that can be interpreted by the denoising U-Net), we utilize a learnable lightweight model $P$ to translate $F_{vis}$ into the final context representation $F_{ctx} = P(F_{vis})$.

    • $P$: the query transformer architecture used in multimodal fusion studies, which comprises $N$ stacked layers of cross-attention and feed-forward networks (FFN) and is adept at cross-modal representation learning via the cross-attention mechanism.
    • $F_{ctx}$: the final context representation (context embedding), $F_{ctx} = P(F_{vis})$.


  • Subsequently, the text embedding $F_{txt}$ and the context embedding $F_{ctx}$ are employed to interact with the U-Net intermediate features $F_{in}$ through dual cross-attention layers:
    $$\mathbf{F}_{\mathrm{out}}=\mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}_{\mathrm{txt}}^\top}{\sqrt{d}}\right)\mathbf{V}_{\mathrm{txt}}+\lambda\cdot\mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}_{\mathrm{ctx}}^\top}{\sqrt{d}}\right)\mathbf{V}_{\mathrm{ctx}},$$

    • $Q = F_{in}W_Q$
    • $K_{txt} = F_{txt}W_K$, $V_{txt} = F_{txt}W_V$
    • $K_{ctx} = F_{ctx}W_K'$, $V_{ctx} = F_{ctx}W_V'$
    • $\lambda$ denotes the coefficient that fuses the text-conditioned and image-conditioned features; it is implemented through tanh gating and adaptively learned for each layer.
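
A minimal single-head sketch of this dual cross-attention (module and parameter names are assumptions, not the released implementation): the query comes from the U-Net features, one branch attends to the text embedding, the other to the context embedding, and the image branch is fused through a per-layer tanh-gated, learnable $\lambda$.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """F_out = Attn(Q, K_txt, V_txt) + lambda * Attn(Q, K_ctx, V_ctx)."""
    def __init__(self, dim, ctx_dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k_txt = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v_txt = nn.Linear(ctx_dim, dim, bias=False)
        self.to_k_ctx = nn.Linear(ctx_dim, dim, bias=False)  # W'_K for the image-context stream
        self.to_v_ctx = nn.Linear(ctx_dim, dim, bias=False)  # W'_V for the image-context stream
        self.alpha = nn.Parameter(torch.zeros(1))            # lambda = tanh(alpha), learned per layer
        self.scale = dim ** -0.5

    def attend(self, q, k, v):
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

    def forward(self, f_in, f_txt, f_ctx):
        q = self.to_q(f_in)                                   # queries from U-Net features
        out_txt = self.attend(q, self.to_k_txt(f_txt), self.to_v_txt(f_txt))
        out_ctx = self.attend(q, self.to_k_ctx(f_ctx), self.to_v_ctx(f_ctx))
        lam = torch.tanh(self.alpha)                          # tanh gating of the image stream
        return out_txt + lam * out_ctx
```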

Discussion:

  • Why are text prompts necessary when a more informative context representation is provided?
    • A text-aligned context representation carries more extensive information than a text embedding, which may overburden the T2V model if it has to digest it on its own.
    • A still image typically admits multiple potential dynamic variations; text prompts can effectively guide the generation of dynamic content tailored to user preferences.
  • Why is a rich context representation necessary when the visual guidance provides the complete image?
    • The pre-trained T2V model comprises a semantic control space (text embedding) and a complementary random space (initial noise). While the random space effectively integrates low-level information, concatenating the noise of each frame with a fixed image can induce spatial misalignment, which may misguide the model in uncontrollable directions.
    • The precise visual context supplied by the image embedding can assist in the reliable utilization of visual details.

Observations and analysis of λ


  • Increasing $\lambda$ leads to suppressed cross-frame movements.
  • Decreasing $\lambda$ poses challenges in preserving the object's shape.
  • As the intermediate layers of the U-Net are more associated with object shapes or poses, while the two end layers are more linked to appearance, we expect the image features to primarily influence the videos' appearance while exerting relatively less impact on the shape.

Visual detail guidance (VDG)


  • PROBLEM (lack of visual conformity): minor discrepancies may still occur, mainly because the pre-trained CLIP image encoder has a limited ability to fully preserve the input image's information, as it is designed to align visual and language features.
  • VDG: concatenate the conditional image with the per-frame initial noise and feed them to the denoising U-Net as a form of guidance.
  • During training, we randomly select a video frame as the image condition of the denoising process through the proposed dual-stream image injection mechanism, so that the model inherits visual details and digests the input image in a context-aware manner.
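
A hedged sketch of how this detail-guidance stream could be assembled during training (the tensor layout and function name are assumptions): a randomly chosen frame is broadcast over time and concatenated with the per-frame noisy latents along the channel axis before entering the denoising U-Net.

```python
import torch

def build_unet_input(z_t, z0):
    """z_t: noisy latents (B, L, C, h, w); z0: clean latents of the same video.
    Randomly pick one frame per sample as the image condition, repeat it over
    all L frames, and concatenate along channels -> (B, L, 2C, h, w)."""
    B, L = z0.shape[:2]
    idx = torch.randint(0, L, (B,))                       # random conditioning frame per video
    cond = z0[torch.arange(B), idx]                       # (B, C, h, w)
    cond = cond.unsqueeze(1).expand(-1, L, -1, -1, -1)    # broadcast the condition over time
    return torch.cat([z_t, cond], dim=2)
```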

Training Paradigm

The conditional image is integrated through two complementary streams, which play the roles of context control and detail guidance, respectively. To modulate them in a cooperative manner, we devise a dedicated training strategy consisting of three stages:

  • training the image context representation network $P$ to extract text-aligned visual information from the input image
    • problem: $P$ takes numerous optimization steps to converge
    • solution: train it based on a lightweight T2I model instead of a T2V model, allowing it to focus on image context learning
  • adapting $P$ to the T2V model by jointly training $P$ and the spatial layers (as opposed to the temporal layers) of the T2V model
  • joint fine-tuning with VDG.
    • After establishing a compatible context-conditioning branch for T2V, we concatenate the input image with the per-frame noise for joint fine-tuning to enhance visual conformity.
    • We only fine-tune $P$ and the VDM's spatial layers to avoid disrupting the pre-trained T2V model's temporal prior knowledge with dense image concatenation, which could lead to significant performance degradation and contradict our original intention.
    • We randomly select a video frame as the image condition based on two considerations:
      • to prevent the network from learning a shortcut that maps the concatenated image to a frame at a specific location;
      • to force the context representation to be more flexible, avoiding overly rigid information tied to a specific frame, i.e., the objective in the context learning based on T2I.
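
The stage-wise fine-tuning could be expressed as a parameter-selection routine like the sketch below; the accessors `model.context_proj` (standing in for $P$), `model.spatial_layers()`, and `model.temporal_layers()` are hypothetical names, not the released API.

```python
def trainable_parameters(model, stage):
    """Select which parameters to optimize at each training stage (1, 2, or 3).
    `model.context_proj`, `model.spatial_layers()` are hypothetical accessors."""
    for p in model.parameters():
        p.requires_grad_(False)                       # freeze everything by default

    params = list(model.context_proj.parameters())    # P is trained in all three stages
    if stage >= 2:                                    # stages 2 and 3: also adapt spatial layers
        for layer in model.spatial_layers():
            params.extend(layer.parameters())
    # temporal layers are never unfrozen, preserving the T2V temporal prior

    for p in params:
        p.requires_grad_(True)
    return params
```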

Experiment

Implementation Details

Our development is based on the open-source T2V model VideoCrafter (at 256 × 256 resolution) and the T2I model Stable-Diffusion-v2.1 (SD).
Training:

  • We first train $P$ and the newly injected image cross-attention layers based on SD, for 1000K steps with a learning rate of $1 \times 10^{-4}$ and a valid mini-batch size of 64.
  • We then replace SD with VideoCrafter and further fine-tune $P$ and the spatial layers for 30K steps for adaptation, plus an additional 100K steps with image concatenation, at a learning rate of $5 \times 10^{-5}$ and a valid mini-batch size of 64.
  • DynamiCrafter was trained on the WebVid10M dataset by sampling 16 frames with dynamic FPS at a resolution of 256 × 256 per batch.

Inference:

  • We adopt the DDIM sampler with multi-condition classifier-free guidance.
  • Similar to video editing, we introduce two guidance scales, $s_{img}$ and $s_{txt}$, for text-conditioned image animation, which can be adjusted to trade off the impact of the two control signals:
    $$\begin{aligned} \hat{\epsilon}_{\theta}\left(\mathbf{z}_{t},\mathbf{c}_{\mathrm{img}},\mathbf{c}_{\mathrm{txt}}\right) &= \epsilon_{\theta}\left(\mathbf{z}_{t},\varnothing,\varnothing\right) \\ &+ s_{\mathrm{img}}\left(\epsilon_\theta\left(\mathbf{z}_t,\mathbf{c}_{\mathrm{img}},\varnothing\right)-\epsilon_\theta\left(\mathbf{z}_t,\varnothing,\varnothing\right)\right) \\ &+ s_{\mathrm{txt}}\left(\epsilon_\theta\left(\mathbf{z}_t,\mathbf{c}_{\mathrm{img}},\mathbf{c}_{\mathrm{txt}}\right)-\epsilon_\theta\left(\mathbf{z}_t,\mathbf{c}_{\mathrm{img}},\varnothing\right)\right) \end{aligned}$$
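
This multi-condition guidance can be computed with three noise-prediction evaluations per denoising step, as in the hedged sketch below; the `eps_model(z_t, t, c_img, c_txt)` signature is an assumption for illustration.

```python
def guided_eps(eps_model, z_t, t, c_img, c_txt, s_img, s_txt):
    """Multi-condition classifier-free guidance with image scale s_img and text scale s_txt.
    `eps_model` predicts noise; None stands for the null (dropped) condition."""
    e_uncond = eps_model(z_t, t, None, None)          # eps(z_t, ∅, ∅)
    e_img = eps_model(z_t, t, c_img, None)            # eps(z_t, c_img, ∅)
    e_full = eps_model(z_t, t, c_img, c_txt)          # eps(z_t, c_img, c_txt)
    return e_uncond + s_img * (e_img - e_uncond) + s_txt * (e_full - e_img)
```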

Quantitative Evaluation

  1. Purpose: to assess the quality and temporal coherence of the synthesized videos in both the spatial and temporal dimensions.

  2. Metrics

    • Fréchet Video Distance (FVD): a video-quality metric measuring the feature-level difference between synthesized and real videos.
    • Kernel Video Distance (KVD): another video-quality metric, which compares videos using kernel methods.
    • Perceptual Input Conformity (PIC): to further investigate the perceptual conformity between the input image and the animated result, PIC is introduced and computed as follows (see the sketch after this list):
      $$\text{PIC} = \frac{1}{L} \sum_{l=1}^{L} \left(1 - D(x_{\text{in}}, x_l)\right)$$
      • $x_{\text{in}}$ is the input image;
      • $x_l$ is the $l$-th video frame;
      • $L$ is the video length;
      • $D(\cdot, \cdot)$ is a perceptual distance function.
    • Perceptual distance: DreamSim [19] is used as the perceptual distance function $D$; it evaluates the perceptual similarity between two visual entities.
    • Zero-shot generation: the zero-shot generation performance of all methods is evaluated on the UCF-101 and MSR-VTT datasets, i.e., the models generate videos without having seen samples of the specific categories.
  3. Datasets

    • UCF-101 [70]: a widely used action-recognition dataset containing 101 human action categories.
    • MSR-VTT [85]: a video-captioning dataset containing videos and their corresponding descriptions.
  4. Evaluation setting: each error metric is evaluated on 16-frame videos at a resolution of 256 × 256.

  5. Results: the proposed method significantly outperforms previous methods on all metrics except KVD on UCF-101. This performance gain is attributed to the effective dual-stream image injection design, which fully exploits the video diffusion prior.
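
As referenced in the PIC item above, here is a minimal sketch of the metric; `dreamsim_distance` is a placeholder standing in for the DreamSim perceptual distance and is an assumption, not the authors' code.

```python
def perceptual_input_conformity(frames, x_in, dreamsim_distance):
    """PIC = (1/L) * sum_l (1 - D(x_in, x_l)), where D is a perceptual distance in [0, 1].
    `frames` is a list of video frames, `x_in` the input image,
    and `dreamsim_distance` a placeholder for the DreamSim distance function."""
    dists = [dreamsim_distance(x_in, x_l) for x_l in frames]
    return sum(1.0 - d for d in dists) / len(frames)
```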

Qualitative Evaluation


Discussions on Motion Control using Text

Captions in existing large-scale datasets often consist of a large number of scene-descriptive words combined with few dynamic/motion descriptions, potentially causing the model to overlook dynamics/motions during learning.
For image animation, the scene description is already contained in the image condition, while the motion description should be treated as the text condition so that the model is trained in a decoupled manner, giving it stronger text-based control over dynamics.

Dataset construction


  1. Purpose of decoupled training: to make the model generate dynamics according to text prompts, it is trained in a decoupled manner, i.e., its understanding of scene descriptions and of dynamic/motion descriptions is learned separately.

  2. Dataset construction: a new dataset is built by filtering and re-annotating the WebVid10M dataset. The new dataset contains purer motion-description phrases, e.g., "Man doing push-ups.", together with category labels such as "human".

  3. Model training: a model named DynamiCrafterDCP is trained on the newly constructed dataset.

  4. Effectiveness: the effectiveness of DynamiCrafterDCP is verified on 40 image-prompt test cases, which show humanoid images with multiple potential actions together with prompts describing various actions (e.g., "Man waving hands" and "Man clapping").

  5. Metric: the average CLIP similarity (CLIP-SIM) between the prompts and the video results is used. DynamiCrafterDCP improves the CLIP-SIM score from 0.17 to 0.19, indicating improved controllability.

  6. Comparison with other methods: the visual comparison in Figure 9 shows that Gen2 and PikaLabs cannot use text to control motion, whereas DynamiCrafter can reflect the text prompt, and this is further enhanced in DynamiCrafterDCP through the proposed decoupled training.


    IP-Adapter实现的SD垫图功能对我们的图片处理非常有用,后面我们会进行一系列IP-Adapter的应用分享,通过具体的实例真正看到IP-Adapter的强大。文章使用的AI工具SD整合包、各种模型插件、提示词、AI人工智能学习资料都已经打包好放在网盘中了,无需自行查找,有需要的小伙伴下方扫......