Paper:
https://arxiv.org/pdf/2310.12190
Code:
https://github.com/Doubiiu/DynamiCrafter?tab=readme-ov-file
MOTIVATION
- Traditional image animation techniques mainly focus on animating natural scenes with stochastic dynamics (e.g., clouds and fluid) or domain-specific motions (e.g., human hair or body motions), which limits their applicability to more general visual content
- VideoComposer [77] and I2VGen-XL fall short for image animation because their image injection mechanisms are less comprehensive, resulting in either abrupt temporal changes or low visual conformity to the input image
CONTRIBUTION
- We introduce an innovative approach for animating open-domain images by leveraging video diffusion prior, significantly outperforming contemporary competitors.
- We conduct a comprehensive analysis on the conditional space of text-to-video diffusion models and propose a dual-stream image injection paradigm to achieve the challenging goal of image animation.
- We pioneer the study of text-based motion control for open-domain image animation and demonstrate the proof of concept through preliminary experiments.
RELATED WORKS
Image Animation
- Early physical simulation-based approaches focus on simulating the motion of specific objects, resulting in low generalizability due to the independent modeling of each object category.
- To produce more realistic motion, reference-based methods transfer motion or appearance information from reference signals, such as videos, to the synthesis process. But the need for additional guidance limits their practical application.
- A stream of GAN-based works can generate frames by perturbing initial latents or performing a random walk in the latent space, but the generated motion is not plausible since the animated frames merely visualize the possible appearance space without temporal awareness
- (Learned) Motion prior-based methods animate still images through explicit or implicit image-based rendering with estimated motion field or geometry priors. Similarly, video prediction predicts future video frames starting from single images by learning spatio-temporal priors from video data.
- DRAWBACKS:
- they primarily focus on animating motions in curated domains, particularly stochastic and oscillating motion
- the animated objects are limited to specific categories, e.g., fluids, natural scenes, human hair, portraits, and bodies.
Video Diffusion Models
- The first video diffusion model (VDM) was proposed to model low-resolution videos using a space-time factorized U-Net in pixel space.
- Imagen-Video presents effective cascaded DMs with v-prediction for generating high-definition videos.
METHODS
Preliminary: Video Diffusion Models
- Video encoding:
  - Given a video $x$ of shape $L \times 3 \times H \times W$, where:
    - $L$ is the video length (number of frames),
    - $3$ is the number of color channels (RGB),
    - $H$ and $W$ are the frame height and width.
  - The video is first encoded frame-by-frame into a latent representation $z = E(x)$, where $z$ has shape $L \times C \times h \times w$:
    - $C$ is the number of latent channels,
    - $h$ and $w$ are the height and width of the latent representation.
- Forward diffusion process:
  - In the latent space, the forward diffusion process $z_t = p(z_0, t)$ is performed: noise is gradually added to the data sample $z_0$ to obtain $z_t$, simulating the transition from the original data to Gaussian noise.
- Backward denoising process:
  - Correspondingly, the backward denoising process $z_{t-1} = p_\theta(z_t, c, t)$ recovers the less noisy $z_{t-1}$ from the noisy input $z_t$ via the denoising network $\epsilon_\theta(z_t, c, t)$; this process is learnable, with parameters $\theta$.
  - $c$ denotes optional denoising conditions, such as semantic information from text prompts, which guide the video generation process.
- Video decoding: once denoising is complete, the latent representation $z$ is mapped back to video space through the decoder $D(z)$, producing the final video $\hat{x}$.
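To make the notation concrete, below is a minimal, self-contained sketch of this latent video diffusion loop. `encode`, `decode`, and `denoiser` are hypothetical stand-ins for the pre-trained VAE and denoising U-Net (not the repository's actual interfaces), and the update step is a plain DDPM posterior mean rather than the exact sampler used in the paper.

```python
import torch

# Illustrative shapes following the notation above: L x 3 x H x W video,
# L x C x h x w latents, T diffusion steps.
L, H, W, C, h, w, T = 16, 256, 256, 4, 32, 32, 1000

def encode(x):            # E(x): per-frame encoding into the latent space (stand-in)
    return torch.randn(x.shape[0], C, h, w)

def decode(z):            # D(z): per-frame decoding back to pixel space (stand-in)
    return torch.randn(z.shape[0], 3, H, W)

def denoiser(z_t, c, t):  # eps_theta(z_t, c, t): predicts the added noise (stand-in)
    return torch.randn_like(z_t)

# Standard DDPM noise schedule (beta_t) and its cumulative product (alpha_bar_t).
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

x = torch.rand(L, 3, H, W)           # input video
z0 = encode(x)                       # z = E(x), shape L x C x h x w

# Forward diffusion z_t = p(z_0, t): corrupt z_0 with Gaussian noise at step t.
t = 500
noise = torch.randn_like(z0)
z_t = alpha_bar[t].sqrt() * z0 + (1 - alpha_bar[t]).sqrt() * noise

# Backward denoising z_{t-1} = p_theta(z_t, c, t): one step, posterior mean only
# (the stochastic term is omitted for brevity).
c = "a text prompt"                  # denoising condition
eps = denoiser(z_t, c, t)
z_prev = (z_t - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / (1.0 - betas[t]).sqrt()

x_hat = decode(z_prev)               # D(z): map latents back to video space
```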
Image Dynamics from Video Diffusion Priors
Requirements for T2V Models
- To animate a still image with the T2V generative priors, the visual information should be injected into the video generation process in a comprehensive manner.
- The image should be digested by the T2V model for context understanding, which is important for dynamics synthesis
Text-aligned context representation.
To guide video generation with image context, we propose to project the image into a text-aligned embedding space
- the pre-trained CLIP text encoder constructs the text embedding
- the image encoder counterpart extracts image features from the input image
Although the global semantic token $f_{cls}$ from the CLIP image encoder is well-aligned with image captions, it mainly represents the visual content at the semantic level and fails to capture the image's full extent
- To extract more complete information, we use the full visual tokens $F_{vis} = \{f_i\}_{i=1}^{K}$ from the last layer of the CLIP image ViT, which have demonstrated high fidelity in conditional image generation works.
- To promote alignment with the text embedding (i.e., to obtain a context representation that can be interpreted by the denoising U-Net), we utilize a learnable lightweight model $P$ to translate $F_{vis}$ into the final context representation $F_{ctx} = P(F_{vis})$ (a minimal sketch of $P$ follows this list)
  - $P$: the query transformer architecture used in multimodal fusion studies, comprising $N$ stacked layers of cross-attention and feed-forward networks (FFN); it is adept at cross-modal representation learning via the cross-attention mechanism.
  - $F_{ctx}$: the final context representation (context embedding), $F_{ctx} = P(F_{vis})$
- Subsequently, the text embedding $F_{txt}$ and context embedding $F_{ctx}$ are employed to interact with the U-Net intermediate features $F_{in}$ through dual cross-attention layers (see the sketch below):

$$\mathbf{F}_{\mathrm{out}} = \mathrm{Softmax}\Big(\frac{\mathbf{Q}\mathbf{K}_{\mathrm{txt}}^{\top}}{\sqrt{d}}\Big)\mathbf{V}_{\mathrm{txt}} + \lambda \cdot \mathrm{Softmax}\Big(\frac{\mathbf{Q}\mathbf{K}_{\mathrm{ctx}}^{\top}}{\sqrt{d}}\Big)\mathbf{V}_{\mathrm{ctx}}$$

  - $Q = F_{in} W_Q$
  - $K_{txt} = F_{txt} W_K$, $V_{txt} = F_{txt} W_V$
  - $K_{ctx} = F_{ctx} W_K'$, $V_{ctx} = F_{ctx} W_V'$
  - $\lambda$ denotes the coefficient that fuses text-conditioned and image-conditioned features; it is realized via tanh gating and is adaptively learned for each layer.
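A minimal PyTorch sketch of the query-transformer projector $P$ described above: learnable queries attend to the CLIP visual tokens through stacked cross-attention and FFN layers. The feature dimension, number of queries, and layer count are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class QueryTransformer(nn.Module):
    """Sketch of the context projector P: learnable queries attend to the CLIP
    visual tokens F_vis through N stacked cross-attention + FFN layers."""
    def __init__(self, dim=1024, num_queries=16, num_layers=4, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_layers)])
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                           nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_layers)])

    def forward(self, f_vis):                  # f_vis: (B, K, dim) CLIP visual tokens
        q = self.queries.unsqueeze(0).expand(f_vis.size(0), -1, -1)
        for attn, ffn in zip(self.cross_attn, self.ffn):
            q = q + attn(q, f_vis, f_vis)[0]   # queries read from the visual tokens
            q = q + ffn(q)                     # position-wise refinement
        return q                               # F_ctx: (B, num_queries, dim)

# Usage: project full CLIP ViT tokens into the text-aligned context embedding.
f_vis = torch.randn(2, 257, 1024)              # e.g. last-layer ViT tokens (assumed size)
f_ctx = QueryTransformer()(f_vis)              # (2, 16, 1024)
```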
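And a sketch of the dual cross-attention itself: U-Net features query both the text embedding and the image context embedding, with the image branch scaled by a per-layer tanh-gated coefficient $\lambda$. The dimensions and the zero initialization of the gate are assumptions for illustration, not the official implementation.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Sketch of the dual-stream cross-attention: text branch + lambda-scaled
    image-context branch, with lambda = tanh(alpha) learned per layer."""
    def __init__(self, dim=320, ctx_dim=1024, num_heads=8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k_txt = nn.Linear(ctx_dim, dim, bias=False)   # W_K
        self.to_v_txt = nn.Linear(ctx_dim, dim, bias=False)   # W_V
        self.to_k_ctx = nn.Linear(ctx_dim, dim, bias=False)   # W_K'
        self.to_v_ctx = nn.Linear(ctx_dim, dim, bias=False)   # W_V'
        self.alpha = nn.Parameter(torch.zeros(1))              # gate starts closed (assumption)

    def attend(self, q, k, v):
        b, n, _ = q.shape
        def split(x):  # (B, N, dim) -> (B, heads, N, head_dim)
            return x.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(b, n, -1)

    def forward(self, f_in, f_txt, f_ctx):
        q = self.to_q(f_in)
        out_txt = self.attend(q, self.to_k_txt(f_txt), self.to_v_txt(f_txt))
        out_ctx = self.attend(q, self.to_k_ctx(f_ctx), self.to_v_ctx(f_ctx))
        return out_txt + torch.tanh(self.alpha) * out_ctx      # F_out

# Usage: fuse text tokens and image-context tokens into U-Net features.
f_in = torch.randn(2, 32 * 32, 320)    # flattened spatial features F_in
f_txt = torch.randn(2, 77, 1024)       # text embedding F_txt
f_ctx = torch.randn(2, 16, 1024)       # context embedding F_ctx from P
f_out = DualCrossAttention()(f_in, f_txt, f_ctx)
```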
Discussion:
- Why are text prompts necessary when a more informative context representation is provided?
- the text-aligned context representation carries more extensive information than the text embedding, which may overburden the T2V model's ability to digest it properly
- a still image typically admits multiple potential dynamic variations, so text prompts can effectively guide the generation of dynamic content tailored to user preferences
- Why is a rich context representation necessary when the visual guidance provides the complete image?
- the pre-trained T2V model comprises a semantic control space (text embedding) and a complementary random space (initial noise). While the random space effectively integrates low-level information, concatenating the noise of each frame with a fixed image potentially induces spatial misalignment, which may misguide the model in uncontrollable directions.
- the precise visual context supplied by the image embedding can assist in the reliable utilization of visual details
Observations and analysis of λ
- increasing $\lambda$ leads to suppressed cross-frame movements
- decreasing $\lambda$ poses challenges in preserving the object's shape
- As the intermediate layers of the U-Net are more associated with object shapes or poses, and the two-end layers are more linked to appearance , we expect that the image features will primarily influence the videos’ appearance while exerting relatively less impact on the shape.
Visual detail guidance (VDG)
- PROBLEM (lack of visual conformity): minor discrepancies may still occur, mainly due to the pre-trained CLIP image encoder's limited capability to fully preserve input image information, as it is designed to align visual and language features.
- VDG: concatenate the conditional image with per-frame initial noise and feed them to the denoising U-Net as a form of guidance
- During training, we randomly select a video frame as the image condition of the denoising process through the proposed dual-stream image injection mechanism, so that the model inherits visual details and digests the input image in a context-aware manner (see the sketch below).
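A minimal sketch of the detail-guidance concatenation, under assumed shapes and variable names (not the repository's actual code): the clean latent of the conditioning frame is repeated across time and concatenated with the per-frame noisy latents along the channel axis before entering the denoising U-Net.

```python
import torch

# Illustrative sizes: batch B, frames L, latent channels C, latent size h x w.
B, L, C, h, w = 2, 16, 4, 32, 32

z0 = torch.randn(B, L, C, h, w)                 # clean video latents E(x)
z_t = torch.randn(B, L, C, h, w)                # noisy latents at step t

# Training: randomly pick one frame as the image condition (avoids a
# position shortcut and keeps the context representation flexible).
frame_idx = torch.randint(0, L, (1,)).item()
cond = z0[:, frame_idx]                         # (B, C, h, w)

cond_rep = cond.unsqueeze(1).expand(-1, L, -1, -1, -1)
unet_input = torch.cat([z_t, cond_rep], dim=2)  # (B, L, 2C, h, w) -> denoising U-Net
```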
Training Paradigm
The conditional image is integrated through two complementary streams, which play the roles of context control and detail guidance, respectively. To modulate them in a cooperative manner, we devise a dedicated training strategy consisting of three stages (a rough parameter-freezing sketch follows this list):
- training the image context representation network $P$ to extract text-aligned visual information from the input image
  - problem: $P$ takes numerous optimization steps to converge
  - solution: train it based on a lightweight T2I model instead of a T2V model, allowing it to focus on image context learning
- adapting $P$ to the T2V model by jointly training $P$ and the spatial layers (in contrast to the temporal layers) of the T2V model
- joint fine-tuning with VDG.
- After establishing a compatible context conditioning branch for T2V, we concatenate the input image with per-frame noise for joint fine-tuning to enhance visual conformity
- we only fine-tune P and the VDM’s spatial layers to avoid disrupting the pre-trained T2V model’s temporal prior knowledge with dense image concatenation, which could lead to significant performance degradation and contradict our original intention.
- we randomly select a video frame as the image condition based on two considerations:
- to prevent the network from learning a shortcut that maps the concatenated image to a frame in the specific location
- to force the context representation to be more flexible, avoiding over-rigid information tied to a specific frame (i.e., the objective of the T2I-based context learning).
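A rough sketch of how the three-stage schedule could be expressed as parameter freezing; the module names (`context_proj`, `spatial_layers`, `temporal_layers`) are hypothetical placeholders rather than the repository's actual attributes.

```python
import torch.nn as nn

def set_trainable(modules, flag):
    """Enable/disable gradients for a list of nn.Module objects."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

def configure_stage(stage, context_proj, spatial_layers, temporal_layers):
    """Stage-wise freezing for the three-stage schedule:
    1: train P (context projector) on a T2I backbone;
    2: adapt P + spatial layers to the T2V backbone;
    3: joint fine-tuning of P + spatial layers with image concatenation (VDG).
    Temporal layers stay frozen throughout to preserve the motion prior."""
    set_trainable([context_proj, spatial_layers, temporal_layers], False)
    set_trainable([context_proj], True)
    if stage >= 2:
        set_trainable([spatial_layers], True)

# Usage with hypothetical placeholder modules:
ctx, spat, temp = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)
configure_stage(2, ctx, spat, temp)
```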
Experiment
Implementation Details
Our development is based on the open-source T2V model VideoCrafter (at 256 × 256 resolution) and the T2I model Stable-Diffusion-v2.1 (SD).
Training:
- first train $P$ and the newly injected image cross-attention layers based on SD, for 1000K steps with learning rate $1 \times 10^{-4}$ and a valid mini-batch size of 64.
- then replace SD with VideoCrafter and further fine-tune $P$ and the spatial layers for 30K steps for adaptation, plus an additional 100K steps with image concatenation, with learning rate $5 \times 10^{-5}$ and a valid mini-batch size of 64.
- DynamiCrafter was trained on the WebVid10M dataset by sampling 16 frames with dynamic FPS at 256 × 256 resolution for each sample in a batch.
Inference:
- adopt the DDIM sampler with multi-condition classifier-free guidance
- similar to video editing, we introduce two guidance scales $s_{\mathrm{img}}$ and $s_{\mathrm{txt}}$ for text-conditioned image animation, which can be adjusted to trade off the impact of the two control signals:

$$\begin{aligned}
\hat{\epsilon}_{\theta}\left(\mathbf{z}_{t},\mathbf{c}_{\mathrm{img}},\mathbf{c}_{\mathrm{txt}}\right) &= \epsilon_{\theta}\left(\mathbf{z}_{t},\varnothing,\varnothing\right) \\
&+ s_{\mathrm{img}}\big(\epsilon_{\theta}\left(\mathbf{z}_{t},\mathbf{c}_{\mathrm{img}},\varnothing\right)-\epsilon_{\theta}\left(\mathbf{z}_{t},\varnothing,\varnothing\right)\big) \\
&+ s_{\mathrm{txt}}\big(\epsilon_{\theta}\left(\mathbf{z}_{t},\mathbf{c}_{\mathrm{img}},\mathbf{c}_{\mathrm{txt}}\right)-\epsilon_{\theta}\left(\mathbf{z}_{t},\mathbf{c}_{\mathrm{img}},\varnothing\right)\big)
\end{aligned}$$
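A minimal sketch of this guidance rule; `eps_fn` is a hypothetical wrapper around the denoising U-Net in which `None` plays the role of the null condition $\varnothing$, and the scale values are illustrative defaults rather than the paper's settings.

```python
import torch

def multi_cond_cfg(eps_fn, z_t, c_img, c_txt, s_img=7.5, s_txt=7.5):
    """Combine unconditional, image-conditioned, and fully conditioned noise
    predictions according to the multi-condition guidance formula above."""
    e_uncond = eps_fn(z_t, None, None)         # eps(z_t, null, null)
    e_img = eps_fn(z_t, c_img, None)           # eps(z_t, c_img, null)
    e_full = eps_fn(z_t, c_img, c_txt)         # eps(z_t, c_img, c_txt)
    return (e_uncond
            + s_img * (e_img - e_uncond)
            + s_txt * (e_full - e_img))

# Usage with a dummy noise predictor:
eps_fn = lambda z, ci, ct: torch.zeros_like(z)
z_t = torch.randn(1, 16, 4, 32, 32)
eps_hat = multi_cond_cfg(eps_fn, z_t, c_img=None, c_txt=None)
```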
Quantitative Evaluation
- Evaluation goal: assess the quality and temporal coherence of the synthesized videos in both the spatial and temporal dimensions.
- Metrics:
  - Fréchet Video Distance (FVD): a video quality metric that measures feature-level differences between synthesized and real videos.
  - Kernel Video Distance (KVD): another video quality metric, which compares videos using kernel methods.
  - Perceptual Input Conformity (PIC): introduced to further investigate the perceptual conformity between the input image and the animated result (a minimal sketch of its computation follows this list). PIC is computed as
    $$\text{PIC} = \frac{1}{L} \sum_{l=1}^{L} \big(1 - D(x_{\text{in}}, x_l)\big)$$
    - $x_{\text{in}}$ is the input image
    - $x_l$ is the $l$-th video frame
    - $L$ is the video length
    - $D(\cdot, \cdot)$ is a perceptual distance function.
  - Perceptual distance: DreamSim [19] is used as the perceptual distance function $D$; it evaluates the perceptual similarity between two visual entities.
  - Zero-shot generation: the zero-shot generation performance of all methods is evaluated on UCF-101 and MSR-VTT, i.e., videos are generated without the model having seen samples of the specific categories.
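A minimal sketch of the PIC computation, assuming per-frame tensors and a perceptual distance callable returning values in $[0, 1]$; the paper uses DreamSim as $D$, and the dummy distance below is only a stand-in.

```python
import torch

def perceptual_input_conformity(frames, x_in, dist_fn):
    """PIC: average (1 - D(x_in, x_l)) over the L video frames."""
    scores = [1.0 - dist_fn(x_in, frame) for frame in frames]
    return sum(scores) / len(scores)

# Usage with a dummy distance function (stand-in for DreamSim):
dist_fn = lambda a, b: torch.mean(torch.abs(a - b)).item()
x_in = torch.rand(3, 256, 256)           # input image
frames = torch.rand(16, 3, 256, 256)     # animated video frames
pic = perceptual_input_conformity(frames, x_in, dist_fn)
```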
- Datasets:
  - UCF-101 [70]: a widely used action recognition dataset covering 101 human action categories.
  - MSR-VTT [85]: a video captioning dataset containing videos with corresponding descriptions.
- Evaluation setting: each error metric is evaluated on 16-frame videos at 256 × 256 resolution.
- Results: the proposed method significantly outperforms prior methods on all metrics except KVD on UCF-101; this gain is attributed to the effective dual-stream image injection design, which fully exploits the video diffusion prior.
Qualitative Evaluation
Discussions on Motion Control using Text
Captions in existing large-scale datasets often combine a large number of scene-descriptive words with few dynamics/motion descriptions, potentially causing the model to overlook dynamics/motions during learning
For image animation, the scene description is already included in the image condition, while the motion description should be treated as text condition to train the model in a decoupled manner, providing the model with stronger text-based control over dynamics.
Dataset construction
- Purpose of decoupled training: to train the model to generate dynamic content following text prompts, decoupled training is needed, i.e., the model learns to interpret scene descriptions and dynamics/motion descriptions separately.
- Dataset construction: a new dataset is built by filtering and re-annotating WebVid10M. It features purer motion-description phrases, e.g., "Man doing push-ups.", together with category labels such as "human".
- Model training: a model named DynamiCrafterDCP is trained on the newly constructed dataset.
- Effectiveness validation: DynamiCrafterDCP is validated on 40 image-prompt test cases, featuring humanoid images with multiple potential actions and prompts describing various actions (e.g., "Man waving hands" and "Man clapping").
- Metric: the average CLIP similarity (CLIP-SIM) between the prompt and the video results is measured (see the sketch at the end of this section). DynamiCrafterDCP improves the CLIP-SIM score from 0.17 to 0.19, indicating an improvement in performance.
- Comparison with other methods: the visual comparison in Fig. 9 shows that Gen2 and PikaLabs cannot use text to control motion, whereas DynamiCrafter reflects the text prompt, and this capability is further enhanced in DynamiCrafterDCP through the proposed decoupled training.
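A minimal sketch of the CLIP-SIM computation, under the assumption that CLIP features for the prompt and the video frames have already been extracted; the feature extraction itself is omitted and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_sim(frame_features, text_feature):
    """CLIP-SIM: cosine similarity between the prompt's CLIP text embedding
    and the per-frame CLIP image embeddings, averaged over frames."""
    img = F.normalize(frame_features, dim=-1)   # (L, D)
    txt = F.normalize(text_feature, dim=-1)     # (D,)
    return (img @ txt).mean().item()

# Usage with random placeholder features:
frames = torch.randn(16, 512)   # per-frame CLIP image features
prompt = torch.randn(512)       # CLIP text feature of the motion prompt
score = clip_sim(frames, prompt)
```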