# Action-Conditioned 3D Human Motion Synthesis With Transformer VAE #paper
1. paper-info
1.1 Metadata
- Author:: [[Mathis Petrovich]], [[Michael J. Black]], [[Gül Varol]]
- Affiliation::
- Keywords:: #HMP , #Transformer , #CVAE
- Journal:: #ICCV
- Date:: [[2021]]
- Status:: #Done
- Link:: https://openaccess.thecvf.com/content/ICCV2021/html/Petrovich_Action-Conditioned_3D_Human_Motion_Synthesis_With_Transformer_VAE_ICCV_2021_paper.html?ref=https://githubhelp.com
- Modified:: 2022.11.6
1.2. Abstract
We tackle the problem of action-conditioned generation of realistic and diverse human motion sequences. In contrast to methods that complete, or extend, motion sequences, this task does not require an initial pose or sequence. Here we learn an action-aware latent representation for human motions by training a generative variational autoencoder (VAE). By sampling from this latent space and querying a certain duration through a series of positional encodings, we synthesize variable-length motion sequences conditioned on a categorical action. Specifically, we design a Transformer-based architecture, ACTOR, for encoding and decoding a sequence of parametric SMPL human body models estimated from action recognition datasets. We evaluate our approach on the NTU RGB+D, HumanAct12 and UESTC datasets and show improvements over the state of the art. Furthermore, we present two use cases: improving action recognition through adding our synthesized data to training, and motion denoising.
2. Introduction
- Task:
    - Given a semantic action label, generate a motion sequence matching its semantics. This differs from earlier motion-generation tasks, which predict future motion from a given past motion sequence. The task has applications in virtual reality and character control.
- The authors' approach:
    - An action-conditioned generative model: a Transformer-based encoder-decoder architecture trained as a VAE.
    - SMPL body model: the body can be described either by its joints or by vertices on the body surface.
    - The entire motion sequence is output in a single pass (non-autoregressively).
3. Action-Conditioned Motion Generation
Problem definition
Human actions can be defined by the rotations of body parts, independently of body shape. To generate motions for bodies of different morphologies, pose must be disentangled from body shape; for this, the authors use the SMPL model.
Definitions:
- Action label: \(a \in A\), where \(A\) is the set of action categories.
- Body poses: \(R_1, \dots, R_T\)
- Root joint translation sequence: \(D_1, \dots, D_T\), with \(D_t \in \mathbb{R}^3\)
Motion representation
SMPL represents the per-frame body pose as the rotations of 23 body joints plus one global rotation. The authors follow this convention and use the 6D rotation representation, so \(R_t \in \mathbb{R}^{24 \times 6}\).
Definitions:
- Input \(P_t\): the combination of \(R_t\) and \(D_t\)
- Outputs:
    - \(V_t\): body mesh vertices
    - \(J_t\): body joint coordinates
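The 6D rotation representation mentioned above (Zhou et al., CVPR 2019) stores the first two columns of a rotation matrix and recovers the full matrix by Gram-Schmidt orthogonalization. A minimal numpy sketch of the standard conversion (not the paper's exact code):

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Convert a 6D rotation representation (two 3-vectors) into a
    3x3 rotation matrix via Gram-Schmidt orthogonalization."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)        # first column: normalize a1
    a2 = a2 - np.dot(b1, a2) * b1       # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)        # second column: normalize the rest
    b3 = np.cross(b1, b2)               # third column: cross product
    return np.stack([b1, b2, b3], axis=-1)
```

The representation is continuous (unlike quaternions or axis-angle under certain mappings), which makes it easier for networks to regress; per frame, 24 such 6D rotations give \(R_t \in \mathbb{R}^{24 \times 6}\).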
3.1 Conditional Transformer VAE for Motions
Fig.1. Network architecture
Encoder:
- Inputs:
    - \(a\): action label
    - \(P_1, \dots, P_T\): a motion sequence of arbitrary length
- Outputs:
    - \(\mu\): mean of the latent distribution
    - \(\Sigma\): variance of the latent distribution
- Structure:
    - \(\mu_a^{token}, \Sigma_a^{token}\): learnable, class-dependent distribution tokens.
    - Linear projection: maps each \(P_t\) into \(\mathbb{R}^d\).
    - PE: positional encoding, using the standard sinusoidal functions.
    - Transformer encoder: processes information along the time dimension.
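The sinusoidal positional encoding used above is the standard one from the original Transformer. A minimal numpy sketch of how the \(T \times d\) encoding table is built:

```python
import numpy as np

def sinusoidal_pe(T, d):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017):
    PE[t, 2i]   = sin(t / 10000^(2i/d))
    PE[t, 2i+1] = cos(t / 10000^(2i/d))
    Assumes d is even."""
    pos = np.arange(T)[:, None]               # (T, 1) time steps
    i = np.arange(0, d, 2)[None, :]           # (1, d/2) even dimensions
    angles = pos / np.power(10000.0, i / d)   # (T, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Because the encoding is a deterministic function of the frame index, the decoder can later be queried with any duration \(T\) simply by feeding \(T\) positional-encoding vectors, which is what enables variable-length generation.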
Decoder:
- Inputs:
    - \(z \in M\): latent variable, sampled via the reparameterization trick
    - \(a\): action label
    - Duration \(T\): number of frames to generate
- Outputs:
    - \(\hat{P}_1, \dots, \hat{P}_T\): the generated motion sequence
- Structure: same as the classic Transformer decoder
    - \(b_a^{token}\): a learnable bias that shifts the latent representation into an action-dependent region of the space.
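Two small pieces of the decoder input can be sketched in numpy: the reparameterization trick used to sample \(z\), and the per-action learnable bias \(b_a^{token}\) that conditions the latent. The dimensions and initialization below are illustrative assumptions, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_actions = 256, 12   # hypothetical latent size and action count

# Hypothetical learnable parameter: one bias vector per action class,
# playing the role of b_a^token (shifts z into an action-dependent region).
action_bias = rng.standard_normal((num_actions, d)) * 0.01

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I). In an autodiff framework
    this keeps sampling differentiable w.r.t. mu and logvar; numpy here
    only illustrates the arithmetic."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def shift_latent(z, a):
    """Condition the sampled latent on action label a via the learnable bias."""
    return z + action_bias[a]
```

The shifted latent is then fed to the Transformer decoder together with \(T\) positional encodings to produce \(\hat{P}_1, \dots, \hat{P}_T\) in one pass.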
3.2. Training
Three loss terms:
- Reconstruction loss on pose parameters: \(\mathcal{L}_P\)
- Reconstruction loss on vertex coordinates: \(\mathcal{L}_V\)
- KL loss: \(\mathcal{L}_{KL}\), the VAE regularization term
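A minimal sketch of how these three terms combine, assuming L2 reconstruction terms and the standard closed-form KL divergence to a unit Gaussian (the weight `lam_kl` is a hypothetical hyperparameter, not necessarily the paper's value):

```python
import numpy as np

def kl_loss(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), the standard VAE term."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def actor_losses(P_hat, P, V_hat, V, mu, logvar, lam_kl=1e-5):
    """Total loss L = L_P + L_V + lam_kl * L_KL:
    L_P  -- L2 reconstruction on pose parameters,
    L_V  -- L2 reconstruction on mesh vertex coordinates,
    L_KL -- KL regularization on the latent distribution."""
    L_P = np.sum((P_hat - P) ** 2)
    L_V = np.sum((V_hat - V) ** 2)
    return L_P + L_V + lam_kl * kl_loss(mu, logvar)
```

Supervising vertices \(V_t\) in addition to pose parameters penalizes errors in the visible body surface directly, rather than only in rotation space.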
4. Experiments
4.1. Datasets and metrics
- datasets
- NTU RGB+D dataset
- HumanAct12 dataset
- UESTC dataset
- Evaluation metrics
- FID
- action recognition accuracy
- overall diversity
- per-action diversity
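These metrics are computed on feature vectors extracted by a pretrained action-recognition network. As a concrete example, a sketch of the diversity metric as commonly defined in this line of work (mean distance over random pairs of generated-motion features; the pair count here is an assumption):

```python
import numpy as np

def diversity(features, num_pairs=1000, seed=0):
    """Overall diversity: mean Euclidean distance between randomly
    paired feature vectors of generated motions (higher = more varied).
    `features` has shape (num_motions, feature_dim)."""
    rng = np.random.default_rng(seed)
    n = len(features)
    i = rng.integers(0, n, num_pairs)
    j = rng.integers(0, n, num_pairs)
    return np.mean(np.linalg.norm(features[i] - features[j], axis=1))
```

Per-action diversity applies the same computation within each action class; FID instead compares the Gaussian statistics (mean and covariance) of generated versus real feature distributions.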
4.2. Ablation study
Reconstruction loss
Fig.2. Reconstruction loss
Root translation:
Effect of the root joint translation on the results.
Fig.3. Generating the 3D root translation
Architecture design
Comparison of the effect of different network architectures on the results.
Fig.4. Network architecture
Training with sequences of variable durations
Comparison of training on fixed-length inputs while generating variable-length motions versus training directly on variable-length sequences.
Fig.5. Generating sequences of different durations
5. Summary
This paper addresses generating a motion sequence corresponding to a given action label, based on a CVAE and a Transformer.
Key points:
- How to represent the human body: the options I know of are the skeletal joints and the body surface mesh.
- What network architecture to use to learn the feature distribution of motion sequences.
- Thorough experimental analysis: loss functions, network architecture, baselines, and applications.
Tags: Mathis, Transformer, motion, VAE, sequence, action, human body. From: https://www.cnblogs.com/guixu/p/16862682.html