
VAE: Variational Auto-Encoder


In short, a VAE is itself a generative model: we assume that an observed variable \(\mathbf{x}\) (say, images of the digits 0~9) is governed by a latent variable \(\mathbf{z}\). Once the distributions are learned, we only need to sample a \(\mathbf{z}\) to obtain an \(\mathbf{x}\).

Reference - Lil's blog
Reference - 科学空间 (Scientific Spaces)
PyTorch implementation

The autoencoder was invented to reconstruct high-dimensional data using a neural network model. A nice byproduct is dimensionality reduction: the bottleneck layer captures a compressed latent encoding. Such a low-dimensional representation can be used as an embedding vector in various applications (e.g. search), help with data compression, or reveal the underlying data-generative factors.

Notation

Symbol  Meaning
\(\mathcal{D}\) The dataset, \(\mathcal{D} = \{ \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(n)} \}\) , contains \(n\) data samples; \(\vert\mathcal{D}\vert =n\).
\(\mathbf{x}^{(i)}\) Each data point is a vector of \(d\) dimensions, \(\mathbf{x}^{(i)} = [x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_d]\).
\(\mathbf{x}\) One data sample from the dataset, \(\mathbf{x} \in \mathcal{D}\).
\(\mathbf{x}'\) The reconstructed version of \(\mathbf{x}\).
\(\tilde{\mathbf{x}}\) The corrupted version of \(\mathbf{x}\).
\(\mathbf{z}\) The compressed code learned in the bottleneck layer.
\(a_j^{(l)}\) The activation function for the \(j\)-th neuron in the \(l\)-th hidden layer.
\(g_{\phi}(.)\) The encoding function parameterized by \(\phi\).
\(f_{\theta}(.)\) The decoding function parameterized by \(\theta\).
\(q_{\phi}(\mathbf{z}\vert\mathbf{x})\) Estimated posterior probability function, also known as probabilistic encoder.
\(p_{\theta}(\mathbf{x}\vert\mathbf{z})\) Likelihood of generating true data sample given the latent code, also known as probabilistic decoder.

Autoencoder

The model contains an encoder function \(g(.)\) parameterized by \(\phi\) and a decoder function \(f(.)\) parameterized by \(\theta\). The low-dimensional code learned for input \(\mathbf{x}\) in the bottleneck layer is \(\mathbf{z} = g_\phi(\mathbf{x})\) and the reconstructed input is \(\mathbf{x}' = f_\theta(g_\phi(\mathbf{x}))\).

The parameters \((\theta, \phi)\) are learned together to output a reconstructed data sample that is the same as the original input, \(\mathbf{x} \approx f_\theta(g_\phi(\mathbf{x}))\), or in other words, to learn an identity function. A simple metric is the MSE loss:

\[L_\text{AE}(\theta, \phi) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}^{(i)} - f_\theta(g_\phi(\mathbf{x}^{(i)})))^2 \]
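As a minimal PyTorch sketch of this setup (the layer sizes and module layout below are illustrative assumptions, not taken from the post):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Plain autoencoder: x -> z (bottleneck) -> x'. Sizes are illustrative."""
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        # g_phi: encoder maps the input down to the low-dimensional code z
        self.encoder = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                     nn.Linear(256, d_latent))
        # f_theta: decoder reconstructs x' from z
        self.decoder = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(),
                                     nn.Linear(256, d_in))

    def forward(self, x):
        z = self.encoder(x)        # z = g_phi(x)
        return self.decoder(z)     # x' = f_theta(g_phi(x))

model = AutoEncoder()
x = torch.rand(16, 784)                      # dummy batch standing in for real data
loss = nn.functional.mse_loss(model(x), x)   # L_AE: mean squared reconstruction error
loss.backward()
```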

Denoising Autoencoder

To avoid overfitting and improve robustness, the Denoising Autoencoder (Vincent et al. 2008) proposed a modification to the basic autoencoder. The input is partially corrupted by adding noise to, or masking, some values of the input vector in a stochastic manner, \(\tilde{\mathbf{x}} \sim \mathcal{M}_\mathcal{D}(\tilde{\mathbf{x}} \vert \mathbf{x})\). The model is then trained to recover the original input (note: not the corrupted one).

\[\begin{aligned} \tilde{\mathbf{x}}^{(i)} &\sim \mathcal{M}_\mathcal{D}(\tilde{\mathbf{x}}^{(i)} \vert \mathbf{x}^{(i)})\\ L_\text{DAE}(\theta, \phi) &= \frac{1}{n} \sum_{i=1}^n (\mathbf{x}^{(i)} - f_\theta(g_\phi(\tilde{\mathbf{x}}^{(i)})))^2 \end{aligned} \]

where \(\mathcal{M}_\mathcal{D}\) defines the mapping from the true data samples to the noisy or corrupted ones.

In the experiment of the original DAE paper, the noise is applied in this way: a fixed proportion of input dimensions are selected at random and their values are forced to 0.
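A small sketch of that corruption step, assuming a 30% corruption rate purely for illustration:

```python
import torch

def corrupt(x, p=0.3):
    """Zero out a fixed proportion p of the input dimensions, chosen at random per sample."""
    k = int(p * x.size(-1))
    idx = torch.rand_like(x).argsort(dim=-1)[..., :k]   # k random dimensions per sample
    x_tilde = x.clone()
    x_tilde.scatter_(-1, idx, 0.0)                      # force the selected values to 0
    return x_tilde

x = torch.rand(16, 784)
x_tilde = corrupt(x)   # the DAE is trained so that f(g(x_tilde)) reconstructs x, not x_tilde
```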

VAE: Variational Autoencoder

Instead of mapping the input into a fixed vector, we want to map it into a distribution. Let’s label this distribution as \(p_\theta\), parameterized by \(\theta\). The relationship between the data input \(\mathbf{x}\) and the latent encoding vector \(\mathbf{z}\) can be fully defined by:

  • Prior \(p_\theta(\mathbf{z})\) over the latent variable \(\mathbf{z}\)
  • Likelihood \(p_\theta(\mathbf{x}\vert\mathbf{z})\)
  • Posterior \(p_\theta(\mathbf{z}\vert\mathbf{x})\)

If we knew the distribution parameters \(\theta^{*}\), we could sample a \(\mathbf{z}\) from \(p_{\theta^*}(\mathbf{z})\) and then obtain an \(\mathbf{x}\) close to the real data through the distribution \(p_{\theta^*}(\mathbf{x} \vert \mathbf{z})\).

Suppose we know the real parameter \(\theta^{*}\) for this distribution. In order to generate a sample that looks like a real data point \(\mathbf{x}^{(i)}\), we follow these steps:

  1. First, sample a \(\mathbf{z}^{(i)}\) from a prior distribution \(p_{\theta^*}(\mathbf{z})\).
  2. Then a value \(\mathbf{x}^{(i)}\) is generated from a conditional distribution \(p_{\theta^*}(\mathbf{x} \vert \mathbf{z} = \mathbf{z}^{(i)})\).
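Once trained, these two steps are literally all that generation takes. A quick sketch, assuming the standard normal prior and the decoder network that the rest of the post builds up to:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def generate(decoder, n=16, d_latent=32):
    """Step 1: sample z from the prior; Step 2: decode it into a data sample."""
    z = torch.randn(n, d_latent)   # z ~ p(z), assumed N(0, I) as justified later in the post
    return decoder(z)              # x generated from p_theta(x | z)

# e.g. with a toy decoder network standing in for p_theta(x | z):
toy_decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())
samples = generate(toy_decoder)
```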

So, given the dataset \(\mathcal{D}\), each sample \(\mathbf{x}^{(i)}\) is generated with probability \(p_\theta(\mathbf{x}^{(i)}) = \int p_\theta(\mathbf{x}^{(i)}\vert\mathbf{z}) p_\theta(\mathbf{z}) d\mathbf{z}\), and \(\theta^{*}\) should be the maximum-likelihood choice:

The optimal parameter \(\theta^{*}\) is the one that maximizes the probability of generating real data samples:

\[\begin{aligned} \theta^{*} &= \arg\max_\theta \prod_{i=1}^n p_\theta(\mathbf{x}^{(i)}) \\ &= \arg\max_\theta \sum_{i=1}^n \log p_\theta(\mathbf{x}^{(i)}) \\ \end{aligned} \]

where

\[p_\theta(\mathbf{x}^{(i)}) = \int p_\theta(\mathbf{x}^{(i)}\vert\mathbf{z}) p_\theta(\mathbf{z}) d\mathbf{z} \]

However, this quantity is hard to compute directly, so we introduce a new distribution \(q_\phi(\mathbf{z}\vert\mathbf{x})\) to approximate \(p_\theta(\mathbf{z}\vert\mathbf{x})\).

Unfortunately it is not easy to compute \(p_\theta(\mathbf{x}^{(i)})\) in this way, as it is very expensive to check all the possible values of \(\mathbf{z}\) and sum them up. To narrow down the value space to facilitate faster search, we would like to introduce a new approximation function to output what is a likely code given an input \(\mathbf{x}\), \(q_\phi(\mathbf{z}\vert\mathbf{x})\), parameterized by \(\phi\).

Now the structure looks a lot like an autoencoder:

  • The conditional probability \(p_\theta(\mathbf{x} \vert \mathbf{z})\) defines a generative model, similar to the decoder \(f_\theta(\mathbf{x} \vert \mathbf{z})\) introduced above. \(p_\theta(\mathbf{x} \vert \mathbf{z})\) is also known as probabilistic decoder.
  • The approximation function \(q_\phi(\mathbf{z} \vert \mathbf{x})\) is the probabilistic encoder, playing a similar role as \(g_\phi(\mathbf{z} \vert \mathbf{x})\) above.

Loss Function: ELBO, Evidence Lower Bound

Having introduced the new distribution \(q_\phi(\mathbf{z}\vert\mathbf{x})\) to approximate \(p_\theta(\mathbf{z}\vert\mathbf{x})\), we measure how close the two are with the KL divergence, and from it we can derive the identity below.

We can use Kullback-Leibler divergence to quantify the distance between these two distributions, \(q_\phi(\mathbf{z}\vert\mathbf{x})\) and \(p_\theta(\mathbf{z}\vert\mathbf{x})\). Recall that KL divergence \(D_\text{KL}(X\|Y)\) measures how much information is lost if the distribution \(Y\) is used to represent \(X\).
In our case we want to minimize \(D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) )\) with respect to \(\phi\).

Why use \(D_\text{KL}(q_\phi \| p_\theta)\) (reversed KL) instead of \(D_\text{KL}(p_\theta \| q_\phi)\) (forward KL)?
First note that, taking \(D_\text{KL}\Big(p(x)\Big\Vert q(x)\Big) = \int p(x)\ln \frac{p(x)}{q(x)} dx=\mathbb{E}_{x\sim p(x)}\left[\ln \frac{p(x)}{q(x)}\right]\) as an example: if \(q(x)\) is zero in some region where \(p(x)\) is not, the KL divergence blows up to infinity; in other words, keeping it finite forces the nonzero region of \(q(x)\) to contain that of \(p(x)\).
Forward KL divergence: \(D_\text{KL}(P\|Q) = \mathbb{E}_{z\sim P(z)} \log\frac{P(z)}{Q(z)}\); we have to ensure that \(Q(z)>0\) wherever \(P(z)>0\). The optimized variational distribution \(q(z)\) has to cover the entire \(p(z)\).
Reversed KL divergence: \(D_\text{KL}(Q\|P) = \mathbb{E}_{z\sim Q(z)} \log\frac{Q(z)}{P(z)}\); minimizing the reversed KL divergence squeezes \(Q(z)\) under \(P(z)\).

Let’s now expand the equation, so we have:

\[D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) ) =\log p_\theta(\mathbf{x}) + D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z})) - \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})}\log p_\theta(\mathbf{x}\vert\mathbf{z}) \]
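For completeness, here is the expansion spelled out; it uses Bayes' rule \(p_\theta(\mathbf{z}\vert\mathbf{x}) = p_\theta(\mathbf{x}\vert\mathbf{z})p_\theta(\mathbf{z})/p_\theta(\mathbf{x})\) inside the expectation and the fact that \(\log p_\theta(\mathbf{x})\) does not depend on \(\mathbf{z}\):

\[\begin{aligned} D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) ) &= \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})}\big[ \log q_\phi(\mathbf{z}\vert\mathbf{x}) - \log p_\theta(\mathbf{z}\vert\mathbf{x}) \big] \\ &= \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})}\big[ \log q_\phi(\mathbf{z}\vert\mathbf{x}) - \log p_\theta(\mathbf{x}\vert\mathbf{z}) - \log p_\theta(\mathbf{z}) + \log p_\theta(\mathbf{x}) \big] \\ &= \log p_\theta(\mathbf{x}) + D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z})) - \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})}\log p_\theta(\mathbf{x}\vert\mathbf{z}) \end{aligned} \]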

Once we rearrange the left- and right-hand sides of the equation, we get:

\[\log p_\theta(\mathbf{x}) - D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) ) = \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}\vert\mathbf{x})}\log p_\theta(\mathbf{x}\vert\mathbf{z}) - D_\text{KL}(q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z})) \]

On the left-hand side, the first term is the log of the likelihood and the second is the KL divergence with a minus sign; we naturally want both sides of the equation to be as large as possible. After negation, this can be used as the loss function.

The LHS of the equation is exactly what we want to maximize when learning the true distributions: we want to maximize the (log-)likelihood of generating real data (that is \(\log p_\theta(\mathbf{x})\)) and also minimize the difference between the real and estimated posterior distributions (the term \(D_\text{KL}\) works like a regularizer). Note that \(p_\theta(\mathbf{x})\) is fixed with respect to \(q_\phi\).

So, the negation of the above defines our loss function:

\[\begin{aligned} L_\text{VAE}(\theta, \phi) &= -\log p_\theta(\mathbf{x}) + D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) )\\ &= - \mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z}\vert\mathbf{x})} \log p_\theta(\mathbf{x}\vert\mathbf{z}) + D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}) ) \\ \theta^{*}, \phi^{*} &= \arg\min_{\theta, \phi} L_\text{VAE} \end{aligned} \]

There is another implication here: as we keep decreasing the loss, the lower bound on the log-likelihood keeps increasing.

In Variational Bayesian methods, this loss function is known as the variational lower bound, or evidence lower bound. The “lower bound” part in the name comes from the fact that KL divergence is always non-negative and thus \(-L_\text{VAE}\) is the lower bound of \(\log p_\theta (\mathbf{x})\).

\[-L_\text{VAE} = \log p_\theta(\mathbf{x}) - D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}\vert\mathbf{x}) ) \leq \log p_\theta(\mathbf{x}) \]

Therefore by minimizing the loss, we are maximizing the lower bound of the probability of generating real data samples.

Reparameterization Trick

Back to the loss function:

\[L_\text{VAE}(\theta, \phi) = - \mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z}\vert\mathbf{x})} \log p_\theta(\mathbf{x}\vert\mathbf{z}) + D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}) ) \]

Consider the two parts separately. The first is an expectation under samples \(\mathbf{z} \sim q_\phi(\mathbf{z}\vert\mathbf{x})\), but sampling cannot be differentiated through, so we rewrite the random variable as \(\mathbf{z} = \mathcal{T}_\phi(\mathbf{x}, \boldsymbol{\epsilon})\), handing all the randomness over to \(\boldsymbol{\epsilon}\).

The expectation term in the loss function invokes generating samples from \(\mathbf{z} \sim q_\phi(\mathbf{z}\vert\mathbf{x})\). Sampling is a stochastic process and therefore we cannot backpropagate the gradient. To make it trainable, the reparameterization trick is introduced: it is often possible to express the random variable \(\mathbf{z}\) as a deterministic variable \(\mathbf{z} = \mathcal{T}_\phi(\mathbf{x}, \boldsymbol{\epsilon})\), where \(\boldsymbol{\epsilon}\) is an auxiliary independent random variable, and the transformation function \(\mathcal{T}_\phi\), parameterized by \(\phi\), converts \(\boldsymbol{\epsilon}\) to \(\mathbf{z}\).

Besides, since we are fitting with \(q_\phi(\mathbf{z}\vert\mathbf{x})\), we need to choose a functional form for it. The usual choice is a multivariate Gaussian, and for convenience we assume its covariance matrix is diagonal, so that \(\mathbf{z} = \mathcal{T}_\phi(\mathbf{x}, \boldsymbol{\epsilon}) = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}\), where \(\boldsymbol{\mu}, \boldsymbol{\sigma}\) are outputs of the encoder given \(\mathbf{x}\).

For example, a common choice of the form of \(q_\phi(\mathbf{z}\vert\mathbf{x})\) is a multivariate Gaussian with a diagonal covariance structure:

\[\begin{aligned} \mathbf{z} &\sim q_\phi(\mathbf{z}\vert\mathbf{x}^{(i)}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)}\boldsymbol{I}) & \\ \mathbf{z} &= \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon} \text{, where } \boldsymbol{\epsilon} \sim \mathcal{N}(0, \boldsymbol{I}) & \scriptstyle{\text{; Reparameterization trick.}} \end{aligned} \]

where \(\odot\) refers to the element-wise product.

The reparameterization trick works for other types of distributions too, not only Gaussian. In the multivariate Gaussian case, we make the model trainable by learning the mean and variance of the distribution, \(\boldsymbol{\mu}\) and \(\boldsymbol{\sigma}\), explicitly using the reparameterization trick, while the stochasticity remains in the random variable \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \boldsymbol{I})\).
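A minimal sketch of this sampling step in PyTorch; having the encoder output \(\log\boldsymbol{\sigma}^2\) instead of \(\boldsymbol{\sigma}\) is a common numerical convention assumed here, not something prescribed by the text:

```python
import torch

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I); gradients flow through mu and logvar."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(std)     # all the randomness lives in eps
    return mu + std * eps

mu = torch.zeros(16, 32, requires_grad=True)      # dummy encoder outputs
logvar = torch.zeros(16, 32, requires_grad=True)
z = reparameterize(mu, logvar)                    # differentiable w.r.t. mu and logvar
```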

(Figure: illustration of the variational autoencoder model with the multivariate Gaussian assumption.)

The First Half of the Loss Function

With this, the first term of the loss function becomes:

\[-\mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(0, \boldsymbol{I})} \log p_\theta(\mathbf{x}\vert\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}) \]

But how do we compute this? The answer is by sampling, which is the basis of Monte Carlo simulation:

\[\mathbb{E}_{x\sim p(x)}[f(x)] = \int f(x)p(x)dx \approx \frac{1}{n}\sum_{i=1}^n f(x_i),\quad x_i\sim p(x) \]

How many samples should we draw? The answer the VAE gives is: one!
The first term of the loss function then becomes:

\[- \log p_\theta(\mathbf{x}\vert\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}), \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \boldsymbol{I}) \]

Why is a single sample enough? In practice we run many epochs, and the latent variable is re-sampled at random every time, so with enough epochs the sampling is effectively sufficient.
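In code, this single-sample estimate is just an ordinary reconstruction loss. The sketch below assumes a Bernoulli decoder (a common choice for binary-ish image data); under that assumption \(-\log p_\theta(\mathbf{x}\vert\mathbf{z})\) is a binary cross-entropy, while a Gaussian decoder would give an MSE term instead:

```python
import torch
import torch.nn.functional as F

def recon_loss(x_hat, x):
    """Single-sample estimate of -log p_theta(x|z) for a Bernoulli decoder: summed BCE."""
    return F.binary_cross_entropy(x_hat, x, reduction='sum')

x = torch.rand(16, 784)                        # dummy data scaled to [0, 1]
x_hat = torch.sigmoid(torch.randn(16, 784))    # stands in for decoder(mu + sigma * eps)
print(recon_loss(x_hat, x))
```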

The Second Half of the Loss Function

We still have the second half of the loss function: \(D_\text{KL}( q_\phi(\mathbf{z}\vert\mathbf{x}) \| p_\theta(\mathbf{z}) )\)

The first distribution, \(\mathbf{z} \sim q_\phi(\mathbf{z}\vert\mathbf{x}^{(i)}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)}\boldsymbol{I})\), we have already settled. What about the prior? We simply declare that \(\mathbf{z}\) follows the standard normal distribution. The justification is that "any \(d\)-dimensional distribution can be obtained by mapping a \(d\)-dimensional Gaussian through a sufficiently complex function", and that sufficiently complex function is exactly what the decoder provides.
There is another way to see it: the VAE wants every \(q_\phi(\mathbf{z}\vert\mathbf{x})\), or equivalently \(p_\theta(\mathbf{z}\vert\mathbf{x})\), to approach the standard normal, so that \(\mathbf{z}\) itself approaches a normal distribution; this guarantees that we can later generate data by sampling \(\mathbf{z}\) from the standard normal. Taking the KL divergence against a standard normal \(p_\theta(\mathbf{z})\) therefore also acts as a regularizer that pulls \(q_\phi(\mathbf{z}\vert\mathbf{x})\) toward a normal distribution.

So what we are computing is the KL divergence between two Gaussians, for which there is a closed-form formula:

\[\begin{aligned} D_\text{KL}(\mathcal{N}(\boldsymbol{\mu_0}, \boldsymbol{\Sigma_0}) \| \mathcal{N}(\boldsymbol{\mu_1}, \boldsymbol{\Sigma_1})) &= \frac{1}{2}\bigg( \text{tr}(\boldsymbol{\Sigma_1}^{-1} \boldsymbol{\Sigma_0}) + (\boldsymbol{\mu_1}-\boldsymbol{\mu_0})^T\boldsymbol{\Sigma_1}^{-1}(\boldsymbol{\mu_1}-\boldsymbol{\mu_0}) - d + \log\frac{\det \boldsymbol{\Sigma_1}}{\det \boldsymbol{\Sigma_0}} \bigg) \\ &= \frac{1}{2}\bigg( \text{tr}(\boldsymbol{\sigma}^{2(i)}\boldsymbol{I}) + \boldsymbol{\mu}^{(i)T}\boldsymbol{\mu}^{(i)} - d - \log \det (\boldsymbol{\sigma}^{2(i)}\boldsymbol{I}) \bigg) & \text{; $\boldsymbol{\mu_0}=\boldsymbol{\mu}^{(i)}, \boldsymbol{\Sigma_0} = \boldsymbol{\sigma}^{2(i)}\boldsymbol{I}, \boldsymbol{\mu_1}=0, \boldsymbol{\Sigma_1} = \boldsymbol{I}$} \\ \end{aligned} \]
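With the diagonal covariance, the trace and log-determinant reduce to sums over the latent dimensions, so the KL term is cheap to compute. A sketch, using the log-variance parameterization assumed earlier:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """D_KL( N(mu, diag(sigma^2)) || N(0, I) ) = 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)

mu, logvar = torch.randn(16, 32), torch.randn(16, 32)
print(kl_to_standard_normal(mu, logvar).mean())   # average KL over the batch
```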

With that, we are done; the final neural network looks like the following.
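As a concrete and deliberately minimal sketch of that network, here is a VAE in PyTorch that puts the pieces together; the layer sizes, the Bernoulli decoder, and the single-sample estimate are illustrative assumptions in the spirit of the linked PyTorch implementation, not a reproduction of it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, d_in=784, d_hidden=256, d_latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu = nn.Linear(d_hidden, d_latent)        # mu(x)
        self.logvar = nn.Linear(d_hidden, d_latent)    # log sigma^2(x)
        self.dec = nn.Sequential(nn.Linear(d_latent, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_in), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')      # -E[log p(x|z)], 1 sample
    kl = 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar)  # D_KL(q(z|x) || N(0, I))
    return recon + kl

model = VAE()
x = torch.rand(16, 784)          # dummy batch
x_hat, mu, logvar = model(x)
vae_loss(x_hat, x, mu, logvar).backward()
```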

From: https://www.cnblogs.com/zhyh/p/17024514.html
