Preface
\(\quad\) In *Deep Learning*\(^{[1]}\), to illustrate the role of linear algebra in deep learning, the last section of Chapter 2 introduces the idea of PCA. For convenience, the decoder map is fixed in advance as \(g(\mathbf{c}) = \mathbf{Dc}\), where \(\mathbf{D} \in \mathbb{R}^{n\times l}\). The corresponding encoder map must then be derived from this decoder; by minimizing the \(L^2\) distance between the decoded vector and the original one, the book obtains an encoder that is likewise a simple linear map, \(f(\mathbf{x}) = \mathbf{D}^T\mathbf{x}\), so that the reconstruction is \(r(\mathbf{x}) = g(f(\mathbf{x})) = \mathbf{D}\mathbf{D}^{T}\mathbf{x}\). We now know the form of the encoding and decoding maps, but what are they concretely? That is, what exactly is the matrix \(\mathbf{D}\) introduced above?
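\(\quad\) To make the encoder/decoder pair concrete before deriving \(\mathbf{D}\), here is a minimal NumPy sketch. The matrix `D` below is just a random placeholder with orthonormal columns; which \(\mathbf{D}\) is optimal is exactly the question this post answers. All names and sizes are illustrative:

```python
import numpy as np

n, l = 5, 2                       # ambient dimension and code dimension
rng = np.random.default_rng(0)

# Any D with orthonormal columns (D^T D = I_l) works as a placeholder here.
D, _ = np.linalg.qr(rng.normal(size=(n, l)))

def encode(x):
    return D.T @ x                # f(x) = D^T x, maps R^n -> R^l

def decode(c):
    return D @ c                  # g(c) = D c, maps R^l -> R^n

x = rng.normal(size=n)
r = decode(encode(x))             # r(x) = D D^T x, the reconstruction
print(np.linalg.norm(x - r))      # reconstruction error ||x - r(x)||_2
```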
\(\quad\) The book works out the case \(l = 1\) in detail, which amounts to solving the following optimization problem:
\[\mathbf{d}^*= \mathop{\arg\min}\limits_{\mathbf{d}} \sum_{i} \Vert \mathbf{x}^{(i)} -\mathbf{d}\mathbf{d}^T\mathbf{x}^{(i)}\Vert_2^{2} \quad \text{subject to} \quad \Vert \mathbf{d} \Vert_2 = 1 \]\(\quad\) Below, we carry out this derivation for the general case.
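\(\quad\) Before turning to the general case, a quick numerical sanity check of the \(l = 1\) problem, anticipating the result derived below that the optimal \(\mathbf{d}\) is the top eigenvector of \(\mathbf{X}^T\mathbf{X}\); the data and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))          # 100 samples in R^5 (arbitrary data)

def recon_error(d):
    # sum_i ||x^(i) - d d^T x^(i)||_2^2 for a unit vector d
    R = X - np.outer(X @ d, d)
    return np.sum(R ** 2)

# Candidate: the eigenvector of X^T X with the largest eigenvalue.
w, V = np.linalg.eigh(X.T @ X)         # eigenvalues in ascending order
best = recon_error(V[:, -1])

# No random unit vector should beat it.
for _ in range(1000):
    d = rng.normal(size=5)
    d /= np.linalg.norm(d)
    assert best <= recon_error(d) + 1e-9
print("top eigenvector of X^T X wins:", best)
```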
Problem Statement
The general case above can be cast as the following optimization problem:
\[\mathbf{D}^* =\mathop{\arg\min}\limits_{\mathbf{D}} \sum_{i} \Vert \mathbf{x}^{(i)} -\mathbf{D}\mathbf{D}^T\mathbf{x}^{(i)}\Vert_2^{2} \quad \text{subject to} \quad \mathbf{D}^T\mathbf{D} = \mathbf{I}_l \]\(\quad\) We now simplify this problem. First, for notational convenience, stack the \(m\) samples as the rows of \(\mathbf{X} = [\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(m)}]^T \in \mathbb{R}^{m \times n}\); the problem can then be written as:
\[\mathbf{D}^* =\mathop{\arg\min}\limits_{\mathbf{D}} \Vert \mathbf{X} -\mathbf{X}\mathbf{D}\mathbf{D}^T \Vert_F^{2} \quad \text{subject to} \quad \mathbf{D}^T\mathbf{D} = \mathbf{I}_l \]
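\(\quad\) The step from the per-sample sum to the Frobenius norm is just the identity \(\Vert \mathbf{M} \Vert_F^2 = \sum_i \Vert \text{row}_i(\mathbf{M}) \Vert_2^2\) applied to \(\mathbf{M} = \mathbf{X} - \mathbf{X}\mathbf{D}\mathbf{D}^T\); a minimal numerical confirmation (shapes and seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, l = 50, 6, 3
X = rng.normal(size=(m, n))                   # rows are the samples x^(i)T
D, _ = np.linalg.qr(rng.normal(size=(n, l)))  # orthonormal columns, D^T D = I_l

per_sample = sum(np.linalg.norm(x - D @ D.T @ x) ** 2 for x in X)
frobenius  = np.linalg.norm(X - X @ D @ D.T, 'fro') ** 2
print(np.isclose(per_sample, frobenius))      # True: the two objectives agree
```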
Solving the Problem

Lemma 1 (von Neumann trace inequality)\(^{[2]}\)
If \(\mathbf{A},\mathbf{B} \in \mathbb{C}^{n \times n}\), write \(\sigma(\mathbf{A}) = [\sigma_{1}(\mathbf{A}),\dots,\sigma_{n}(\mathbf{A})]^T\) and \(\sigma(\mathbf{B}) = [\sigma_{1}(\mathbf{B}),\dots,\sigma_{n}(\mathbf{B})]^T\), with the singular values ordered as \(\sigma_{1}(\mathbf{A}) \geq \dots \geq \sigma_{n}(\mathbf{A})\) and \(\sigma_{1}(\mathbf{B}) \geq \dots \geq \sigma_{n}(\mathbf{B})\),
then \(\left| \text{tr}(\mathbf{A}\mathbf{B}) \right| \leq \sigma(\mathbf{A})^T \sigma(\mathbf{B})\).
Proof: see Chapter 4 of Xingzhi Zhan's *Matrix Theory*\(^{[2]}\).
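\(\quad\) The proof is beyond our scope here, but a quick numerical sanity check on random real matrices is easy (sizes and seed arbitrary; the complex case is analogous):

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(1000):
    A = rng.normal(size=(4, 4))
    B = rng.normal(size=(4, 4))
    lhs = abs(np.trace(A @ B))
    # np.linalg.svd returns singular values in descending order,
    # matching the ordering required by the lemma.
    rhs = np.linalg.svd(A, compute_uv=False) @ np.linalg.svd(B, compute_uv=False)
    assert lhs <= rhs + 1e-9
print("von Neumann trace inequality holds on all samples")
```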
\(\quad\) Returning to the analysis above, we need to solve the following problem:
\[\begin{equation*} \begin{split} \mathbf{D}^* &= \mathop{\arg\min}\limits_{\mathbf{D}} \Vert \mathbf{X} - \mathbf{X} \mathbf{D} \mathbf{D}^T \Vert_F^2 \quad \text{subject to} \quad \mathbf{D}^T \mathbf{D} = \mathbf{I}_l \\ &= \mathop{\arg\min}\limits_{\mathbf{D}} \text{tr}[(\mathbf{X} - \mathbf{X}\mathbf{D} \mathbf{D}^T )^T (\mathbf{X} - \mathbf{X}\mathbf{D} \mathbf{D}^T )] \\ &= \mathop{\arg\min}\limits_{\mathbf{D}} \text{tr}(\mathbf{X}^T \mathbf{X} - \mathbf{X}^T \mathbf{X}\mathbf{D} \mathbf{D}^T - \mathbf{D} \mathbf{D}^T \mathbf{X}^T \mathbf{X} + \mathbf{D} \mathbf{D}^T \mathbf{X}^T \mathbf{X} \mathbf{D} \mathbf{D}^T) \\ &= \mathop{\arg\min}\limits_{\mathbf{D}} -2\, \text{tr}( \mathbf{X}^T \mathbf{X}\mathbf{D} \mathbf{D}^T) + \text{tr}( \mathbf{X}^T \mathbf{X} \mathbf{D} \mathbf{D}^T) \\ &= \mathop{\arg\min}\limits_{\mathbf{D}} - \text{tr}( \mathbf{X}^T \mathbf{X}\mathbf{D} \mathbf{D}^T) \\ &= \mathop{\arg\max}\limits_{\mathbf{D}} \text{tr}( \mathbf{D} \mathbf{D}^T\mathbf{X}^T \mathbf{X}) \\ \end{split} \end{equation*} \]\(\quad\) The constraint \(\mathbf{D}^T \mathbf{D} = \mathbf{I}_l\) is carried through every line. The fourth line uses the cyclic property of the trace together with \(\mathbf{D}^T\mathbf{D} = \mathbf{I}_l\) to rewrite \(\text{tr}(\mathbf{D}\mathbf{D}^T\mathbf{X}^T\mathbf{X}\mathbf{D}\mathbf{D}^T) = \text{tr}(\mathbf{X}^T\mathbf{X}\mathbf{D}\mathbf{D}^T)\), and drops the constant term \(\text{tr}(\mathbf{X}^T\mathbf{X})\), which does not depend on \(\mathbf{D}\).

\(\quad\) For this maximization we first derive an upper bound. By Lemma 1, \(\text{tr}(\mathbf{D} \mathbf{D}^T\mathbf{X}^T \mathbf{X}) \leq \sigma(\mathbf{D} \mathbf{D}^T)^T \sigma(\mathbf{X}^T \mathbf{X})\). Since \(\mathbf{D} \mathbf{D}^T\) is real symmetric, its singular values are the absolute values of its eigenvalues. The constraint gives \(\mathbf{D}^T \mathbf{D} = \mathbf{I}_l\), and \(\mathbf{D} \mathbf{D}^T\) has the same nonzero eigenvalues as \(\mathbf{D}^T \mathbf{D}\), the rest being zero; hence the eigenvalues of \(\mathbf{D} \mathbf{D}^T\) are \(1, \dots, 1, 0, \dots, 0\), with \(l\) ones and \(n-l\) zeros. Substituting into the von Neumann trace inequality yields \(\text{tr}(\mathbf{D} \mathbf{D}^T\mathbf{X}^T \mathbf{X}) \leq \sum_{i = 1}^{l} \sigma_{i}(\mathbf{X}^T \mathbf{X}) = \sum_{i = 1}^{l} \lambda_{i}(\mathbf{X}^T \mathbf{X})\), where the singular values and eigenvalues of \(\mathbf{X}^T\mathbf{X}\) coincide because it is symmetric positive semidefinite. The bound is attained by taking the columns of \(\mathbf{D}\) to be eigenvectors of \(\mathbf{X}^T\mathbf{X}\) belonging to its \(l\) largest eigenvalues: then \(\text{tr}(\mathbf{D}\mathbf{D}^T\mathbf{X}^T\mathbf{X}) = \text{tr}(\mathbf{D}^T\mathbf{X}^T\mathbf{X}\mathbf{D}) = \sum_{i=1}^{l}\lambda_{i}(\mathbf{X}^T\mathbf{X})\). These eigenvectors are precisely the first \(l\) principal components.
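\(\quad\) Putting everything together, a minimal NumPy sketch of the general solution: take the columns of \(\mathbf{D}\) to be the top-\(l\) eigenvectors of \(\mathbf{X}^T\mathbf{X}\), and verify both that the trace objective attains the bound \(\sum_{i=1}^{l}\lambda_i(\mathbf{X}^T\mathbf{X})\) and that the reconstruction error equals the sum of the remaining eigenvalues (data and dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, l = 200, 8, 3
X = rng.normal(size=(m, n))

# Eigendecomposition of the symmetric PSD matrix X^T X.
w, V = np.linalg.eigh(X.T @ X)          # eigenvalues in ascending order
D = V[:, ::-1][:, :l]                   # eigenvectors of the l largest eigenvalues

# The trace objective attains the upper bound (sum of the top-l eigenvalues)...
attained = np.trace(D @ D.T @ X.T @ X)
print(np.isclose(attained, w[::-1][:l].sum()))   # True: the bound is tight

# ...equivalently, the reconstruction error equals the sum of the
# remaining n - l eigenvalues, the smallest value achievable.
residual = np.linalg.norm(X - X @ D @ D.T, 'fro') ** 2
print(np.isclose(residual, w[:n - l].sum()))     # True
```

\(\quad\) Note that the maximizer is unique only up to an orthogonal change of basis within the selected eigenspace: any \(\mathbf{D}' = \mathbf{D}\mathbf{Q}\) with \(\mathbf{Q}^T\mathbf{Q} = \mathbf{I}_l\) achieves the same objective, since \(\mathbf{D}'\mathbf{D}'^T = \mathbf{D}\mathbf{D}^T\); the subspace, not the particular basis, is what the optimization determines.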
References
[1] Goodfellow I, Bengio Y, Courville A. Deep Learning[M]. Cambridge: MIT Press, 2016.
[2] Zhan X. Matrix Theory[M]. Beijing: Higher Education Press, 2008.