Backpropagation
Preliminaries
Assume the training samples are \(\left\{
\left( \pmb{x}_1 , \pmb{y}_1 \right),
\left( \pmb{x}_2 , \pmb{y}_2 \right),
\dots ,
\left( \pmb{x}_n , \pmb{y}_n \right)
\right\}\). Let \(\pmb a^i_j\) be the output of the \(j\)-th neuron in layer \(i\), \(\pmb z^i_j\) the input to that neuron's activation function, \(\pmb w^i_j\) the vector of weights of the \(j\)-th neuron in layer \(i\), and \(b^i_j\) its bias. The network has \(l+1\) layers (layer \(0\) is the input layer, layer \(l\) the output layer, and the rest are hidden layers), and \(l_i\) denotes the number of neurons in layer \(i\).
The activation function is the sigmoid \(h\left( x \right) = \frac{1}{1+e^{-x}}\), applied componentwise to vectors: \(\pmb h(\pmb x) = \pmb h\left( \begin{bmatrix}
x_1 \\
x_2\\
\vdots \\
x_n
\end{bmatrix}\right) = \begin{bmatrix}
h(x_1) \\
h(x_2)\\
\vdots \\
h(x_n)
\end{bmatrix}\)
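As a quick illustration, here is a minimal numpy sketch of this componentwise convention (a sketch only; the names `h` and `z` simply mirror the symbols above):

```python
import numpy as np

def h(x):
    """Sigmoid h(x) = 1 / (1 + e^{-x}); numpy applies it componentwise."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-1.0, 0.0, 2.0])
print(h(z))  # [h(-1) h(0) h(2)], i.e. h applied to each component
```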
Let
\(\pmb A^0 = \begin{bmatrix}
\pmb{x}_1 & \pmb{x}_2 & \cdots & \pmb{x}_n
\end{bmatrix}\) (one column per sample) and \(\pmb A^i = \begin{bmatrix}
\left(\pmb{a}^i_1\right)^T \\
\left(\pmb{a}^i_2\right)^T \\
\vdots \\
\left(\pmb{a}^i_{l_i}\right)^T
\end{bmatrix}\) (one row per neuron of layer \(i\), one column per sample).
This gives \(\pmb{z}^{i}_{j} = \left( \pmb A^{i-1}\right)^T \pmb w^{i}_{j} + b^i_j \pmb 1 \in \mathbb{R}^n\) and \(\pmb a^i_j = \pmb h(\pmb z ^i_j)\), where \(\pmb 1\) is the all-ones vector of length \(n\).
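The following numpy sketch mirrors this forward rule for one layer; collecting the \(\pmb w^i_j\) as the rows of a matrix `W` is a convenience assumed here, not notation from the text:

```python
import numpy as np

def h(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_layer(A_prev, W, b):
    """One layer of the forward pass.
    A_prev : (l_prev, n), row j = outputs of neuron j over all n samples
    W      : (l_i, l_prev), row j = (w^i_j)^T
    b      : (l_i,) biases b^i_j
    Returns Z and A of shape (l_i, n): row j is z^i_j resp. a^i_j = h(z^i_j)."""
    Z = W @ A_prev + b[:, None]  # row j equals ((A_prev)^T w_j + b_j 1)^T
    return Z, h(Z)

# toy sizes: 3 input features, 2 neurons, 5 samples
rng = np.random.default_rng(0)
A0 = rng.normal(size=(3, 5))
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=2)
Z1, A1 = forward_layer(A0, W1, b1)
```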
Loss function
\[J(\pmb{W}, \pmb{b})=\frac{1}{n} \sum_{k=1}^{n} \mathcal{L}\left(\pmb{y}^{(k)}, \hat{\pmb{y}}^{(k)}\right) \]Since this is a sum and differentiation distributes over sums, we can compute the loss of each sample separately and add the results at the end. So take out a single sample, call it \(\left( \pmb{x}, \pmb{y} \right)\) with prediction \(\hat{\pmb y}\) (each \(\pmb A^i\) then collapses to an \(l_i \times 1\) column vector), and backpropagate through it.
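As a small sketch of this per-sample decomposition (squared error is only an assumed example; the argument holds for any differentiable \(\mathcal{L}\)):

```python
import numpy as np

def loss(y, y_hat):
    """Per-sample loss L(y, y_hat); squared error is an assumed example."""
    return 0.5 * np.sum((y - y_hat) ** 2)

Y     = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # targets y^(k)
Y_hat = np.array([[0.9, 0.2], [0.1, 0.7], [0.8, 0.6]])  # predictions

# J = (1/n) sum_k L(y^(k), yhat^(k)); because J is a plain sum, we may
# differentiate sample by sample and add the gradients afterwards.
J = np.mean([loss(y, yh) for y, yh in zip(Y, Y_hat)])
```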
Backpropagation
Let us first try the simplest case: the derivative of the loss with respect to the parameters of the \(i\)-th neuron in the output layer \(l\), i.e. \(\frac{\partial J}{\partial \pmb w^l_i}\). Since \(\pmb{A}^{l}_{i} = \pmb h\left(\left( \pmb A^{l-1}\right)^T \pmb w^{l}_{i} + b^l_i \pmb 1\right)\), the chain rule gives \(\frac{\partial J}{\partial \pmb w^{l}_i} = \frac{\partial \pmb z^{l}_i}{\partial \pmb w^l_i}\frac{\partial \pmb A^{l}_i}{\partial \pmb z^{l}_i}\frac{\partial J}{\partial \pmb A^{l}_i}\), which is easy to obtain. But what about the parameters of the \(j\)-th neuron in layer \(l-1\)? Then \(\frac{\partial J}{\partial \pmb w^{l-1}_j} =
\frac{\partial \pmb z^{l-1}_j}{\partial \pmb w^{l-1}_j}
\frac{\partial \pmb A^{l-1}_j}{\partial \pmb z^{l-1}_j}
\frac{\partial \pmb z^{l}}{\partial \pmb A^{l-1}_j}
\frac{\partial \pmb A^{l}}{\partial \pmb z^{l}}
\frac{\partial J}{\partial \pmb A^{l}}\),
and the trailing factors, already computed when we handled layer \(l\), would have to be recomputed all over again for every neuron. This is exactly why we use the backpropagation algorithm.
As shown in the figure, the reverse of the arrows is the direction of backpropagation. To save time, we keep one variable per layer storing the derivative of the loss with respect to that layer's \(z\): let \(\delta^i = \frac{\partial J}{\partial \pmb z^i}\). Multiplying by \(\frac{\partial \pmb z ^i}{ \partial \pmb W}\) or \(\frac{\partial \pmb z ^i}{ \partial \pmb b}\) then yields the final result.
Suppose we have already computed \(\delta^l = \frac{\partial J}{\partial \pmb z^l}\). Then \(\delta^{l-1} = \frac{\partial \pmb a^{l-1}}{\partial \pmb z^{l-1}}\frac{\partial \pmb z^l}{\partial \pmb a^{l-1}}\frac{\partial J}{\partial \pmb z^l} = \frac{\partial \pmb a^{l-1}}{\partial \pmb z^{l-1}}\frac{\partial \pmb z^l}{\partial \pmb a^{l-1}}\delta^{l}\), which gives us a backward recurrence, namely
\(\delta^{i-1} = \pmb h'\left(\pmb z^{i-1}\right) \odot \left(\pmb W^{i}\right)^T \delta^{i}\),
where \(\pmb W^i\) is the matrix whose rows are the \(\left(\pmb w^i_j\right)^T\), \(\odot\) is the elementwise product, and \(h'(z) = h(z)\left(1 - h(z)\right)\) for the sigmoid.
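A minimal sketch of one step of this recurrence for a single sample (again collecting the weight rows in a matrix `W_next`, an assumption of the sketch):

```python
import numpy as np

def h_prime(z):
    """Sigmoid derivative h'(z) = h(z) (1 - h(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backward_step(delta_next, W_next, z_prev):
    """One step of the backward recurrence (single sample):
    delta^{i-1} = h'(z^{i-1}) * (W^i)^T delta^i  (elementwise product).
    delta_next : (l_i,)          delta^i of the layer above
    W_next     : (l_i, l_prev)   rows are the (w^i_j)^T
    z_prev     : (l_prev,)       z^{i-1} cached from the forward pass"""
    return h_prime(z_prev) * (W_next.T @ delta_next)
```

Caching each \(\delta^i\) this way is what turns the repeated chain-rule products above into a single backward sweep.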
Since \(a^l_j = \hat{y}_j\) (the outputs of the last layer are exactly the components of the prediction), we get \(\frac{\partial J}{\partial a^l_j} = \frac{\partial \mathcal{L}(\pmb{y}, \hat{\pmb{y}})}{\partial a^l_j} = \frac{\partial \mathcal{L}(\pmb{y}, \hat{\pmb{y}})}{\partial \hat{y}_j}\). This last partial derivative can be computed as soon as the loss function is known, which gives the base case of the recurrence, \(\delta^l_j = h'\left(z^l_j\right)\frac{\partial J}{\partial a^l_j}\).
To get the partial derivative with respect to the parameters of any layer, we simply take the \(\delta\) at the corresponding branch point in the graph and multiply it by the partial derivative of \(z\) with respect to that parameter.
That is, for a single sample,
\(\frac{\partial J}{\partial \pmb w^i_j} = \delta^i_j \, \pmb a^{i-1}, \qquad \frac{\partial J}{\partial b^i_j} = \delta^i_j,\)
since \(z^i_j = \left(\pmb a^{i-1}\right)^T \pmb w^i_j + b^i_j\).
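In code, these per-parameter formulas become a short sketch (single sample; `layer_gradients` is a hypothetical helper name):

```python
import numpy as np

def layer_gradients(delta, a_prev):
    """Gradients of J w.r.t. one layer's parameters (single sample):
    dJ/dw^i_j = delta^i_j * a^{i-1}   and   dJ/db^i_j = delta^i_j.
    delta  : (l_i,)     delta^i of this layer
    a_prev : (l_prev,)  a^{i-1}, the previous layer's outputs
    Returns dW with row j = dJ/dw^i_j, and db = delta itself."""
    dW = np.outer(delta, a_prev)  # outer product: one row per neuron j
    return dW, delta
```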