Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010), ReLU(x) = max(x, 0), careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and training would accelerate.
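To make the two ideas in this passage concrete, here is a minimal NumPy sketch: `relu` is the activation as defined above, and `batch_normalize` is an illustrative helper (my own simplification, not the paper's full BN transform, which also learns a scale γ and shift β) that stabilizes the distribution of nonlinearity inputs over a mini-batch.

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: max(x, 0). Positive inputs pass through
    # unchanged, so gradients do not vanish in the positive regime.
    return np.maximum(x, 0.0)

def batch_normalize(x, eps=1e-5):
    # Normalize each feature across the mini-batch to zero mean and
    # unit variance, keeping the distribution of nonlinearity inputs
    # stable as the network trains. (The paper's BN additionally
    # applies learnable scale/shift parameters gamma and beta.)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

# Example: a mini-batch of 4 samples with 3 features, deliberately
# shifted and scaled to mimic an unstable input distribution.
batch = np.random.randn(4, 3) * 10 + 5
activations = relu(batch_normalize(batch))
```

Even with the shifted, scaled inputs, the normalized values fed into `relu` stay near zero mean and unit variance, which is the stability property the paper argues accelerates training.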
From: https://www.cnblogs.com/hahaah/p/16859961.html