Slide: http://lamda.nju.edu.cn/weixs/slide/CNNTricks_slide.pdf
Blog post: http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html
1) data augmentation;
2) pre-processing on images;
3) initializations of networks;
4) some tips during training;
5) selections of activation functions;
6) diverse regularizations;
7) some insights found from figures; and finally
8) methods of ensembling multiple deep networks.
Sec. 1: Data Augmentation
The training set is limited, so data augmentation can be used to enlarge it:
- (1) Simple transformations: horizontal flipping, random crops, and color jittering.
- (2) Combinations of the simple operations in (1).
- (3) Fancy PCA, proposed by Krizhevsky et al. [1]: alter the intensities of the RGB channels in training images.
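A minimal per-image numpy sketch of fancy PCA (in [1] the PCA is computed over the RGB values of the whole training set rather than a single image; the function name and the [0, 1] pixel range are illustrative assumptions, while the 0.1 standard deviation follows [1]):

import numpy as np

def fancy_pca(img, alpha_std=0.1):
    # img: float array of shape (H, W, 3), pixel values assumed in [0, 1]
    pixels = img.reshape(-1, 3)
    pixels = pixels - pixels.mean(axis=0)           # zero-center the RGB values
    cov = np.cov(pixels, rowvar=False)              # 3x3 covariance of the RGB channels
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigen-decomposition of the covariance
    alphas = np.random.normal(0.0, alpha_std, 3)    # one random magnitude per principal component
    shift = eigvecs @ (alphas * eigvals)            # perturbation along the principal components
    return np.clip(img + shift, 0.0, 1.0)           # add the same shift to every pixel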
Sec. 2: Pre-Processing
(1) Zero-center + normalize:
Python implementation:
>>> X -= np.mean(X, axis = 0) # zero-center
>>> X /= np.std(X, axis = 0) # normalize
(2) PCA whitening: zero-center --> compute the covariance matrix (the correlation structure of the data) --> decorrelate the data --> whiten.
Python implementation:
>>> X -= np.mean(X, axis = 0) # zero-center
>>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix
Decorrelate the data: project the original (zero-centered) data onto the eigenbasis.
>>> U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix
>>> Xrot = np.dot(X, U) # decorrelate the data
Whiten: divide each dimension in the eigenbasis by its eigenvalue to normalize the scale.
>>> Xwhite = Xrot / np.sqrt(S + 1e-5) # divide by the eigenvalues (which are square roots of the singular values)
Sec. 3: Initializations
(1) All-Zero Initialization
Idea: with proper data normalization it is reasonable to expect roughly half of the weights to end up positive and half negative, so zero looks like a natural guess.
Drawback: no source of asymmetry between neurons, so they all compute the same outputs and updates.
(2) Initialization with Small Random Numbers
Advantage: symmetry breaking.
Idea: the neurons are all random and unique in the beginning, so they compute distinct updates.
Example 1: weights ~ 0.001 × N(0, 1), where N(0, 1) is a zero-mean, unit standard deviation Gaussian.
Example 2: small numbers drawn from a uniform distribution.
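In the same >>> style as the other snippets in these notes, a minimal sketch (D and H are hypothetical layer dimensions, not defined anywhere above):
>>> w = 0.01 * np.random.randn(D, H) # small Gaussian random numbers break the symmetry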
(3) Calibrating the Variances
Idea: normalize the variance of each neuron's output to 1 by scaling its weight vector by the square root of its fan-in; this derivation, however, does not take ReLUs into account.
Python implementation:
>>> w = np.random.randn(n) / np.sqrt(n) # calibrating the variances with 1/sqrt(n)
(4) Current Recommendation
He et al. [4] derive an initialization specifically for ReLUs: initialize the weights with variance 2.0/n, where n is the fan-in.
Python implementation:
>>> w = np.random.randn(n) * np.sqrt(2.0/n) # current recommendation
Sec. 4: During Training
- Filters and pooling size. Input image sizes prefer powers of 2; use small filters (e.g., 3×3) and small strides (e.g., 1) with zero-padding; the pooling size is typically 2×2.
- Learning rate. Use the validation set to pick an appropriate learning rate; also, following Ilya Sutskever [2], divide the gradients by the mini-batch size, so the learning rate need not change when the batch size changes (see the sketch after this list).
- Fine-tune on pre-trained models. Consider the size of your new dataset and its similarity to the dataset the pre-trained model was trained on:
- (1) If your data is similar to the pre-training data, simply train a linear classifier on features extracted from the top layers of the pre-trained model.
- (2) If you additionally have plenty of data, fine-tune the top layers of the pre-trained model with a small learning rate.
- (3) If your dataset is very different from the pre-training dataset but you have many training images, most of the layers should be fine-tuned on your data with a small learning rate.
- (4) If your dataset is small and very different from the pre-training dataset, just train a linear classifier.
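A minimal numpy sketch of the learning-rate suggestion above (the names sgd_step, w, and per_example_grads are illustrative):

import numpy as np

def sgd_step(w, per_example_grads, lr=0.01):
    g = np.sum(per_example_grads, axis=0) / len(per_example_grads)  # divide the summed gradient by the mini-batch size
    return w - lr * g  # so the learning rate need not change when the batch size changes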
Sec. 5: Activation Functions (non-linearity)
| Activation | Notes |
| --- | --- |
| Sigmoid | σ(x) = 1 / (1 + exp(-x)): large negative numbers become 0 and large positive numbers become 1. |
| tanh(x) | Range [-1, 1]. 1) Its activations saturate; 2) it is zero-centered. |
| ReLU | f(x) = max(0, x). |
| Leaky ReLU | Fixes the "dying ReLU" problem: f(x) = αx if x < 0 (α: a small constant) and f(x) = x if x ≥ 0. (Con) The results are not always consistent. |
| Parametric ReLU (PReLU) | α is learned from the data rather than pre-defined [4]; in Leaky ReLU α is fixed. |
| Randomized ReLU (RReLU) | α is a random variable drawn from a given range during training and is fixed during testing [5]. (Pro) It can reduce overfitting. |
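For reference, minimal numpy sketches of these activations (PReLU is omitted since its α is a learned parameter; the RReLU sampling range below is illustrative rather than taken from [5]):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes x to (0, 1)

def relu(x):
    return np.maximum(0.0, x)              # f(x) = max(0, x)

def leaky_relu(x, alpha=0.01):             # alpha: small fixed constant
    return np.where(x >= 0, x, alpha * x)

def rrelu_train(x, low=0.125, high=0.333):
    alpha = np.random.uniform(low, high, size=x.shape)  # random alpha while training
    return np.where(x >= 0, x, alpha * x)               # at test time alpha is fixed (e.g., to its mean)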
Sec. 6: Regularizations
- L2 regularization: add (1/2)λw^2 to the objective for every weight w, where λ is the regularization strength. (It heavily penalizes peaky weight vectors and prefers diffuse weight vectors.)
- L1 regularization: add λ|w| to the objective for every weight w. It can be combined with L2 as λ1|w| + λ2w^2 (elastic net regularization).
- Max norm constraints. Enforce an absolute upper bound on the magnitude of the weight vector of every neuron (||w||_2 < c, with c typically 3 or 4) and use projected gradient descent to enforce the constraint. Updates are bounded, so the network cannot "explode" even when the learning rate is set too high.
- Dropout [6]: during training, dropout samples a sub-network and only the parameters of that sampled network are updated based on the input data. A neuron is kept active with some probability p (a hyper-parameter) or set to zero otherwise; at test time there is no dropout. A dropout ratio of p = 0.5 is a reasonable default.
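A minimal sketch of the inverted-dropout variant (which rescales at training time so the test-time forward pass needs no change; h stands for a layer's activations):

import numpy as np

def dropout_train(h, p=0.5):                   # p: probability of keeping a neuron active
    mask = (np.random.rand(*h.shape) < p) / p  # drop and rescale in one step
    return h * mask

def dropout_test(h):
    return h                                   # no dropout (and no rescaling) at test time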
Sec. 7: Insights from Figures
- Learning rate: check it against the loss curve; a rate that is too high makes the loss explode or oscillate, while one that is too low makes it decrease very slowly.
- Loss curve: the "width" of the curve is related to the batch size; if the curve looks too wide (noisy), the batch size may be too small.
- Accuracy curve: a large gap between training and validation accuracy indicates over-fitting; if there is no gap but both accuracies are low, the model's capacity may be too limited.
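A minimal matplotlib sketch for inspecting such figures, assuming you have logged per-epoch train_loss, train_acc, and val_acc lists yourself (all names here are illustrative):

import matplotlib.pyplot as plt

def plot_curves(train_loss, train_acc, val_acc):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(train_loss)                   # a very noisy ("wide") curve may mean the batch size is too small
    ax1.set_xlabel("epoch")
    ax1.set_ylabel("training loss")
    ax2.plot(train_acc, label="train")
    ax2.plot(val_acc, label="validation")  # a large train/validation gap indicates over-fitting
    ax2.set_xlabel("epoch")
    ax2.set_ylabel("accuracy")
    ax2.legend()
    plt.show()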
Sec. 8: Ensemble [8]
- Same model, different initialization. Use cross-validation to determine the best hyper-parameters, then train multiple models with those hyper-parameters but different random initializations.
- Top models discovered during cross-validation. Use cross-validation to determine the best hyper-parameters, then take the top n models to form the ensemble. (The risk is that the ensemble may include sub-par models.)
- Different checkpoints of a single model. When training is very expensive, take different checkpoints of a single network over time and use them to form the ensemble. (This lacks diversity, but it is cheap.)
- Some practical examples. If your task is high-level image semantics, you can apply multiple deep models trained on different data sources to extract different and complementary deep representations.
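A minimal sketch of averaging the ensemble members' predictions (models and predict_proba are illustrative names; each model is assumed to return class probabilities of shape (num_classes,)):

import numpy as np

def ensemble_predict(models, x):
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)  # average the models' outputs
    return int(np.argmax(probs))                                   # final predicted class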
Miscellaneous
Problems:
Data: class imbalance, i.e., some classes have a large number of training images/instances while others have very few.
Method 1: balance the training data by directly up-sampling and down-sampling the imbalanced classes [10] (see the sketch below).
Method 2: generate extra samples for the rare classes via crop processing [7].
Method 3: adjust the fine-tuning strategy to cope with the imbalance.
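A minimal up-sampling sketch for method 1 (samples and labels are illustrative names; rare classes are re-sampled with replacement until every class matches the largest one):

import numpy as np

def upsample_balance(samples, labels):
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()                                    # size of the largest class
    idx = []
    for c in classes:
        members = np.where(labels == c)[0]
        idx.extend(np.random.choice(members, size=target, replace=True))  # sample with replacement
    idx = np.random.permutation(np.array(idx))               # shuffle the balanced index list
    return [samples[i] for i in idx], labels[idx]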