Questions
- How to combat overfitting
- Difference between Random Forest and XGBoost
- How to handle missing values?
- How to train a regression tree?
- Difference between Gradient Descent (GD) and Stochastic Gradient Descent (SGD)
- More Variants of SGD
- Non-differentiable?
- L1 & L2 Regularization
- Batch Normalization (BN) vs Layer Normalization (LN)
- Dropout Principles & Implementation
- Order of Normalization, Activation and Dropout
- Implement Cross Entropy Loss
- Implement Focal Loss
How to combat overfitting
- Regularization (Search for regularization coefficient)
- Dropout (Randomly drops neurons during training)
- Simplify Model (Reduce the number of trainable parameters)
- Early Stopping (stop training when validation performance stops improving; a minimal sketch follows this list)
- Ensemble Methods (Bagging like Random Forest, Boosting like XGBoost)
- Data Augmentation
- Feature Selection (Remove irrelevant or highly correlated features)
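As an illustration of the Early Stopping item above, here is a minimal sketch of a training loop that stops once the validation loss has not improved for patience epochs; train_one_epoch and evaluate are hypothetical placeholders for your own routines.
def train_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=100, patience=5):
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0  # reset the counter whenever validation improves
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop: no validation improvement for `patience` consecutive epochs
    return model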
Difference between Random Forest and XGBoost
- Random Forest trains multiple decision trees independently, each on a random subset of the whole dataset. During inference, the trees process the input independently and their results are aggregated (majority vote for classification, average for regression).
- XGBoost trains multiple decision trees sequentially, where each later tree tries to correct the errors made by the earlier ones. During inference, the first tree produces an initial prediction and every following tree produces an increment that adjusts the previous prediction; the final prediction is the sum of the outputs of all trees. (A short sketch of both APIs follows the table below.)
| Feature | Random Forest | XGBoost |
| --- | --- | --- |
| Ensemble Method | Bagging | Boosting |
| Tree Growth | Independent, deep trees | Sequential, shallow trees |
| Objective Function | Averaging/voting | Gradient-based with regularization |
| Bias-Variance Tradeoff | Reduces variance | Reduces both bias and variance |
| Speed | Faster due to parallelism | Slower, but optimized with advanced techniques |
| Interpretability | More interpretable | Less interpretable without tools like SHAP |
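As a hedged, concrete example of the two paradigms, the sketch below fits both models on a synthetic dataset, assuming scikit-learn and the xgboost package are installed; the hyperparameters are illustrative, not tuned.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Random Forest: many independent (typically deep) trees; predictions are voted/averaged.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# XGBoost: shallow trees added sequentially, each one correcting the previous errors.
xgb = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, random_state=0)
xgb.fit(X, y)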
How to handle missing values?
- Fill the missing value with the mean, median, or majority (mode) of the feature.
- Fill the missing value by predicting it with another model.
- Treat it as a special value; XGBoost and LightGBM can handle missing values automatically. (A short imputation sketch follows this list.)
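A small pandas sketch of the first strategy, using a toy DataFrame with hypothetical column names:
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["A", None, "B"]})

# Numerical column: fill with the mean (the median is more robust to outliers).
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: fill with the most frequent value (majority).
df["city"] = df["city"].fillna(df["city"].mode()[0])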
How to train a regression tree?
A regression tree splits the feature space into several subspaces, and each subspace is assigned a constant prediction value. The split is usually chosen to minimize the MSE or MAE of the resulting partition; the corresponding prediction for each subspace is the mean (for MSE) or the median (for MAE) of the targets that fall into it.
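A minimal sketch of how one split could be chosen for a single feature under the MSE criterion (each side is predicted by its mean); real implementations repeat this over all features and recurse into the two subsets.
import numpy as np

def best_split(x, y):
    # x: (N,) values of one feature, y: (N,) regression targets.
    # Try every midpoint between consecutive sorted feature values and keep the
    # threshold whose two sides have the smallest total squared error around their means.
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_error = None, np.inf
    for i in range(1, len(x)):
        threshold = (x[i - 1] + x[i]) / 2
        left, right = y[:i], y[i:]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold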
Difference between Gradient Descent (GD) and Stochastic Gradient Descent (SGD)
GD uses the whole dataset to perform forward propagation, compute the gradient of the loss function, and back-propagate. In contrast, SGD uses only one sample for each propagation and parameter update.
Usually, GD converges more smoothly but its computation cost is high, especially for large datasets. SGD converges more quickly but is noisier and less stable, which means the order of the samples can influence the final performance.
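To make the contrast concrete, here is a minimal numpy sketch of one GD step and one SGD step for linear regression with squared loss; the learning rate and model are illustrative.
import numpy as np

def gd_step(w, X, y, lr=0.01):
    # Full-batch gradient of the mean squared error over the whole dataset.
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def sgd_step(w, X, y, lr=0.01):
    # Gradient estimated from a single randomly chosen sample: cheap but noisy.
    i = np.random.randint(len(y))
    xi, yi = X[i], y[i]
    return w - lr * 2 * xi * (xi @ w - yi)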
More Variants of SGD
- Mini-Batch Gradient Descent (MBGD; sometimes just called Batch Gradient Descent, BGD)
Uses a mini-batch of $B$ samples to compute the gradient. When $B=1$ it becomes SGD; when $B=\mathrm{len}(\text{Dataset})$ it becomes GD. It is more stable than SGD and less resource-consuming than GD.
- Momentum-Based SGD
Momentum acts like a velocity vector that accumulates past gradients, so the parameters move more consistently, which accelerates convergence, dampens fluctuation, and makes it more likely to escape local minima.
- Adagrad (Adaptive Gradient Algorithm)
Scales down the updates of frequently updated parameters (dividing by the square root of the accumulated sum of past squared gradients), which prevents fluctuation. However, the accumulated denominator keeps growing, so the effective learning rate decays toward 0 and training can stall earlier than expected.
- RMSprop (Root Mean Square Propagation)
To prevent the effective learning rate from shrinking to zero, RMSprop replaces Adagrad's denominator with an exponentially decaying average of past squared gradients, which stays stable and does not vanish.
- Adam (Adaptive Moment Estimation)
Combines RMSprop and Momentum: the numerator is an exponentially decaying average of past gradients (the momentum), and the denominator is an exponentially decaying average of past squared gradients.
- AdamW
Adam implements L2 regularization by adding the penalty term to the loss before computing gradients, so it gets scaled by the adaptive learning rate; AdamW instead decouples the weight decay and applies it directly when each parameter is updated. (A numpy sketch of the momentum, RMSprop, and Adam updates follows this list.)
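As a rough illustration (not tied to any particular library), here is a minimal numpy sketch of the momentum, RMSprop, and Adam update rules described above; the hyperparameters are common defaults and Adam's bias correction is omitted for brevity.
import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    # Velocity accumulates past gradients, smoothing and accelerating the updates.
    v = beta * v + grad
    return w - lr * v, v

def rmsprop_step(w, grad, s, lr=0.001, beta=0.9, eps=1e-8):
    # Exponentially decaying average of squared gradients scales each coordinate.
    s = beta * s + (1 - beta) * grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, m, s, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Numerator: decaying average of gradients (momentum);
    # denominator: decaying average of squared gradients (RMSprop-style scaling).
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    return w - lr * m / (np.sqrt(s) + eps), m, s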
Non-differentiable?
- Subgradient: any slope $g$ that "lies under" the function, i.e. $f(y) \ge f(x) + g^\top(y-x), \forall y$. Where a convex function is differentiable the subgradient is unique (it equals the gradient); at a non-differentiable point of a convex function the subgradients form a set; for a non-convex function a subgradient may not exist.
- Proximal Gradient: take a gradient step on the differentiable part first, then apply a proximal operator for the non-differentiable part (see the soft-thresholding sketch after this list).
- Smoothed Approximations: Find a differentiable approximation.
- Gradient-Free Optimization Methods: Genetic Algorithm or Simulated Annealing.
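For example, for an L1 term $\lambda\|w\|_1$ the proximal operator is soft-thresholding, applied after a gradient step on the differentiable part of the objective; a minimal sketch:
import numpy as np

def soft_threshold(w, lam):
    # Proximal operator of lam * ||w||_1: shrink each coordinate toward zero
    # and set it exactly to zero when its magnitude is below lam.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)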
L1 & L2 Regularization
L1 Regularization tends to produce sparse parameters where many weights are exactly zero, while L2 Regularization tends to produce small but non-zero weights.
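A minimal PyTorch sketch of adding both penalties to a task loss; the model and the coefficients are placeholders you would tune.
import torch

def regularized_loss(base_loss, model, l1_coef=0.0, l2_coef=0.0):
    # base_loss: the task loss already computed; model: any torch.nn.Module.
    l1_term = sum(p.abs().sum() for p in model.parameters())   # encourages exact zeros
    l2_term = sum((p ** 2).sum() for p in model.parameters())  # encourages small weights
    return base_loss + l1_coef * l1_term + l2_coef * l2_term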
Batch Normalization (BN) vs Layer Normalization (LN)
BN is usually applied to CNNs and LN is usually applied to RNNs or other sequential models. BN normalizes each channel (or each position in the sequence) across the batch. For sequential data, the sequences are not guaranteed to have the same length; even if we pad them, there may not be enough data at each position to train robust BN statistics. Therefore, we usually apply LN to sequential data. LN normalizes across the features of each individual sample, so it does not require batch statistics and is applicable to sequences of any length.
BN additionally maintains a running (global) mean and std for inference, to avoid the instability caused by small batch sizes. LN behaves identically during training and inference.
Normalization keeps the hidden states in a stable distribution and helps prevent gradient explosion and vanishing. Treating each block as an independent classifier, normalization ensures that the input of each classifier follows a similar distribution; otherwise the input distribution may drift considerably as the network gets deeper.
Normalization also provides some regularization, because it injects a small amount of noise into the hidden layers (for BN, through the batch statistics).
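A small PyTorch sketch contrasting the normalization axes for an input of shape (batch, seq_len, features); the shapes are illustrative.
import torch

x = torch.randn(8, 16, 32)  # (batch, seq_len, features)

# LayerNorm: statistics over the feature dimension, computed independently per sample and position.
ln = torch.nn.LayerNorm(32)
out_ln = ln(x)

# BatchNorm1d expects (batch, channels, length): statistics over batch and length for each channel.
bn = torch.nn.BatchNorm1d(32)
out_bn = bn(x.transpose(1, 2)).transpose(1, 2)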
Dropout Principles & Implementation
During each training iteration, randomly set the output of certain neurons to 0 to perform regularization.
- Discourages dependencies (co-adaptation) among neurons, making the network generalize better.
- The randomly sampled subnetworks can be viewed as sub-models of an implicit ensemble that is averaged at inference time.
- During training, the output needs to be rescaled by $\frac{1}{1-p}$, because each neuron is dropped with probability $p$, so that the expected magnitude of the output stays consistent (inverted dropout).
import numpy as np

def dropout(x, prob):
    # Keep each element with probability (1 - prob); dropped elements become 0.
    mask = (np.random.rand(*x.shape) > prob).astype(x.dtype)
    # Inverted dropout: rescale by 1/(1 - prob) so the expected output is unchanged.
    return x * mask / (1 - prob)
Remember to use * to unpack the tuple x.shape.
Order of Normalization, Activation and Dropout
Usually the order is normalization -> activation -> dropout. Normalization keeps the layer input in a stable distribution so that training stays stable; dropout is then applied to the output of the activation layer to perform regularization.
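A minimal PyTorch sketch of one hidden block following this order; the sizes and dropout rate are illustrative.
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(128, 128),
    nn.LayerNorm(128),   # normalization: stabilize the input distribution
    nn.ReLU(),           # activation: non-linearity
    nn.Dropout(p=0.1),   # dropout: regularize the activation output
)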
Implement Cross Entropy Loss
The definition of Cross Entropy Loss is $\mathcal{L}(p, q) = -\sum_i p_i\log q_i$. Literally, the computation should go through every value of the distribution. However, for most classification tasks the ground truth is a one-hot encoding: the probability of one class is 1 and the probabilities of all other classes are 0. Therefore, for each sample the loss is simply $-\log q_{\text{ground truth}}$.
Notice that here we assume the predicted input is already after softmax, whereas torch.nn.CrossEntropyLoss applies a log-softmax internally and therefore expects raw logits.
import torch

def CrossEntropy(y, y_pred):
    # y: (B,) ground-truth class indices; y_pred: (B, C) post-softmax probabilities
    prob = y_pred[torch.arange(y_pred.shape[0]), y]  # probability of the ground-truth class per sample
    log_prob = torch.log(prob)
    return -torch.sum(log_prob)
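A quick sanity check of the implementation above, under the same shape assumptions: torch.nn.functional.cross_entropy expects raw logits and applies log-softmax internally, so feeding it logits should match our function applied to the softmaxed probabilities.
import torch.nn.functional as F

logits = torch.randn(4, 3)      # (B, C) raw scores
y = torch.tensor([0, 2, 1, 2])  # (B,) ground-truth classes

manual = CrossEntropy(y, F.softmax(logits, dim=1))
builtin = F.cross_entropy(logits, y, reduction="sum")
print(torch.allclose(manual, builtin))  # expected: True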
Implement Focal Loss
Focal Loss is designed to assign more weight to poorly predicted samples; it can be seen as a variant of Cross Entropy Loss. In detail, if the predicted probability of the ground-truth class is $p$, the sample is weighted by $(1-p)^\gamma$, which emphasizes poorly predicted samples and down-weights well predicted ones.
def Focal_Loss(y, y_pred, gamma):
    # y: (B,) ground-truth class indices; y_pred: (B, C) post-softmax probabilities
    prob = y_pred[torch.arange(y_pred.shape[0]), y]  # probability of the ground-truth class per sample
    log_prob = torch.log(prob)
    # Weight each sample's log-probability by (1 - p)^gamma before summing.
    return -torch.sum(((1 - prob) ** gamma) * log_prob)
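A quick sanity check under the same assumptions: with gamma = 0 the weight $(1-p)^\gamma$ is always 1, so Focal_Loss should reduce to the CrossEntropy implemented earlier.
y_pred = torch.softmax(torch.randn(4, 3), dim=1)  # (B, C) post-softmax probabilities
y = torch.tensor([0, 2, 1, 2])                    # (B,) ground-truth classes

# With gamma = 0 the modulating factor (1 - p)^gamma equals 1 for every sample.
print(torch.allclose(Focal_Loss(y, y_pred, gamma=0), CrossEntropy(y, y_pred)))  # expected: True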