1. What is Machine Learning?
Machine Learning = Statistical Learning. Namely, there is an unknown distribution over (x1, x2,..., xm, y), and the task is to infer y when (x1, x2,..., xm) is known. Although the distribution is unknown, so the relation between (x1, x2, ..., xm) and y is not known either, we can take a sample from the distribution, in which each (x1, x2,..., xm, y) is known, get the relation between (x1, x2,..., xm) and y of the sample, and then use the relation between (x1, x2, ..., xm) and y of the sample to estimate the relation between (x1, x2, ..., xm) and y of the distribution, based on some statistical rules.
2. So with (X, y) known in the sample, how to get f:X->y of the sample?
Note: X = vector {x1, x2, ..., xm}
The answer is a universal approximator, namely a function with parameters which can fit any function to any precision given a combination of enough units.
Some universal approximators are:
neural network: its unit is one node in one layer, whose mathematical form is
activation(wx + b), where activation is a non-linear function such as relu, sigmoid etc. The purpose of activation is to introduce non-linearity into the neural network. A neural network with one layer and enough units can also fit any function, but compared to a neural network with multiple layers, experience shows it needs more units; more units = more parameters, and more parameters = larger estimation error (the reason will be illustrated later), so usually a neural network with multiple layers is used. This is also called deep learning.
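As a minimal numpy sketch of the unit above (the weights here are random just to show the computation, not learned; in practice they are obtained by fitting the sample), one layer of units activation(wx + b) combined linearly looks like:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# One layer of units, each computing activation(wx + b); the unit
# outputs are then combined linearly into the inferred y.
rng = np.random.default_rng(0)
n_units, n_features = 16, 3
W = rng.normal(size=(n_units, n_features))  # one weight vector w per unit
b = rng.normal(size=n_units)                # one bias per unit
v = rng.normal(size=n_units)                # output weights over the units

x = np.array([0.5, -1.0, 2.0])
hidden = relu(W @ x + b)      # each entry is activation(wx + b) for one unit
y_infer = float(v @ hidden)   # linear combination of the units
```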
decision forest: its unit is one node in one decision tree. Though one decision tree with enough nodes can also fit any function, experience shows that using multiple decision trees (namely a decision forest) is better at relieving estimation error.
polynomial: its unit is \(x^k\).
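A quick numpy illustration of the universal-approximation idea using the polynomial approximator (the target function sin and the degrees are illustrative assumptions): as more units \(x^k\) are added, the fit error shrinks.

```python
import numpy as np

# Fit sin(x) on [0, pi] by least squares with polynomials sum_k w_k x^k
# of increasing degree; the maximum fit error shrinks as terms are added.
x = np.linspace(0.0, np.pi, 200)
y = np.sin(x)

errors = []
for degree in (1, 3, 5):
    coeffs = np.polyfit(x, y, degree)   # least-squares fit of the coefficients
    y_infer = np.polyval(coeffs, x)
    errors.append(np.max(np.abs(y - y_infer)))
# errors decreases as degree grows
```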
With one universal approximator with parameters (this is also called a model), compute the parameters by fitting it to the sample. Namely, assuming the model is $ y_{infer} = f(x; w) $, where $ w $ is the parameter, $ w = \underset{w}{\operatorname{argmin}} |y_{true} - y_{infer}| $. Namely, to get the parameter, we just need to minimize $ |y_{true} - y_{infer}| $; this is also called fitting.
Minimization is also called optimization in mathematics. A smooth function, namely one which has a gradient everywhere, is easy to deal with. $ f(y_{infer}) = |y_{true} - y_{infer}| $ is not a smooth function, while $ f(y_{infer}) = (y_{true} - y_{infer})^2 $ is a smooth function, and they have their minimum at the same $ y_{infer} $, and $ y_{infer} = f(x; w) $, so also at the same $ w $; so usually the objective function to optimize is $ f(w) = (y_{true} - y_{infer})^2 $. Note, here y is assumed numeric, not categorical; the objective function to optimize for categorical y will be illustrated later. The objective function to optimize is also called the loss function in Machine Learning.
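A minimal sketch of this fitting, assuming a linear model $ y_{infer} = Xw $ and plain gradient descent on the squared loss (the data, learning rate and step count are illustrative assumptions):

```python
import numpy as np

# Gradient descent on f(w) = mean (y_true - y_infer)^2 for a linear model
# y_infer = X @ w.  w_true is used only to generate the sample; the
# learner sees only (X, y_true).
rng = np.random.default_rng(0)
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(100, 2))
y_true = X @ w_true

w = np.zeros(2)
lr = 0.1
for _ in range(200):
    y_infer = X @ w
    grad = -2.0 * X.T @ (y_true - y_infer) / len(X)  # gradient of the loss
    w -= lr * grad   # step in the direction of the negative gradient
# w is now close to w_true
```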
3. With $ f: X \rightarrow y $ known for the sample, how to estimate $ f: X \rightarrow y $ for the distribution?
3.1 PAC (Probably Approximately Correct) learnable
A model H is defined as PAC learnable if there exists a function $ m_H: (0, 1)^2 \rightarrow N $ and a learning algorithm with the following property:
for every $ \epsilon, \delta \in (0, 1) $ and every distribution D over (X, y), when running the learning algorithm on $ m \geq m_H(\epsilon, \delta) $ i.i.d examples generated by D, the algorithm returns an $ h: X \rightarrow y $ such that, with probability of at least $ 1 - \delta $ (over the choice of the m training examples), $ L_{D}(h) \leq \underset{h' \in H}{\min} L_{D}(h') + \epsilon $.
Note:
$ L_D $ : loss of the distribution.
3.2 neural network and decision forest are PAC learnable.
3.2.1 Finite VC dimension $\Leftrightarrow$ PAC learnable in binary classification.
definition of VC dimension: the VC dimension of a model H is the maximum size of a set $ C \subset X $ that can be shattered by H.
definition of shattering: a model H shatters a finite set $ C \subset X $ if the restriction of H to C is the set of all functions from C to {0, 1}.
definition of restriction of H to C: the restriction of H to C is the set of functions from C to {0, 1} that can be derived from H, namely,
$ H_C = \{(h(c_1), h(c_2), \ldots, h(c_m)) : h \in H\} $.
3.2.2 the fundamental theorem of statistical learning:
Let H be a model from a domain X to {0,1}, and let the loss function be the 0-1 loss. Assume that $ VCdim(H) = d < \infty $. Then there are absolute constants $ C_1, C_2 $ such that:
H is agnostic PAC learnable with sample complexity:
$ C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \leq m_H(\epsilon, \delta) \leq C_2 \frac{d + \log(1/\delta)}{\epsilon^2} $
Note: though the above theorem is stated in the book for binary classification, based on my experience it is ok to generalize it to y that is multi-categorical or numeric.
3.2.3 From the fundamental theorem of statistical learning, it can be seen that sample complexity increases as VC dimension increases. And for neural networks and decision forests, VC dimension increases as the number of parameters increases.
For a neural network, assume there are m training examples. If y is numeric, it is known that m points can determine at most m parameters; if y is categorical, m points can at most put restrictions on m parameters. So if there are more parameters than m, say m+1, the m training examples cannot put any restriction on the (m+1)-th parameter, namely, the (m+1)-th parameter can be any number without influencing the fit to the training examples; but to fit the distribution, there is a restriction on the (m+1)-th parameter. So the number of parameters cannot be more than the number of examples, or the learned model cannot fit the distribution well at all, since the (m+1)-th parameter is set arbitrarily.
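A small numpy check of the "m points decide at most m parameters" claim, using a polynomial with exactly m coefficients as an illustrative model:

```python
import numpy as np

# A polynomial with m coefficients (degree m-1) can pass exactly through
# any m points with distinct x values, so m examples pin down m parameters.
m = 5
rng = np.random.default_rng(0)
x = np.arange(m, dtype=float)
y = rng.normal(size=m)

coeffs = np.polyfit(x, y, m - 1)  # exactly m parameters
residual = np.max(np.abs(np.polyval(coeffs, x) - y))
# residual is ~0: the m parameters fit the m examples exactly
```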
3.2.4 tradeoff between larger model capacity to reduce approximation error and smaller model capacity to reduce estimation error.
$ L_D(A(S)) = L_D(A(S)) - L_D(h^*) + L_D(h^*) = L_D(A(S)) - L_V(A(S)) + L_V(A(S)) - L_S(A(S)) + L_S(A(S)) $
$ L_D(h^*) $ is defined as the approximation error, where $ h^* = \underset{h \in H}{\operatorname{argmin}} L_D(h) $.
$ L_D(A(S)) - L_D(h^*) $ is defined as the estimation error.
A(S) is the model in H learned by algorithm A from sample S.
The goal of Machine Learning is to get a small enough $ L_D(A(S)) $, namely both $ L_D(h^*) $ and $ L_D(A(S)) - L_D(h^*) $ need to be small enough.
So when will $ L_D(h^*) $ be small? $ L_D(h^*) $ will be small if H contains a function which can fit sample S well.
Of course, the larger the capacity of H, the higher the probability that it contains a function which can fit sample S well. But larger capacity of H = larger number of parameters = larger VC dimension = larger sample complexity = with a fixed sample size and fixed $ \epsilon $, $ \delta $, a larger $ L_D(A(S)) - L_D(h^*) $. So a good tradeoff is needed between small $ L_D(h^*) $ and small $ L_D(A(S)) - L_D(h^*) $, namely, a suitable capacity of H must be chosen. A good way is to make $ L_D(h^*) $ small enough first; then if at this point $ L_D(A(S)) - L_D(h^*) $ is large, this means the capacity of H is too large, so reduce the capacity little by little to see at which capacity both $ L_D(h^*) $ and $ L_D(A(S)) - L_D(h^*) $ can be small enough. If the best tradeoff obtainable is still not good enough, this means the sample size is not enough: this sample size can give this result at most, and to get a better result, more training examples are needed.
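A sketch of this capacity search, assuming polynomial degree as the capacity knob and a synthetic distribution (both illustrative assumptions): training loss keeps falling as capacity grows, while validation loss exposes the tradeoff between underfitting and overfitting.

```python
import numpy as np

# Capacity = polynomial degree.  Too low a degree underfits (large
# approximation error); too high a degree overfits the 20-point sample
# (large estimation error), which shows up in the validation loss.
rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, np.sin(3.0 * x) + 0.1 * rng.normal(size=n)

x_train, y_train = sample(20)
x_val, y_val = sample(200)

train_loss, val_loss = {}, {}
for degree in (1, 5, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_loss[degree] = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    val_loss[degree] = np.mean((y_val - np.polyval(coeffs, x_val)) ** 2)
# train_loss keeps shrinking with degree; val_loss is best at a middle degree
```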
In general, as long as the approximation error is greater than zero we expect the training error to grow with the sample size, as a larger amount of data points makes it harder to provide an explanation for all of them. On the other hand, the validation error tends to decrease with the increase in sample size. If the VC dimension is finite, when the sample size goes to infinity, the validation and train errors converge to the approximation error. Therefore, by extrapolating the training and validation curves we can try to guess the value of the approximation error, or at least to get a rough estimate on an interval in which the approximation error resides.
$ L_D(A(S)) - L_V(A(S)) $ can be tightly bounded according to Hoeffding's inequality:
Let $ Z_1, \ldots, Z_m $ be a sequence of i.i.d random variables. Assume that $ E[Z] = \mu $ and $ P[a \leq Z_i \leq b] = 1 $ for every i. Then, for any $ \epsilon > 0 $:
$ P[|\frac{1}{m} \sum_{i=1}^m Z_i - \mu| > \epsilon] \leq 2\exp(-2m\epsilon^2/(b - a)^2) $.
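A quick empirical check of the inequality, assuming Bernoulli(0.5) variables so that a = 0, b = 1 and $ \mu = 0.5 $:

```python
import numpy as np

# Over many repeated samples of m Bernoulli(0.5) variables, the observed
# fraction whose mean deviates from mu by more than epsilon stays below
# the Hoeffding bound 2*exp(-2*m*epsilon^2).
rng = np.random.default_rng(0)
m, mu, eps, trials = 100, 0.5, 0.1, 10_000

means = rng.binomial(1, mu, size=(trials, m)).mean(axis=1)
observed = float(np.mean(np.abs(means - mu) > eps))
bound = 2.0 * np.exp(-2.0 * m * eps ** 2)   # ~0.27 for these values
```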
This is a tighter bound for $ L_D $ than the bound in the definition of agnostic PAC learnable with the sample complexity calculated from the fundamental theorem of statistical learning.
For example, assuming $ loss \in [0, 1] $, VC dimension d = 100, $ \epsilon = 0.1 $, $ \delta = 0.1 $, then according to the fundamental theorem of statistical learning:
$ C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \leq m_H(\epsilon, \delta) \leq C_2 \frac{d + \log(1/\delta)}{\epsilon^2} $
$ C_1 \times 10,230.3 \leq m_H^{agnostic\ PAC}(\epsilon, \delta) \leq C_2 \times 10,230.3 $
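The factor 10,230.3 above can be reproduced directly (assuming the natural log):

```python
import math

# (d + log(1/delta)) / epsilon^2 with d = 100, epsilon = 0.1, delta = 0.1
d, eps, delta = 100, 0.1, 0.1
factor = (d + math.log(1.0 / delta)) / eps ** 2   # ~10230.3
```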
$ P[L_D(A(S)) \leq L_D(h^*) + \epsilon] \geq 1 - \delta $
$ P[|L_V(A(S)) - L_D(A(S))| \leq \epsilon] \geq 1 - 2\exp(-2m_V\epsilon^2/(b-a)^2) $,
if $ m_V = 300 $, $ \epsilon = 0.1 $, then:
$ P[|L_V(A(S)) - L_D(A(S))| \leq \epsilon] \geq 0.995 $
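The 0.995 checks out numerically (loss in [0, 1], so b - a = 1):

```python
import math

# 1 - 2*exp(-2 * m_V * epsilon^2 / (b - a)^2) with m_V = 300, epsilon = 0.1
m_v, eps = 300, 0.1
prob = 1.0 - 2.0 * math.exp(-2.0 * m_v * eps ** 2)   # ~0.995
```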
Note:
The case of large $ L_D(h^*) $ is called underfit, the case of large $ L_D(A(S)) - L_D(h^*) $ is called overfit.
3.2.5 methods to relieve overfit for neural network
3.2.5.1 Regularization
3.2.5.2 Dropout
3.2.5.3 Early Stopping
Gradient Descent optimization updates w in the direction of the negative gradient, since this is the direction in which the objective function decreases fastest, namely, with $ ||\Delta w||_2 $ fixed, the objective function decreases most in this direction. It fits the main trend of the sample first, since the main trend is shared by most examples, so fitting the main trend decreases the loss (namely the objective function of Machine Learning) most, and the main trend is consistent with the distribution since the examples in the sample are generated from the distribution. Then it fits the tiny details of individual examples, which are specific to the sample, not general for the distribution, and fitting these leads to overfit. So when it starts to fit these details, namely when the loss on the validation set does not improve any more, stopping the optimization is good; this is called early stopping.
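A minimal sketch of early stopping, assuming a linear model trained by gradient descent and a patience counter on the validation loss (the data, learning rate and patience value are illustrative assumptions):

```python
import numpy as np

# Train a linear model by gradient descent; stop once the validation loss
# has not improved for `patience` consecutive steps.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X_train = rng.normal(size=(30, 3))
y_train = X_train @ w_true + 0.3 * rng.normal(size=30)
X_val = rng.normal(size=(100, 3))
y_val = X_val @ w_true + 0.3 * rng.normal(size=100)

w = np.zeros(3)
lr, patience = 0.01, 5
best_val, bad_steps = np.inf, 0
for step in range(5000):
    grad = -2.0 * X_train.T @ (y_train - X_train @ w) / len(X_train)
    w -= lr * grad
    val_loss = np.mean((y_val - X_val @ w) ** 2)
    if val_loss < best_val - 1e-9:
        best_val, bad_steps = val_loss, 0
    else:
        bad_steps += 1
        if bad_steps >= patience:
            break   # validation loss stopped improving: early stop
```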