Officially begin
Deep = Many hidden layers
Neural Network
Find a function in a function set.
Goodness of function
Pick the best function
Backpropagation - Backward Pass
The backward pass itself can be viewed as a neural network run in the reverse direction.
Regression
- Stock Market Forecast
- Self-driving Car
- Recommendation
Step 1 : Model
A set of functions
Step 2 : Goodness of Function
$\hat{y}^1$ denotes the true value corresponding to $x^1$.
Loss function L:
$$ L(f)=L(w,b) $$
The loss evaluates how bad the estimated $y$ is, given the input function.
$$ L(w,b)=\sum_{n=1}^{10}(\hat{y}^n-(b+w\cdot x_{cp}^n))^2 $$
Step 3 : Best Function
In linear regression, the loss function L is convex.
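Because the loss is convex, plain gradient descent reaches the global minimum. A minimal sketch (the data values and the learning rate `eta` are made-up assumptions, not from the notes):

```python
import numpy as np

# Toy stand-ins for the 10 (x_cp, y_hat) pairs in the loss; values are made up.
x_cp = np.array([14., 30., 50., 80., 100., 150., 200., 250., 300., 350.])
y_hat = np.array([50., 80., 120., 200., 255., 300., 490., 600., 700., 800.])

w, b, eta = 0.0, 0.0, 1e-6   # learning rate tuned by hand for this toy data
for _ in range(100000):
    err = y_hat - (b + w * x_cp)             # residuals y_hat^n - (b + w * x_cp^n)
    w -= eta * (-2.0 * np.sum(err * x_cp))   # dL/dw
    b -= eta * (-2.0 * np.sum(err))          # dL/db
print(w, b)   # heads toward the unique global minimum because L is convex
```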
Overfitting
Regularization
$$ L(w,b)=\sum_{n=1}^{10}(\hat{y}^n-(b+w\cdot x_{cp}^n))^2+\lambda\cdot \sum(w_i)^2 $$
The bias term is not regularized; $\lambda$ adjusts how smooth the learned function is.
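A minimal sketch of this regularized loss, assuming scalar features as in the formula above (`lam` stands for $\lambda$):

```python
import numpy as np

def regularized_loss(w, b, x, y, lam):
    """Squared-error loss plus L2 penalty; the bias b is not penalized."""
    residual = y - (b + w * x)
    return np.sum(residual ** 2) + lam * np.sum(np.square(w))
```

Larger `lam` values prefer smaller weights, i.e. a smoother function.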
- Gradient descent
- Overfitting and Regularization
Classification
independently and identically distributed (i.i.d.)
$$ L(h^{train},D_{all})-L(h^{all},D_{all}) \leq \delta $$
To guarantee this, we need
$$ \forall h \in \mathcal{H},\ |L(h,D_{train})-L(h,D_{all})| \leq \delta/2 $$
because then $L(h^{train},D_{all}) \leq L(h^{train},D_{train})+\delta/2 \leq L(h^{all},D_{train})+\delta/2 \leq L(h^{all},D_{all})+\delta$.
Revisiting the Digimon example:
The probability of drawing a bad training set:
$$ P(D_{train}\ \text{is bad}) \leq |\mathcal{H}| \cdot 2\exp(-2N\epsilon^2) $$
To push this probability below $\delta$, we need
$$ N \geq \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2} $$
Tradeoff of Model Complexity
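To make the tradeoff concrete: for fixed $\epsilon$ and $\delta$, the required $N$ grows with $\log|\mathcal{H}|$, so a more complex model demands more data. A quick sketch with illustrative numbers:

```python
import math

def required_N(H_size, delta, epsilon):
    """N >= log(2|H|/delta) / (2 * epsilon^2) from the union + Hoeffding bound."""
    return math.log(2 * H_size / delta) / (2 * epsilon ** 2)

# Illustrative assumptions: |H| = 10000 candidate functions, delta = 0.1, epsilon = 0.1.
print(required_N(10000, 0.1, 0.1))  # ~610 training examples suffice
```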
Training data for Classification
Each training example is an $(x^n, \hat{y}^n)$ pair.
Ideal Alternatives
Function (Model):
$$ f(x)=\begin{cases}\text{class 1} & \text{if } g(x)>0\\ \text{class 2} & \text{otherwise}\end{cases} $$
Loss function:
The number of times $f$ gets incorrect results on the training data.
$$ L(f) = \sum_{n}\delta(f(x^n)\neq\hat{y}^n) $$
Find the best function. Note that this 0-1 loss is not differentiable, so it cannot be minimized with gradient descent directly.
- Example : Perceptron, SVM
Prior
$$ P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)} $$
Gaussian
Maximum Likelihood
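A minimal sketch of the maximum-likelihood estimates for a class-conditional Gaussian; the sample data here is randomly generated for illustration:

```python
import numpy as np

def gaussian_mle(x):
    """MLE for a Gaussian: mu* is the sample mean, Sigma* the sample covariance."""
    mu = x.mean(axis=0)
    centered = x - mu
    sigma = centered.T @ centered / len(x)   # (1/N) sum_n (x^n - mu)(x^n - mu)^T
    return mu, sigma

# Made-up 2-D feature vectors standing in for the class-1 training examples.
x_c1 = np.random.default_rng(0).normal(loc=[75.0, 71.0], scale=5.0, size=(79, 2))
mu1, sigma1 = gaussian_mle(x_c1)
# The prior P(C1) would be estimated as N1 / (N1 + N2) from the class counts.
```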
A 2D or 3D array means an array with 2 or 3 axes respectively, whereas an n-dimensional vector means a vector of length n.
Learn something that can really set you apart from others.
Logistic Regression
Function Set
$$ f_{w,b}(x)=\sigma\left(\sum_{i}w_ix_i+b\right) $$
Output : Between 0 and 1 $$ f_{w,b}(x)=P_{w,b}(C_1|x) $$
$$ w^*,b^*=\arg\ \underset{w,b}{\max}\ L(w,b) $$
which is equivalent to
$$ w^*,b^*=\arg\ \underset{w,b}{\min}\ -\ln L(w,b) $$
Cross Entropy:
$$ \text{Distribution } p:\quad p(x=1)=\hat{y}^n,\quad p(x=0)=1-\hat{y}^n $$
$$ \text{Distribution } q:\quad q(x=1)=f(x^n),\quad q(x=0)=1-f(x^n) $$
$$ H(p,q)=-\sum_x p(x)\ln(q(x)) $$
Loss Function
$$ L(f)=\sum_n C(f(x^n),\hat{y}^n) $$
$$ C(f(x^n),\hat{y}^n)=-\left[\hat{y}^n\ln f(x^n)+(1-\hat{y}^n)\ln(1-f(x^n))\right] $$
Update
The gradient-descent update for logistic regression has exactly the same form as for linear regression:
$$ w_i\gets w_i-\eta \sum_{n}-(\hat{y}^n-f_{w,b}(x^n))x_i^n $$
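A minimal sketch of one such update step, assuming inputs `X` of shape (N, d) and labels `y` in {0, 1} (the learning rate is an illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update(w, b, X, y, eta=0.1):
    """One gradient-descent step for logistic regression."""
    f = sigmoid(X @ w + b)          # f_{w,b}(x^n) for all n
    err = y - f                     # (y_hat^n - f_{w,b}(x^n))
    w = w - eta * (-(err @ X))      # w_i <- w_i - eta * sum_n -(err^n) x_i^n
    b = b - eta * (-np.sum(err))
    return w, b
```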
Discriminative (logistic regression) vs. Generative (Gaussian model)
The generative approach makes extra assumptions about the data distribution.
- Benefit of generative model
- With the assumption of probability distribution, less training data is needed
- With the assumption of probability distribution, more robust to the noise
- Priors and class-dependent probabilities can be estimated from different sources.
Multi-class Classification
SoftMax:
$$ y_i'=\text{Softmax}(z_i)=\frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}},\quad 0<y_i'<1,\quad \sum_i y_i'=1 $$
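A minimal sketch; subtracting the max before exponentiating is a standard numerical-stability trick, not something the notes specify:

```python
import numpy as np

def softmax(z):
    """Subtracting max(z) avoids overflow without changing the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

y = softmax(np.array([3.0, 1.0, -2.0]))
print(y, y.sum())   # every entry lies in (0, 1) and they sum to 1
```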
Limitation of Logistic Regression
The decision boundary can only be a straight line.
- Feature Transformation (see the sketch after this list)
- Cascading logistic regression models
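A minimal sketch of the feature-transformation idea on XOR-like data; the two anchor points used for the distances are an illustrative choice:

```python
import numpy as np

# XOR-like points: not linearly separable in the original space.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1, 1, 0, 0])

# Hand-crafted transform: x1' = distance to [0, 0], x2' = distance to [1, 1].
X_new = np.stack([
    np.linalg.norm(X - np.array([0., 0.]), axis=1),
    np.linalg.norm(X - np.array([1., 1.]), axis=1),
], axis=1)
print(X_new)  # both class-0 points map to (1, 1), so one line now separates the classes
```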
Optimization Issue
A network with more layers can perform worse than one with fewer layers even on the training data; this points to an optimization issue rather than overfitting.
Overfitting
- More training data
- Data augmentation
- Constrained model
  - Fewer parameters, sharing parameters
  - Fewer features
  - Early stopping
CNN -> a less flexible, more constrained model
Split the training data into a training set and a validation set
- N-fold Cross Validation
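A minimal sketch of N-fold cross validation, assuming NumPy index arrays (`n_folds` and `seed` are illustrative defaults):

```python
import numpy as np

def n_fold_split(num_examples, n_folds=3, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    idx = np.random.default_rng(seed).permutation(num_examples)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, val_idx

# Pick the model whose average validation error across the folds is lowest.
```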
When Optimization Fails
H : Hessian
Taylor Series Approximation around a critical point $\theta^\prime$ (where the gradient term vanishes):
$$ L(\theta) \approx L(\theta^\prime)+\frac{1}{2}(\theta-\theta^\prime)^TH(\theta-\theta^\prime) $$
- H is positive definite = all eigenvalues are positive -> local minimum
- H is negative definite = all eigenvalues are negative -> local maximum
- Some eigenvalues are positive and some are negative -> saddle point
In high dimensions, points that look like local minima often turn out to be saddle points.
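A minimal sketch of this classification rule, assuming a symmetric Hessian given as a NumPy array:

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from the eigenvalues of the Hessian H."""
    eigvals = np.linalg.eigvalsh(H)        # eigvalsh assumes H is symmetric
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

print(classify_critical_point(np.array([[2., 0.], [0., -1.]])))  # saddle point
```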
Tips for training : Batch and Momentum
Batch
1 epoch = see all the batches once -> shuffle after each epoch
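A minimal sketch of an epoch/batch loop with reshuffling, assuming NumPy arrays (`batch_size` and `seed` are illustrative):

```python
import numpy as np

def epochs(X, y, batch_size, num_epochs, seed=0):
    """Yield mini-batches; the example order is reshuffled at every epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(num_epochs):
        idx = rng.permutation(len(X))      # shuffle after each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]
```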
Momentum
Movement is based not only on the current gradient but also on the previous movement.
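A minimal sketch of one momentum step; `lam` (the momentum coefficient) and `eta` are assumed hyperparameter values:

```python
def momentum_step(theta, m, grad, eta=0.01, lam=0.9):
    """One update: the movement m blends the previous movement and the new gradient."""
    m = lam * m - eta * grad     # m^t = lam * m^{t-1} - eta * g^{t-1}
    theta = theta + m            # theta^t = theta^{t-1} + m^t
    return theta, m
```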
Different parameters need different learning rates.
$$ \theta_i^{t+1} \gets \theta_i^t-\frac{\eta}{\sigma_i^t}g_i^t,\qquad \sigma_i^t=\sqrt{\frac{1}{t+1}\sum_{\tau=0}^t(g_i^\tau)^2} $$
Adagrad
RMSProp
$$ \theta_i^{t+1} \gets \theta_i^t-\frac{\eta}{\sigma_i^t}g_i^t,\qquad \sigma_i^t=\sqrt{\alpha(\sigma_i^{t-1})^2+(1-\alpha)(g_i^t)^2} $$
Adam : RMSProp + Momentum
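A minimal sketch of one Adam step following the standard formulation (the default `beta1`, `beta2`, `eps` values are the usual ones, not from the notes):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum-style m plus an RMSProp-style v (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```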
Learning Rate Scheduling
$$ \theta_i^{t+1} \gets \theta_i^t-\frac{\eta^t}{\sigma_i^t}g_i^t $$
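Two common forms for the time-dependent $\eta^t$ (decay and warm-up); the exact formulas here are illustrative assumptions:

```python
import numpy as np

# Decay: eta^t shrinks as training proceeds (an Adagrad-style 1/sqrt(t) decay).
def decay(eta0, t):
    return eta0 / np.sqrt(t + 1)

# Warm-up: start small, ramp up, then hold (warmup_steps is an assumed value).
def warmup(eta0, t, warmup_steps=100):
    return eta0 * min(1.0, (t + 1) / warmup_steps)
```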