Chapter 4: Supervised Learning
Acknowledgment: Most of the knowledge comes from Yuan Yang's course "Machine Learning".
Linear Regression
This is just a classic statistical task.
By least squares, \(w^* = (X^\top X)^{-1}X^{\top}Y\). But the inverse is expensive to compute when the dimension is large.
We can use gradient descent. Define the square loss:
\[L(f, x_i, y_i) = \frac{1}{2}(f(x_i)-y_i)^2 \]
What if the task is not regression but classification?
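The least-squares setup above can be sketched in code; the toy data and step size below are assumptions for illustration. Gradient descent on the square loss recovers the closed-form \(w^*\):

```python
import numpy as np

# Toy data (assumed for illustration): y = 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = X @ np.array([2.0, -3.0]) + 0.01 * rng.normal(size=100)

# Closed-form least squares: w* = (X^T X)^{-1} X^T Y
# (solve the normal equations instead of forming the inverse)
w_star = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient descent on L = (1/2N) * sum_i (w^T x_i - y_i)^2
w = np.zeros(2)
eta = 0.1
for _ in range(1000):
    grad = X.T @ (X @ w - Y) / len(Y)   # gradient of the average square loss
    w -= eta * grad

print(w_star)   # close to [2, -3]
print(w)        # converges to w_star
```

Note that even here the code solves the normal equations rather than explicitly inverting \(X^\top X\), which is both cheaper and numerically safer.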
Consider binary classification first. A naive way is to use positive and negative outputs to represent the 2 classes. However, this hard threshold is piecewise constant, so its gradient carries no meaningful information.
Approach 1: with no usable derivative, design an algorithm that needs no gradient information: the perceptron.
Approach 2: soften the hard function so that a derivative exists: logistic regression.
Perceptron
Intuition: adding \(x\) to \(w\) makes \(w^\top x\) larger.
Limitation: it can only learn linear functions, so it does not converge if the data is not linearly separable.
Update rule: adjust \(w\) only when it makes a mistake. If \(y > 0\) and \(w^\top x<0\):
\[w = w + x \]
If \(y < 0\) and \(w^\top x>0\):
\[w = w - x \]
Combine these 2 cases into one: if \(y_i w^\top x_i < 0\) \((y_i \in \{-1, +1\})\),
\[w = w + y_ix_i \]
Convergence Proof:
Assume \(\|w^*\| = 1\) and \(\exists ~\gamma > 0 ~~ \text{s.t.} ~~ \forall ~ i ~~ y_i \langle w^*, x_i\rangle \ge \gamma\).
And \(\|x_i\| \le R\). Then the perceptron makes at most \(\frac{R^2}{\gamma^2}\) mistakes.
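Before the proof, a runnable sketch of the perceptron loop; the toy linearly separable data is an assumption for illustration:

```python
import numpy as np

# Toy linearly separable data (assumed): label = sign(w_true . x)
rng = np.random.default_rng(1)
w_true = np.array([1.0, -1.0])
X = rng.normal(size=(50, 2))
y = np.sign(X @ w_true)

w = np.zeros(2)
mistakes = 0
converged = False
while not converged:               # loop until a full pass makes no mistake
    converged = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:     # mistake (or exactly on the boundary)
            w += yi * xi           # perceptron update: w = w + y_i x_i
            mistakes += 1
            converged = False

print(mistakes)                    # finite, bounded by R^2 / gamma^2
```

The loop terminates because the data is separable; on non-separable data it would cycle forever, which is the limitation noted above.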
Start from \(w_0 = 0\). Each mistake adds \(y_t\langle w^*, x_t\rangle \ge \gamma\) to \(\langle w_t, w^*\rangle\), so telescoping over \(t\) mistakes gives
\[\langle w_{t+1}, w^*\rangle \ge t\gamma \]
Then, by Cauchy–Schwarz and \(\|w^*\| = 1\):
\[\|w_{t+1}\| = \|w_{t+1}\| \|w^*\| \ge\langle w_{t+1}, w^*\rangle\ge t\gamma \]
On the other hand:
\[\begin{align*} \|w_{t+1}\|^2 &= \|w_{t} + y_t x_t\|^2 \\ &= \|w_{t}\|^2 + \|y_t x_t\|^2 + 2\langle y_tx_t, w_t\rangle\\ &\le \|w_{t}\|^2 + \|y_t x_t\|^2 ~~\text{(update only when mistake)}\\ &\le \|w_{t}\|^2 + R^2 ~~ \text{(R-condition)} \end{align*} \]
Telescoping:
\[ \|w_{t+1}\|^2 \le tR^2 \]
So
\[\begin{align*} t^2\gamma^2 &\le \|w_{t+1}\|^2 \le tR^2\\ t &\le \frac{R^2}{\gamma^2} \end{align*} \]
Logistic Regression
Turn the classification problem into a regression problem over probabilities.
Instead of using a sign function, we can output a probability. Here comes an important idea we already used in matrix completion: relaxation!
Make the hard function \(\text{sign}(z)\) soft with the sigmoid:
\[\sigma(z) = \frac{1}{1+e^{-z}} \]
It remains to define a loss function. Neither the L1 nor the L2 loss works well here, so we use cross entropy:
\[L(y, p) = -\sum_i y_i \log p_i \]
Explanation:
- We already know the actual probability distribution \(y\).
- We measure the difference between \(p_i\) and \(y_i\).
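Putting the sigmoid and cross entropy together gives a trainable classifier. A minimal sketch for the binary case; the toy data and hyperparameters are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data (assumed), labels in {0, 1}
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)

# Gradient descent on the binary cross-entropy loss
# L = -(1/N) sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ],  p_i = sigmoid(w^T x_i)
w = np.zeros(2)
eta = 0.5
for _ in range(2000):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)   # same form as the linear-regression gradient
    w -= eta * grad

acc = np.mean((X @ w > 0) == (y == 1))
print(acc)   # close to 1.0 on this separable toy data
```

A pleasant coincidence worth noticing: the gradient \(\frac{1}{N}\sum_i (p_i - y_i)x_i\) has exactly the same shape as the linear-regression gradient, with the prediction passed through the sigmoid.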
Feature Learning
Linear regression/classification can learn everything!
- If the features are correct
In general, linear regression is good enough, but feature learning is hard.
Deep learning is also called “representation learning” because it learns features automatically.
- The last step of deep learning is always linear regression/classification!
Regularization
This is a trick to avoid overfitting.
Ridge regression
\[\begin{align*} \min L &= \frac{1} {2N} \sum_i (w^Tx_i - y_i)^2 + \frac{\lambda} {2} \|w \|_2^2\\ \nabla_w L &= \frac{1} {N} \sum_i (w^Tx_i - y_i)x_i + \lambda w\\ H &= \frac{1}{N}\sum_ix_ix_i^T + \lambda I \succeq \lambda I \end{align*} \]
This makes the loss \(\lambda\)-strongly convex.
An intuitive illustration of how this works: split each gradient descent step into two parts.
- Part 1: Same as linear regression
- Part 2: “Shrink” every coordinate by \((1-\eta \lambda)\) (weight decay is a very important trick today)
The two parts reach an equilibrium when they “cancel” out.
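This split can be checked numerically. A one-step sketch (toy data assumed): the direct ridge gradient step equals a plain linear-regression step applied to a \((1-\eta\lambda)\)-shrunk \(w\):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=100)

eta, lam = 0.1, 0.5
w = rng.normal(size=3)

# One ridge gradient step, written directly ...
grad_ridge = X.T @ (X @ w - Y) / len(Y) + lam * w
w_direct = w - eta * grad_ridge

# ... and the same step split into "shrink" + "linear regression step":
grad_plain = X.T @ (X @ w - Y) / len(Y)
w_split = (1 - eta * lam) * w - eta * grad_plain   # Part 2 (weight decay) + Part 1

print(np.allclose(w_direct, w_split))   # True: the two forms are identical
```

This identity is why L2 regularization and weight decay coincide for plain gradient descent (they differ for adaptive optimizers, but that is beyond these notes).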
However, ridge regression cannot select important features; it is essentially linear regression + weight decay.
Although the \(w\) vector will have a smaller norm, every feature may still get some (possibly very small) weight.
If we need to find important features, we need to optimize:
\[\min L = \frac{1} {2N} \sum_i (w^Tx_i - y_i)^2 \quad \text{while keeping } \|w\|_1 \text{ small} \]
This is LASSO: the L1 norm (a convex relaxation of counting nonzero entries) pushes unimportant coordinates of \(w\) to exactly zero.
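One standard solver for this L1-regularized objective is proximal gradient descent (ISTA) with soft-thresholding; the algorithm choice, toy data, and hyperparameters below are assumptions, not from the notes:

```python
import numpy as np

# Toy data (assumed): only features 0 and 3 actually matter
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
Y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink toward 0, clip small values to exactly 0
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# ISTA: gradient step on the square loss, then soft-threshold for lambda * ||w||_1
w = np.zeros(6)
eta, lam = 0.1, 0.2
for _ in range(2000):
    grad = X.T @ (X @ w - Y) / len(Y)
    w = soft_threshold(w - eta * grad, eta * lam)

print(np.round(w, 2))   # sparse: only coordinates 0 and 3 end up nonzero
```

Unlike ridge, the soft-threshold step sets small coordinates to exactly zero, which is precisely the feature-selection behavior ridge regression lacks.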