note_02
Keywords: Classification, Logistic Regression, Overfitting, Regularization
1 Motivation
Classification:
- "binary classification": \(y\) can only be one of two values
- class / category
Try using linear regression for this task:
It seems to work.
However, when another sample point is added:
The \(x\) value corresponding to the threshold 0.5 moves to the right, which is worse because some points are now misclassified.
Logistic regression
- Can be used to handle this situation.
- Despite its name, it is actually used for binary classification problems.
2 Logistic Regression
2.1 Concept
sigmoid / logistic function: \(g(z)=\frac{1}{1+e^{-z}}\), \(0<g(z)<1\)
Logistic regression:
\[f_{\vec{w},b}(\vec{x})=g(\vec{w}\cdot\vec{x}+b)=\frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}} \]Interpretation of logistic regression: the probability that the class or label \(y\) equals \(1\), given the input \(\vec{x}\).
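A minimal NumPy sketch of the sigmoid function and the logistic regression model (the function and variable names here are my own, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}); output is always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """f_{w,b}(x) = g(w . x + b): estimated probability that y = 1 for each row of X."""
    return sigmoid(X @ w + b)

# Example: 3 samples with 2 features each
X = np.array([[1.0, 2.0], [2.0, 0.5], [0.0, 1.5]])
w = np.array([0.8, -0.4])
b = 0.1
print(predict_proba(X, w, b))   # three probabilities between 0 and 1
```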
2.2 Decision Boundary
2.2.1 Threshold
2.2.2 Linear
2.2.3 Non-linear
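A small sketch of how the decision boundary follows from the 0.5 threshold: predicting \(y=1\) whenever \(f_{\vec{w},b}(\vec{x})\ge 0.5\) is the same as checking \(\vec{w}\cdot\vec{x}+b\ge 0\), and with polynomial features the boundary \(\vec{w}\cdot\vec{x}+b=0\) can be non-linear (the helper names below are placeholders of my own):

```python
import numpy as np

def predict(X, w, b):
    """Predict 1 when g(w.x + b) >= 0.5, which is exactly when w.x + b >= 0."""
    z = X @ w + b
    return (z >= 0).astype(int)

# Linear boundary: w.x + b = 0 is a straight line in feature space.
# Non-linear boundary: add polynomial features, e.g. x1^2 + x2^2 - 1 = 0 is a circle.
def circle_features(X):
    """Map (x1, x2) -> (x1^2, x2^2) so a linear boundary in the new space is a circle."""
    return X ** 2

X = np.array([[0.2, 0.3], [1.5, 1.2]])
w = np.array([1.0, 1.0])
b = -1.0
print(predict(circle_features(X), w, b))  # [0, 1]: inside vs. outside the unit circle
```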
2.3 Cost Function
2.3.1 Squared Error Cost
How to choose \(\vec{w}\) and \(b\) for logistic regression:
The squared error cost function is not a good choice:
Because it has many local minima and is not as smooth as the "soup bowl" shape from linear regression:
2.3.2 Logistic Loss
We set the Logistic Loss Function as:
\[J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^mL(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) \\ L(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) = \begin{equation} \left\{ \begin{array}{lr} -\log(f_{\vec{w},b}(\vec{x}^{(i)})), & y^{(i)}=1\\ -\log(1-f_{\vec{w},b}(\vec{x}^{(i)})), & y^{(i)}=0 \end{array} \right. \end{equation} \]To understand it:
We can see the cost curve is much better:
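A short NumPy sketch of this piecewise loss, just to make the two branches concrete (the function name and example values are my own):

```python
import numpy as np

def logistic_loss(f, y):
    """Piecewise logistic loss: -log(f) when y = 1, -log(1 - f) when y = 0."""
    return np.where(y == 1, -np.log(f), -np.log(1.0 - f))

# Confident correct predictions cost little; confident wrong ones cost a lot.
f = np.array([0.9, 0.1, 0.9, 0.1])   # model outputs f_{w,b}(x)
y = np.array([1,   1,   0,   0  ])   # true labels
print(logistic_loss(f, y))           # small, large, large, small
```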
2.3.3 Simplified Cost Function
We can write the loss function as:
\[L(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) = - y^{(i)}\times \log(f_{\vec{w},b}(\vec{x}^{(i)})) - (1-y^{(i)})\times \log(1-f_{\vec{w},b}(\vec{x}^{(i)})) \]The cost function can then be simplified as:
\[\begin{equation} \begin{aligned} J(\vec{w},b) &= \frac{1}{m}\sum_{i=1}^mL(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) \\ &= -\frac{1}{m}\sum_{i=1}^m[ y^{(i)}\times \log(f_{\vec{w},b}(\vec{x}^{(i)})) + (1-y^{(i)})\times \log(1-f_{\vec{w},b}(\vec{x}^{(i)})) ] \end{aligned} \end{equation} \]2.4 Gradient Descent
Find \(\vec{w}\) and \(b\) to minimize the cost function.
Given a new \(\vec{x}\), output
\[P(y=1|\vec{x};\vec{w},b)=f_{\vec{w},b}(\vec{x})=\frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}} \]2.4.1 Implementation
Remarkably, the partial derivatives of this cost function have the same form as those of linear regression.
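A minimal NumPy sketch of the cost, the gradients, and the gradient descent loop (a rough implementation under the notation above; the function names and the learning-rate/iteration choices are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b):
    """Simplified logistic cost J(w, b), averaged over the m training examples."""
    f = sigmoid(X @ w + b)
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

def compute_gradients(X, y, w, b):
    """dJ/dw_j = (1/m) * sum((f - y) * x_j); dJ/db = (1/m) * sum(f - y)."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y
    dj_dw = (X.T @ err) / m
    dj_db = np.sum(err) / m
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha=0.1, num_iters=1000):
    """Simultaneously update w and b for a fixed number of iterations."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Tiny example: points with x > 1.5 are labeled 1
X = np.array([[0.5], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = gradient_descent(X, y, np.zeros(1), 0.0)
print(compute_cost(X, y, w, b))   # cost decreases toward 0 as the fit improves
```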
Derivation
Here is the derivation:
\[\begin{equation} \begin{aligned} \frac{\partial}{\partial{w_j}}J(\vec{w},b) &= \frac{\partial}{\partial{w_j}} \{-\frac{1}{m}\sum_{i=1}^m[ y^{(i)}\times \log(f_{\vec{w},b}(\vec{x}^{(i)})) + (1-y^{(i)})\times \log(1-f_{\vec{w},b}(\vec{x}^{(i)})) ]\}\\ &= -\frac{1}{m}\sum_{i=1}^m\{ [ \frac{y^{(i)}}{f_{\vec{w},b}(\vec{x}^{(i)})} - \frac{1-y^{(i)}}{1-f_{\vec{w},b}(\vec{x}^{(i)})} ] \times \frac{\partial f_{\vec{w},b}}{\partial{w_j}} \}\\ &= -\frac{1}{m}\sum_{i=1}^m[ \frac{y^{(i)}-f_{\vec{w},b}}{f_{\vec{w},b}(1-f_{\vec{w},b})} \times \frac{e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}\cdot{x_j^{(i)}}}{(1+e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)})^2} ]\\ &= \frac{1}{m}\sum_{i=1}^m[ (f_{\vec{w},b}-y^{(i)}){x_j^{(i)}} \times \frac{e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}}{ f_{\vec{w},b} \times \frac{e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}}{1+e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}} \times \frac{1}{f_{\vec{w},b}} \times (1+e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}) } ]\\ &= \frac{1}{m}\sum_{i=1}^m(f_{\vec{w},b}-y^{(i)}){x_j^{(i)}} \end{aligned} \end{equation} \]where the fourth step uses \(1-f_{\vec{w},b}=\frac{e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}}{1+e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}}\) and \(\frac{1}{f_{\vec{w},b}}=1+e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}\), so the large fraction reduces to \(1\). The derivative with respect to \(b\) is similar:
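For completeness, a sketch of the corresponding result for \(b\): the only change is that \(\frac{\partial f_{\vec{w},b}}{\partial{b}}\) carries no \(x_j^{(i)}\) factor, so
\[\frac{\partial}{\partial{b}}J(\vec{w},b) = \frac{1}{m}\sum_{i=1}^m(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}) \]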
3 Overfitting
3.1 Concept
underfitting
- doesn't fit the training set well
- high bias
just right
- fits the training set pretty well
- generalization
overfitting
- fits the training set extremely well
- high variance
3.2 Addressing Overfitting
Method 1: Collect more training examples
Method 2: Feature Selection - Select features to include / exclude
Method 3: Regularization
- a gentler approach
- Keeps all features while preventing any single one from having an overly large effect
Whether or not \(b\) is regularized makes little difference in practice.
4 Regularization
4.1 Concept
Modify the cost function so that, intuitively, keeping the parameters \(w_j\) small also makes the modified cost smaller; minimizing it therefore pushes the weights toward small values.
So, set the cost function as:
\[J(\vec{w},b)=\frac{1}{2m}\sum_{i=1}^{m}{(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})^2} + \frac{\lambda}{2m}{\sum_{j=1}^{n}{w_j^2}} \]- \(\lambda\): regularization parameter, positive
- \(n\): number of features
- \(b\) can be included or excluded
Here you can see that if \(\lambda\) is too small (say \(0\)), the penalty has no effect and the model overfits; and if \(\lambda\) is too big (say \(10^{10}\)), all \(w_j\) are pushed close to zero and the model underfits.
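A small NumPy sketch of this regularized cost for linear regression (the function name and example values are my own):

```python
import numpy as np

def regularized_cost(X, y, w, b, lam):
    """Mean squared error plus (lambda / 2m) * sum of w_j^2 (b is not regularized here)."""
    m = X.shape[0]
    err = X @ w + b - y
    mse = np.sum(err ** 2) / (2 * m)
    reg = (lam / (2 * m)) * np.sum(w ** 2)
    return mse + reg

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])
y = np.array([3.0, 5.0, 8.0])
w = np.array([1.0, 1.0])
b = 0.0
print(regularized_cost(X, y, w, b, lam=0.0))  # no penalty
print(regularized_cost(X, y, w, b, lam=1.0))  # same fit, larger cost from the penalty
```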
4.2 Regularized Linear Regression
Gradient descent
repeat{
\[w_j := w_j-\alpha\frac{\partial}{\partial{w_j}}J(\vec{w},b) = w_j-\alpha[ \frac{1}{m} {\sum_{i=1}^m(f_{\vec{w},b}(\vec{x}^{(i)})-{y}^{(i)})x_j^{(i)}+\frac{\lambda}{m}w_j}] \\ b := b-\alpha\frac{\partial}{\partial{b}}J(\vec{w},b) = b-\alpha[\frac{1}{m}\sum_{i=1}^m{(f_{\vec{w},b}(\vec{x}^{(i)})-{y}^{(i)})}] \]}
Here, rewrite the \(w_j\) update:
\[\begin{equation} \begin{aligned} w_j & = w_j-\alpha[\frac{1}{m}{\sum_{i=1}^m(f_{\vec{w},b}(\vec{x}^{(i)})-{y}^{(i)})x_j^{(i)}+\frac{\lambda}{m}w_j}] \\ & = w_j(1-\alpha\frac{\lambda}{m})-\alpha\frac{1}{m}{\sum_{i=1}^m(f_{\vec{w},b}(\vec{x}^{(i)})-{y}^{(i)})x_j^{(i)}} \end{aligned} \end{equation} \]We can see that the second term is the usual gradient descent update, while the factor \((1-\alpha\frac{\lambda}{m})\) in the first term shrinks \(w_j\) slightly on every iteration.
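A minimal NumPy sketch of one regularized gradient descent step, making the shrink factor \((1-\alpha\frac{\lambda}{m})\) explicit (the function name and example values are my own):

```python
import numpy as np

def regularized_step(X, y, w, b, alpha, lam):
    """One gradient descent step for regularized linear regression (b is not regularized)."""
    m = X.shape[0]
    err = X @ w + b - y                      # f_{w,b}(x^(i)) - y^(i)
    grad_w = (X.T @ err) / m                 # usual gradient term
    grad_b = np.sum(err) / m
    w_new = w * (1 - alpha * lam / m) - alpha * grad_w   # shrink w, then apply the usual update
    b_new = b - alpha * grad_b
    return w_new, b_new

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])
y = np.array([3.0, 5.0, 8.0])
w, b = np.array([2.0, -1.0]), 0.5
w, b = regularized_step(X, y, w, b, alpha=0.01, lam=1.0)
print(w, b)   # the shrink factor nudges each w_j toward zero every iteration
```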