Introduction to ML
definitions
- Arthur Samuel (1959): the field of study that gives computers the ability to learn without being explicitly programmed
- Tom Mitchell (1998): a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E
division
- supervised learning
- unsupervised learning
applications
- organizing computing clusters
- social network analysis
- market segmentation
- astronomical data analysis
- cocktail party problem
- autonomous helicopter
Linear Regression and Gradient Descent
- some preliminary definitions:
- \(\theta\): parameters (what the learning algorithm needs to learn)
- \(m\): the number of training examples
- \(n\): the number of features
- \(x\): input, features (define that \(x_0=1\))
- \(y\): output, target
- \((x,y)\): training example
- \((x^{(i)},y^{(i)})\): the i-th training example
- Learning algorithm:
- input: training set
- output: hypothesis \(h\) (the function used for classification/prediction)
- hypothesis \(h\):
- input: data \(x\)
- output: the prediction \(h(x)\)
- target: find \(\theta\) s.t. \(h(x) \approx y\) for training set
- transformation: find \(\argmin\limits_\theta J(\theta)\)
- \(J(\theta)\): cost function
- the representation of hypothesis:
- linear function:
- one \(x\): \(h(x)=\theta_0+\theta_1x\)
- multiple \(x\): \(h_\theta(x)=\sum\limits_{j=0}^n \theta_j x_j\)
- vector version: \(h_\theta(x)=\theta^Tx\)
- \(\theta=(\theta_0,\theta_1,\dots,\theta_n)^T \in \R^{n+1}\)
- \(x=(x_0,x_1,\dots,x_n)^T \in \R^{n+1}\)
- linear regression
- def of \(J(\theta)\): \(J(\theta)=\frac 12\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\)
- way to minimize \(J(\theta)\): gradient descent (batch gradient descent)
- start from some initial \(\theta\) (e.g., random or all zeros)
- compute the gradient of \(J(\theta)\) with respect to \(\theta\)
- update \(\theta_j := \theta_j-\alpha{\partial \over \partial\theta_j}J(\theta)\)
- \(\alpha\): learning rate (usually start with \(0.01\); try several values and use the one with the best performance)
- \(:=\): denotes assignment
- problem: slow on large datasets, since each update of \(\theta_j\) sums over all \(m\) examples
- alternative: stochastic gradient descent:
- principle: update every \(\theta_j\) in \(\theta\) using only a single training example's contribution to \(J(\theta)\)
- when to stop: when \(J(\theta)\) stops going down
- implementation:
```python
while True:                                   # stop when J(theta) stops decreasing
    for i in range(m):                        # one example at a time
        for j in range(n + 1):
            theta[j] -= alpha * (h(theta, x[i]) - y[i]) * x[i][j]
```
- the update equation in linear regression:
- \({\partial \over \partial \theta_j}J(\theta)=\sum\limits_{i=1}^m{\partial \over \partial \theta_j}\frac 12(h_\theta(x^{(i)})-y^{(i)})^2=\sum\limits_{i=1}^m\left[(h_\theta(x^{(i)})-y^{(i)})\cdot{\partial \over \partial \theta_j}(h_\theta(x^{(i)})-y^{(i)})\right]=\sum\limits_{i=1}^m x^{(i)}_j(h_\theta(x^{(i)})-y^{(i)})\)
- \(\theta_j := \theta_j-\alpha\sum\limits_{i=1}^m x_j^{(i)}(h_\theta(x^{(i)})-y^{(i)})\)
- feature of \(J(\theta)\): it is convex with no local optima (shaped like a big bowl) \(\Rightarrow\) gradient descent converges to the global optimum
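- a minimal NumPy sketch of batch gradient descent for linear regression as described above; the design matrix `X` (with a leading column of ones for \(x_0=1\)), the toy data, and the hyperparameters are illustrative assumptions:
```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Minimize J(theta) = 1/2 * sum((X @ theta - y) ** 2) by batch gradient descent."""
    theta = np.zeros(X.shape[1])                 # start from theta = 0
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y)             # sum_i (h(x^{(i)}) - y^{(i)}) x^{(i)}
        theta -= alpha * grad                    # simultaneous update of every theta_j
    return theta

# toy usage: y is roughly 2 + 3x; the column of ones plays the role of x_0 = 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x + 0.1 * rng.standard_normal(50)
print(batch_gradient_descent(X, y))              # roughly [2, 3]
```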
- closed-form solution of linear regression (normal equation):
- feature:
- only works for linear regression
- jumps to the global optimum in one step
- get the equation:
- some defs:
- \(\nabla_\theta J(\theta)=\left({\partial J \over \partial \theta_0},\dots,{\partial J \over \partial \theta_n}\right)^T\)
- \(f: \R^{m \times n} \rightarrow \R \Rightarrow \nabla_A f(A)=\left({\partial f \over \partial A_{ij}}\right)_{m \times n}\)
- \({\rm tr}A_n=\sum\limits_{i=1}^n A_{ii}\)
- \(\nabla_A {\rm tr}AB=B^T\)
- \({\rm tr}ABC={\rm tr}CAB={\rm tr}BCA\)
- \(\nabla_A {\rm tr}ABA^TC=CAB+C^TAB^T\)
- \(\nabla_{A^T} f(A)=(\nabla_A f(A))^T\)
- \(\nabla_A|A|=|A|(A^{-1})^T\)
- \(X=\big((x^{(1)})^T;\dots;(x^{(m)})^T\big) \in \R^{m \times (n+1)}\) (rows are the training inputs) \(\Rightarrow X\theta=\big(h_\theta(x^{(i)})\big)_{i=1}^m\)
- \(y=(y^{(1)},\dots,y^{(m)})^T\)
- \(J(\theta)=\frac 12(X\theta-y)^T(X\theta-y)\)
- proof: $$\begin{align}
\nabla_\theta J(\theta)&=\nabla_\theta \frac 12(X\theta-y)^T(X\theta-y)\\
&=\frac 12\nabla_\theta(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty)\\
&=\frac 12\nabla_\theta{\rm tr}(\theta^TX^TX\theta-2y^TX\theta+y^Ty)\\
&=\frac 12(\nabla_\theta{\rm tr}\,\theta\theta^TX^TX-2\nabla_\theta{\rm tr}\,y^TX\theta)\\
&=\frac 12(2X^TX\theta-2X^Ty)\\
&=X^TX\theta-X^Ty=\vec 0\\
\Rightarrow \theta&=(X^TX)^{-1}X^Ty
\end{align}$$
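- a one-line NumPy check of the normal equation just derived, \(\theta=(X^TX)^{-1}X^Ty\), using a linear solve rather than an explicit inverse; the toy `X`, `y` from the gradient-descent sketch above are assumed:
```python
import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^{-1} X^T y, via a linear solve for numerical stability
    return np.linalg.solve(X.T @ X, X.T @ y)

# e.g. normal_equation(X, y) on the toy data above should match the gradient-descent result
```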
- Non-linear regression: fit non-linear functions of the input by taking a linear combination of non-linear features
- representation: \(h_\theta(x)=\theta_0+\theta_1x+\theta_2\sqrt x+\theta_3\log x+\dots\)
- Local Weighted Regression
- terminology:
- parametric learning algorithm: fit a fixed set of parameters (\(\theta\)) to the data
- non-parametric learning algorithm: the amount of data/parameters you need to keep grows (linearly) with the size of the training set (not great for very large datasets)
- implementation: use the training data near the query point \(x\) to fit a regression locally and make the prediction (see the sketch after this block)
- formalize: fit \(\theta\) to minimize \(\sum\limits_{i=1}^m w^{(i)}(y^{(i)}-\theta^Tx^{(i)})^2\) where \(w^{(i)}=e^{-\frac{(x^{(i)}-x)^2}{2\tau^2}}\) (a Gaussian-shaped weight function)
- \(|x^{(i)}-x| \rightarrow 0\): \(w^{(i)} \rightarrow 1\)
- \(|x^{(i)}-x| \rightarrow \infty\): \(w^{(i)} \rightarrow 0\)
- \(\tau \rightarrow 0\): jagged fit
- \(\tau \rightarrow \infty\): over-smoothed fit
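- a sketch of locally weighted linear regression as formalized above: weight each training example with the Gaussian kernel around the query point, then solve the weighted least-squares problem in closed form (\(X^TWX\theta=X^TWy\)); the names and the bandwidth `tau` are illustrative:
```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression prediction at one query point.
    X: (m, n) design matrix (with the x_0 = 1 column), y: (m,), x_query: (n,)."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))   # Gaussian weights
    W = np.diag(w)
    # theta minimizes sum_i w_i (y_i - theta^T x_i)^2  =>  (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```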
- why least squares? (the assumptions below may not hold exactly, but they are accurate enough in practice)
- \(y^{(i)}=\theta^Tx^{(i)}+\varepsilon^{(i)}\) (thing assumed)
- \(\varepsilon^{(i)}\): error(unmodeled features, random noise, ...)
- \(\varepsilon^{(i)} \sim N(0,\sigma^2)\) (thing assumed)
- \(P(\varepsilon^{(i)})=\frac 1{\sqrt{2\pi}\sigma}e^{-{(\varepsilon^{(i)})^2 \over 2\sigma^2}}\)
- \(P(y^{(i)}|x^{(i)};\theta)=\frac 1{\sqrt{2\pi}\sigma}e^{-{(y^{(i)}-\theta^T x^{(i)})^2 \over 2\sigma^2}}\)
- "\(;\)": parametrized by
- \(y|x,\theta\): \(y\) is conditioned by \(x,\theta\)
- \(y|x;\theta\): \(y\) is conditioned by \(x\) and parametrized by \(\theta\)
- representation in another way: \((y^{(i)}|x^{(i)};\theta) \sim N(\theta^Tx^{(i)},\sigma^2)\)
- "\(;\)": parametrized by
- IID (independent and identically distributed): the error terms \(\varepsilon^{(i)}\) are independent of each other and all drawn from the same distribution (thing assumed)
- Likelihood of parameters: \(L(\theta)=P(y|x;\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)=\prod\limits_{i=1}^m \frac 1{\sqrt{2\pi}\sigma}e^{-{(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}}\)
- \(P(y|x;\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)\): requires the IID assumption across the training examples
- \(l(\theta)=\log L(\theta)=\log\prod\limits_{i=1}^m \frac 1{\sqrt{2\pi}\sigma}e^{-{(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}}=\sum\limits_{i=1}^m (\log\frac 1{\sqrt{2\pi}\sigma}+\log e^{-{(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}})=m\log\frac 1{\sqrt{2\pi}\sigma}-\sum\limits_{i=1}^m {(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}\)
- MLE: maximum likelihood estimation
- target: choose \(\theta\) to maximize \(l(\theta) \Rightarrow\) choose \(\theta\) to minimize \(\frac 12\sum\limits_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2=J(\theta)\)
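- a small numeric illustration of the equivalence just derived, on made-up 1-D data: since \(l(\theta)=m\log\frac 1{\sqrt{2\pi}\sigma}-J(\theta)/\sigma^2\), the grid point that maximizes \(l(\theta)\) is exactly the one that minimizes \(J(\theta)\) (the data, grid, and \(\sigma\) are illustrative):
```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=100)
y = 3 * x + 0.2 * rng.standard_normal(100)       # y = theta * x + Gaussian noise
sigma = 0.2

thetas = np.linspace(0, 6, 601)
J = np.array([0.5 * np.sum((t * x - y) ** 2) for t in thetas])
l = len(x) * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - J / sigma ** 2

# the grid point maximizing l(theta) is exactly the one minimizing J(theta)
assert np.argmax(l) == np.argmin(J)
```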
- classification:
- binary classification: dataset whose \(y \in \{0,1\}\)
- fitting it with linear regression works badly
- logistic regression:
- want: \(h_\theta(x) \in [0,1]\)
- define: \(g(z)=\frac 1{1+e^{-z}}\) (sigmoid function/logistic function)
- increasing
- values in \((0,1)\)
- \(h_\theta(x)=g(\theta^Tx)=\frac 1{1+e^{-\theta^Tx}}\)
- assume:
- \(P(y=1|x;\theta)=h_\theta(x)\)
- \(P(y=0|x;\theta)=1-h_\theta(x)\)
- combination: \(P(y|x;\theta)=h_\theta(x)^y(1-h_\theta(x))^{1-y}\)
- MLE:
- \(L(\theta)=P(y|x;\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)=\prod\limits_{i=1}^m h_\theta(x^{(i)})^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}\)
- \(l(\theta)=\log L(\theta)=\sum\limits_{i=1}^m (y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)})))\)
- batch gradient ascent: \(\theta_j := \theta_j+\alpha{\partial \over \partial\theta_j}l(\theta)\)
- no local optima: \(l(\theta)\) is concave, so there is only the global maximum
- difference from linear regression: maximize \(l(\theta)\) rather than minimize \(J(\theta)\) (hence the \(+\) instead of \(-\), and \(l\) instead of \(J\))
- result: \(\theta_j :=\theta_j+\alpha \sum\limits_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
- why it looks the same as linear regression: the definition of \(x_j^{(i)}\) doesn't change, but \(h_\theta(x)\) has changed (the sigmoid is hidden inside it), so only the surface form of the update is identical
- there is no closed-form solution analogous to the normal equation
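- a minimal sketch of batch gradient ascent for logistic regression using the update just derived; the data layout and hyperparameters are illustrative assumptions:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.01, n_iters=10000):
    """Batch gradient ascent on l(theta); X is (m, n) with an x_0 = 1 column, y in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)             # h_theta(x^{(i)}) for every example
        theta += alpha * X.T @ (y - h)     # theta_j += alpha * sum_i (y_i - h_i) x_j^{(i)}
    return theta                           # alpha may need tuning for a given dataset
```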
- Newton's Method
- advantage: sometimes much faster than gradient descent
- Has: \(f\)
- want: find \(\theta\ {\sf s.t.} f(\theta)=0\)
- also applied to maximization, by finding where \(l'(\theta)=0\)
- update: \(\theta^{(t+1)} := \theta^{(t)}-{f(\theta^{(t)}) \over f'(\theta^{(t)})}\)
- property: quadratic convergence (the error is roughly squared at every step)
- \(0.01\) error \(\rightarrow 0.0001\) error \(\rightarrow 0.00000001\) error(each arrow need one step)
- update when \(\theta \in \R^{n+1}\): \(\theta^{(t+1)} := \theta^{(t)}-H^{-1}\nabla_\theta l(\theta)\)
- \(H\): the Hessian matrix (\(\in \R^{(n+1) \times (n+1)}\))
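- a sketch of both forms of Newton's method above: scalar root finding, and the vector update applied to maximizing the logistic-regression log-likelihood (whose Hessian has the closed form \(-X^TSX\) with \(S=\mathrm{diag}(h(1-h))\)); function names and iteration counts are illustrative:
```python
import numpy as np

def newton_root(f, f_prime, theta0, n_iters=10):
    """Scalar Newton's method for solving f(theta) = 0."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - f(theta) / f_prime(theta)
    return theta

def newton_logistic(X, y, n_iters=10):
    """theta := theta - H^{-1} grad l(theta).  For the logistic log-likelihood,
    H = -X^T S X with S = diag(h(1-h)), so the step is theta += solve(X^T S X, grad)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (y - h)                        # gradient of l(theta)
        S = np.diag(h * (1 - h))
        theta += np.linalg.solve(X.T @ S @ X, grad)
    return theta
```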
Perceptron & Generalized Linear Models
- Perceptron algorithm (the same form of update rule as logistic regression, but with a hard-threshold \(g\)):
- \(g(z)=\begin{cases}1, z \geq 0\\0, z<0\end{cases}\)
- \(h_\theta(x)=g(\theta^Tx)\)
- update rule: \(\theta_j := \theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
- \(y^{(i)}-h_\theta(x^{(i)})\): a scalar
- \(0\): prediction is right
- \(\pm 1\): prediction is wrong
- \(1: y^{(i)}=1\)
- \(-1: y^{(i)}=0\)
- \(\theta\): the normal vector of the decision boundary (points on one side are classified as \(0\), points on the other side as \(1\))
- \(\Delta \theta_j = \alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\), i.e., \(\theta\) is nudged by \(\pm\alpha x^{(i)}\) on a mistake
- can't solve classification problems where the classes are not separable by a hyperplane through the origin
- when to stop is usually decided by hand, since the algorithm may never converge
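- a sketch of the perceptron update rule above; since convergence is not guaranteed, the loop just runs a fixed number of epochs (an arbitrary choice here):
```python
import numpy as np

def perceptron(X, y, alpha=1.0, n_epochs=20):
    """Perceptron with labels y in {0, 1}; the number of epochs is chosen by hand,
    since the algorithm may never converge on non-separable data."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            h = 1.0 if x_i @ theta >= 0 else 0.0   # hard-threshold g
            theta += alpha * (y_i - h) * x_i       # no change when the prediction is right
    return theta
```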
- exponential family: a class of probability distributions
- PDF (probability density function): \(P(y;\eta)=b(y)e^{\eta^TT(y)-a(\eta)}={b(y)e^{\eta^TT(y)} \over e^{a(\eta)}}\)
- \(y\): data
- \(\eta\): natural parameter
- \(T(y)\): sufficient statistic (\(T(y)=y\) in this lecture)
- \(b(y)\): base measure
- \(a(\eta)\): log-partition function
- some example distributions:
- Bernoulli Distribution (over binary data)
- \(\phi\): the probability that the event happens (\(P(y=1)=\phi\))
- \(P(y;\phi)=\phi^y(1-\phi)^{1-y}=e^{\log(\phi^y(1-\phi)^{1-y})}=e^{y\log\frac\phi{1-\phi}+\log(1-\phi)}\)
- \(b(y)=1\)
- \(\eta=\log\frac\phi{1-\phi} \Rightarrow \phi=\frac 1{1+e^{-\eta}}\)
- \(T(y)=y\)
- \(a(\eta)=-\log(1-\phi)=-\log(1-\frac 1{1+e^{-\eta}})=\log(1+e^\eta)\)
- Gaussian (with fixed variance) (over real data)
- Assume \(\sigma^2=1\)
- \(P(y;\mu)=\frac 1{\sqrt{2\pi}}e^{-\frac{(y-\mu)^2}2}=\frac 1{\sqrt{2\pi}}e^{-\frac{y^2}2}e^{\mu y-\frac 12\mu^2}\)
- \(b(y)=\frac 1{\sqrt{2\pi}}e^{-\frac{y^2}2}\)
- \(\eta=\mu\)
- \(T(y)=y\)
- \(a(\eta)=\frac{\mu^2}2=\frac{\eta^2}2\)
- Poisson (over counts, i.e., non-negative integers)
- Gamma, Exponential (over positive real numbers)
- Beta, Dirichlet (over probability distributions)
- the nice mathematical properties:
- the log-likelihood with respect to \(\eta\) is concave; equivalently the NLL (negative log likelihood) is convex
- \(E(y;\eta)=\frac\partial{\partial\eta}a(\eta), D(y;\eta)={\partial^2 \over \partial\eta^2}a(\eta)\)
- why this is good: most expectations and variances need integration, but here they only need differentiation
- \(\eta\) is a vector \(\Rightarrow\) partial becomes Hessian
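- a quick numeric check of this property for the Bernoulli member worked out above, where \(a(\eta)=\log(1+e^\eta)\): its first and second derivatives should equal the Bernoulli mean \(\phi\) and variance \(\phi(1-\phi)\) (the particular \(\eta\) is arbitrary):
```python
import numpy as np

eta = 0.7                                 # arbitrary natural parameter
phi = 1.0 / (1.0 + np.exp(-eta))          # canonical parameter of the Bernoulli

a1 = np.exp(eta) / (1.0 + np.exp(eta))    # a'(eta)  for a(eta) = log(1 + e^eta)
a2 = a1 * (1.0 - a1)                      # a''(eta)

# Bernoulli mean and variance are phi and phi(1 - phi)
assert np.isclose(a1, phi) and np.isclose(a2, phi * (1 - phi))
```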
- Generalized Linear Models: (GLM)
- Assumptions/Designed Choice:
- \(y|x;\theta \sim F(\eta)\), where \(F(\eta)\) is in exponential family
- \(\eta=\theta^Tx,\theta \in \R^n,x \in \R^n\)
- at test time: output is \(E(y|x;\theta)\) \(\big(h_\theta(x)=E(y|x;\theta)\big)\)
- use: choose \(b,a,T\) based on the data
- train: find \(\argmax\limits_\theta\ \sum\limits_{i=1}^m\log P(y^{(i)};\theta^Tx^{(i)})\)
- test: \(E(y;\eta)=E(y;\theta^Tx)\)
- learning update rule: \(\theta_j := \theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
- batch gradient ascent: \(\theta_j := \theta_j+\alpha\sum\limits_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\) (the Poisson sketch after this list uses this update)
- Newton's Method: practical when the number of parameters is not too large (roughly a thousand or less), since it inverts the Hessian
- some terminologies:
- canonical response function(CRF): \(\mu=g(\eta)=E(y;\eta)\)
- canonical link function(CLF): \(\eta=g^{-1}(\mu)\)
- parameterizations:
- model param: \(\theta\)
- canonical param:
- \(\phi\) for Bernoulli
- \(\mu,\sigma^2\) for Gaussian
- \(\lambda\) for Poisson
- natural param: \(\eta\)
- link with model param: \(\eta=\theta^Tx\)
- link with canonical param: \(g/g^{-1}\) (CRF/CLF)
- the distribution of regressions:
- linear regression: Gaussian
- logistic regression: Bernoulli
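- as an example of this recipe, a sketch of Poisson regression: with \(y|x\sim\mathrm{Poisson}(e^{\theta^Tx})\) the canonical response function gives \(h_\theta(x)=e^{\theta^Tx}\), and the generic GLM update applies; the averaging over \(m\) and the hyperparameters are practical assumptions, not part of the notes:
```python
import numpy as np

def poisson_glm(X, y, alpha=0.01, n_iters=2000):
    """GLM with Poisson response: h_theta(x) = E[y|x] = exp(theta^T x)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = np.exp(X @ theta)                       # canonical response applied to eta = theta^T x
        theta += alpha * X.T @ (y - h) / len(y)     # generic GLM update, averaged over m
    return theta
```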
- visualization of GLM:
- data generation: the data is generated from distributions laid out over the \(\eta\)-\(y\) plane (compute \(\eta=\theta^Tx\), then sample \(y\) from the member distribution with natural parameter \(\eta\))
- Gaussian: the value of \(\eta\) gives the position of the mean \(\mu\) of the distribution
- Bernoulli: \(O\) is the cross point of \(x\) axis and \(\eta\)
- Softmax Regression (the GLM for the multinomial distribution, a member of the exponential family) (cross entropy)
- defs:
- \(K\): the number of classes
- data: \(x^{(i)} \in \R^n\)
- labels: \(y^{(i)} \in \{0,1\}^K\) (exactly one component is \(1\), the rest are \(0\))
- \(c\): the position \(j\) where \(y_j=1\)
- param: \(\theta_{class} \in \R^n (class \in classes)\)
- \(classes\): the set of all possible class
- \(\theta=(\theta_i^T)_K \in \R^{K \times n}\)
- representation: a set of lines (one line for each class) (one side \(\Leftrightarrow\) in the class, the other side \(\Leftrightarrow\) not in the class)
- predicted distribution(hypothesis function): \(\hat p(y)={e^{\theta_y^Tx} \over \sum\limits_{i \in classes}e^{\theta_i^Tx}}\) (exp+normalization) (a distribution over \(K\) classes)
- why exp: \(\R\) to \(\R_+\)
- why normalization: \(\R_+\) to \([0,1]\)
- target distribution: \(p(y)=\begin{cases}1,y=c\\0,y \neq c\end{cases}\)
- cross entropy: a measure of the distance between \(\hat p(y)\) and \(p(y)\) (plays the role of \(J(\theta)\) in linear regression)
- update: gradient descent on the cross-entropy loss
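- a sketch of one batch gradient-descent step for softmax regression with the cross-entropy loss just described; `Theta` is the \(K\times n\) parameter matrix, `Y` holds one-hot labels, and the shapes and step size are assumptions:
```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)       # subtract the row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_regression_step(Theta, X, Y, alpha=0.1):
    """One batch gradient-descent step on the cross-entropy loss.
    Theta: (K, n) parameters, X: (m, n) inputs, Y: (m, K) one-hot targets."""
    P = softmax(X @ Theta.T)                   # predicted distribution p_hat over K classes
    grad = (P - Y).T @ X / len(X)              # gradient of the average cross entropy
    return Theta - alpha * grad
```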
GDA & Naive Bayes
- Generative Learning Algorithm:
- basic principle: build a model for each class, then classify an input as the class whose model assigns it the highest likelihood
- formalize:
- discriminative: learn \(P(y|x)\) (or learn a direct mapping \(h_\theta(x)=0/1\))
- generative: learn \(P(x|y)\)(\(P(x|y=0)\) and \(P(x|y=1)\)) and \(P(y)\)(class prior)
- Bayes rule: \(P(y|x)={P(x|y)P(y) \over P(x)}\)
- \(P(x)=P(x|y=1)P(y=1)+P(x|y=0)P(y=0)\)
- Gaussian Discriminant Analysis (GDA): (a generative learning algorithm)
- suppose \(x \in \R^n\) (drop \(x_0=1\) convention)
- Assume \(P(x|y)\) is Gaussian
- some prerequisites:
- Multivariate Gaussian Distribution: \(z \sim N(\mu,\Sigma)\)
- \(z \in \R^n\)
- \(\mu \in \R^n\)
- \(\Sigma \in \R^{n \times n}\)
- \(E(z)=\mu\)
- \(Cov(z)=E((z-\mu)(z-\mu)^T)=E(zz^T)-E(z)E^T(z)\)
- \(P(z)=\frac 1{(2\pi)^\frac n2|\Sigma|^\frac 12}e^{-\frac 12(z-\mu)^T\Sigma^{-1}(z-\mu)}\)
- indicator function: \([{\rm true}]=1,[{\rm false}]=0\)
- GDA model:
- \(P(x|y=0)=\frac 1{(2\pi)^\frac n2|\Sigma|^\frac 12}e^{-\frac 12(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)}\)
- \(P(x|y=1)=\frac 1{(2\pi)^\frac n2|\Sigma|^\frac 12}e^{-\frac 12(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)}\)
- \(P(y)=\phi^y(1-\phi)^{1-y}\) (\(P(y=1)=\phi\))
- parameters: \(\mu_0,\mu_1,\Sigma,\phi\)
- how to fit parameters:
- maximize joint likelihood
- likelihood: \(L(\phi,\mu_0,\mu_1,\Sigma)=\prod\limits_{i=1}^m P(x^{(i)},y^{(i)};\phi,\mu_0,\mu_1,\Sigma)=\prod\limits_{i=1}^m P(x^{(i)}|y^{(i)})P(y^{(i)})\)
- \(l(\phi,\mu_0,\mu_1,\Sigma)=\log L(\phi,\mu_0,\mu_1,\Sigma)\)
- in discriminative models: maximize the conditional likelihood:
- likelihood: \(L(\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)\)
- fit result:
- \(\phi=\frac{\sum\limits_{i=1}^m y^{(i)}}m=\frac{\sum\limits_{i=1}^m [y^{(i)}=1]}m\)
- \(\mu_0={\sum\limits_{i=1}^m[y^{(i)}=0]x^{(i)} \over \sum\limits_{i=1}^m [y^{(i)}=0]}\)
- \(\mu_1={\sum\limits_{i=1}^m[y^{(i)}=1]x^{(i)} \over \sum\limits_{i=1}^m [y^{(i)}=1]}\)
- \(\Sigma=\frac 1m\sum\limits_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T\)
- prediction: \(\argmax\limits_y P(y|x)=\argmax\limits_y {P(x|y)P(y) \over P(x)}=\argmax\limits_y P(x|y)P(y)\)
- pros: quick to fit, and works well when the dataset is small
- why a single \(\Sigma\): reduces the number of parameters and makes the decision boundary linear
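- a sketch of fitting the GDA parameters with the closed-form MLE expressions listed above; `X` is an (m, n) matrix of real features and `y` labels in {0, 1} (names are illustrative):
```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for the GDA parameters; X: (m, n) real features, y in {0, 1}."""
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where((y == 1)[:, None], mu1, mu0)   # x^{(i)} - mu_{y^{(i)}}
    Sigma = centered.T @ centered / len(y)                 # single shared covariance
    return phi, mu0, mu1, Sigma
```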
- Comparison with Logistic Regression:
- formal comparison:
- GDA: (generative)
- \(x|y=0 \sim N(\mu_0,\Sigma)\)
- \(x|y=1 \sim N(\mu_1,\Sigma)\)
- \(y \sim Ber(\phi)\)
- logistic regression:
- \(P(y=1|x)=\frac 1{1+e^{-\theta^Tx}}\)
- for fixed \(\phi,\mu_0,\mu_1,\Sigma\), plot \(P(y=1|x;\phi,\mu_0,\mu_1,\Sigma)\) as a function of \(x\)
- \(P(y=1|x;\phi,\mu_0,\mu_1,\Sigma)={P(x|y=1;\mu_1,\Sigma)P(y=1;\phi) \over P(x;\phi,\mu_0,\mu_1,\Sigma)}\)
- for \(x \in \R\) this turns out to be a sigmoid-shaped curve
- so the GDA assumptions imply that \(P(y=1|x)\) has the logistic form (the converse does not hold)
- GDA is a stronger assumption than logistic regression
- GDA does better than logistic regression if its assumptions are correct
- if \(P(x|y)\) is Poisson, \(P(y=1|x)\) also turns out to be logistic
- how to choose: (general)
- large number of data \(\Rightarrow\) logistic regression
- why still use GDA: computationally efficient (the parameters have a closed form)
- Naive Bayes:
- application field: text classification
- feature vector \(x \in \{0,1\}^n\)
- \(x_i=[\) word \(i\) appears in email \(]\)
- modeling \(P(x|y)\) over all \(2^n\) possible vectors directly would need on the order of \(2^n\) parameters (too many)
- Assume \(x\) is conditionally independent given \(y\)
- \(P(x_1,x_2,...,x_n|y)=\prod\limits_{i=1}^n P(x_i|x_1,...,x_{i-1},y)=\prod\limits_{i=1}^n P(x_i|y)\)
- the conditional-independence assumption may not hold exactly, but it is not so far off that the model must be given up
- parameters:
- \(\phi_{j|y=1}=P(x_j=1|y=1)\)
- \(\phi_{j|y=0}=P(x_j=1|y=0)\)
- \(\phi_y=P(y=1)\)
- joint likelihood: \(L(\phi_y,\phi_{j|y})=\prod\limits_{i=1}^m P(x^{(i)},y^{(i)};\phi_y,\phi_{j|y})\)
- MLE:
- \(\phi_y=\frac{\sum\limits_{i=1}^m [y^{(i)}=1]}m\)
- \(\phi_{j|y=1}={\sum\limits_{i=1}^m [x_j^{(i)}=1,y^{(i)}=1] \over \sum\limits_{i=1}^m[y^{(i)}=1]}\)
- \(\phi_{j|y=0}={\sum\limits_{i=1}^m [x_j^{(i)}=1,y^{(i)}=0] \over \sum\limits_{i=1}^m[y^{(i)}=0]}\)
- actually not so bad in practice (the counts can even be updated as new data arrives at test time)
- problem: if a word was never seen in training, its estimated probabilities are \(0\), and the posterior becomes \(\frac 00\)
- solution: Laplace smoothing
- for \(X\) taking \(k\) possible values \(\{1,\dots,k\}\), estimate \(P(X=j)={\sum\limits_{i=1}^m[x^{(i)}=j]+1 \over m+k}\)
- in naive Bayes (\(\phi_{j|y=0}\) for example): \(\phi_{j|y=0}={\sum\limits_{i=1}^m[x_j^{(i)}=1,y^{(i)}=0]+1 \over \sum\limits_{i=1}^m [y^{(i)}=0]+2}\)
- applied to the multinomial case: \(P(x|y)=\prod\limits_{j=1}^n P(x_j|y)\)
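- a sketch of multivariate Bernoulli naive Bayes with the Laplace-smoothed estimates above; `X` is an (m, n) 0/1 matrix of word-presence features and `y` labels in {0, 1}, and the names are illustrative:
```python
import numpy as np

def fit_naive_bayes(X, y):
    """phi_{j|y=c} = (#{x_j = 1 in class c} + 1) / (#{class c} + 2), with Laplace smoothing."""
    phi_y = np.mean(y == 1)
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_j_y0, phi_j_y1

def predict(x, phi_y, phi_j_y0, phi_j_y1):
    """Bayes rule in log space; P(x) cancels when comparing the two classes."""
    log_p1 = np.log(phi_y) + np.sum(x * np.log(phi_j_y1) + (1 - x) * np.log(1 - phi_j_y1))
    log_p0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_j_y0) + (1 - x) * np.log(1 - phi_j_y0))
    return int(log_p1 > log_p0)
```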
- a new representation for text feature: \(x=(x_i)_n\) (multinomial event model)
- the previous feature representation: multivariate Bernoulli event model
- \(n\): the length of the text
- \(x_i\): the i-th word's index in the dictionary
- parameters:
- \(\phi_y=P(y=1)\)
- \(\phi_{k|y=0}=P(x_j=k|y=0)\) (the probability that a given word position is word \(k\) when \(y=0\); the same for every position \(j\))
- \(\phi_{k|y=1}=P(x_j=k|y=1)\)
- MLE parameters:
- \(\phi_{k|y=0}={\sum\limits_{i=1}^m \left([y^{(i)}=0]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]\right) \over \sum\limits_{i=1}^m[y^{(i)}=0]n_i}\)
- \(\phi_{k|y=1}={\sum\limits_{i=1}^m \left([y^{(i)}=1]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]\right) \over \sum\limits_{i=1}^m[y^{(i)}=1]n_i}\)
- \(\phi_y={\sum\limits_{i=1}^m [y^{(i)}=1] \over m}\)
- MLE after smoothing:
- \(\phi_{k|y=0}={\sum\limits_{i=1}^m [y^{(i)}=0]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]+1 \over \sum\limits_{i=1}^m[y^{(i)}=0]n_i+10000}\) (\(10000\) is the vocabulary size, i.e., the number of possible values of each \(x_j\))
- \(\phi_{k|y=1}={\sum\limits_{i=1}^m [y^{(i)}=1]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]+1 \over \sum\limits_{i=1}^m[y^{(i)}=1]n_i+10000}\)
- \(\phi_y={\sum\limits_{i=1}^m [y^{(i)}=1] +1\over m+2}\)
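- a sketch of the smoothed MLE above for the multinomial event model; each document is a list of word indices and `V` is the vocabulary size (the \(10000\) in the formulas); names are illustrative:
```python
import numpy as np

def fit_multinomial_nb(docs, y, V):
    """docs: list of lists of word indices; y: labels in {0, 1}; V: vocabulary size."""
    phi_y = ((y == 1).sum() + 1) / (len(y) + 2)     # smoothed class prior
    counts = np.ones((2, V))                        # +1 in every numerator (Laplace)
    totals = np.full(2, float(V))                   # +V in every denominator
    for doc, label in zip(docs, y):
        for word in doc:
            counts[label, word] += 1
        totals[label] += len(doc)
    phi_k_y = counts / totals[:, None]              # phi_k_y[c, k] = P(word k | y = c)
    return phi_y, phi_k_y
```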
- extension: another word-expressing technique: word embedding
Support Vector Machine
- optimal margin classifier
- functional margin
- Assume \(h_\theta(x)=g(\theta^Tx),g(z)=\frac 1{1+e^{-z}}\)
- Predict \(1\) if \(\theta^Tx \geq 0\) and \(0\) otherwise
- If \(y^{(i)}=1\), hope that \(\theta^Tx \gg 0\)
- If \(y^{(i)}=0\), hope that \(\theta^Tx \ll 0\)
- geometric intuition: a separating line that lies farther from the training points separates the two classes better
- what the optimal margin classifier does: find the separating line that maximizes the minimum distance to the training points (the geometric margin)
- notations:
- \(y \in \{-1,+1\}\)
- have \(h\) output values in \(\{-1,1\}\)
- \(g(z)=[z \geq 0]-[z<0]\)
- \(h_\theta(x)=g(w^Tx+b)\)
- \(x \in \R^n\) (drop \(x_0=1\) convention)
- \(b \in \R\)
- \(\theta=(\theta_0,\theta_1,\dots,\theta_n)^T\)
- \(b=\theta_0\)
- \(w=(\theta_1,\dots,\theta_n)^T\)
- functional margin of the hyperplane defined by \((w,b)\): \(\hat\gamma\)
- \(\hat\gamma^{(i)}=y^{(i)}(w^Tx^{(i)}+b)\)
- If \(y^{(i)}=1\), then want \(w^Tx^{(i)}+b \gg 0\)
- If \(y^{(i)}=-1\), then want \(w^Tx^{(i)}+b \ll 0\)
- summary: hope \(\hat\gamma^{(i)} \gg 0\)
- If \(\hat\gamma^{(i)}>0\), that means \(h(x^{(i)})=y^{(i)}\)
- functional margin with respect to the training set: \(\hat\gamma=\min\limits_{i \in \{1,\dots,m\}} \hat\gamma^{(i)}\)
- a way to cheat the functional margin: scaling \(w\) and \(b\) by the same constant scales \(\hat\gamma\) without changing the decision boundary
- solution: \((w,b) \rightarrow (\frac w{\parallel w\parallel},\frac b{\parallel w\parallel})\)
- geometric margin: the distance between \((x^{(i)},y^{(i)})\) and line \(w^Tx+b=0\)
- formalize: geometric margin of plane \((w,b)\) with \((x^{(i)},y^{(i)})\)
- \(\gamma^{(i)}={w^Tx^{(i)}+b \over \parallel w\parallel}\)
- more generally (taking the label into account): \(\gamma^{(i)}={y^{(i)}(w^Tx^{(i)}+b) \over \parallel w\parallel}\)
- \(\gamma^{(i)}={\hat\gamma^{(i)} \over \parallel w\parallel}\)
- geometric margin with training set: \(\gamma=\min\limits_i \gamma^{(i)}\)
- optimal margin classifier: choose \(w,b\) to maximize \(\gamma\)
- one implement way: \(\max\limits_{\gamma,w,b}\gamma\ {\sf s.t.}\ {y^{(i)}(w^Tx^{(i)}+b) \over \parallel w\parallel} \geq \gamma \Rightarrow \min\limits_{w,b}\parallel w\parallel^2\ {\sf s.t.}\ y^{(i)}(w^Tx^{(i)}+b) \geq 1\)
(a convex optimization problem)
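- a small sketch that evaluates the functional and geometric margins of a given \((w,b)\) over a training set with labels in \(\{-1,+1\}\), following the definitions above; it does not solve the optimization problem itself:
```python
import numpy as np

def margins(w, b, X, y):
    """Functional margin gamma_hat = min_i y_i (w^T x_i + b);
    geometric margin gamma = gamma_hat / ||w||.  Labels y are in {-1, +1}."""
    functional = np.min(y * (X @ w + b))
    return functional, functional / np.linalg.norm(w)
```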
(to be continued...)