Introduction to ML
definitions
- Arthur Samuel (1959): the field of study that gives computers the ability to learn without being explicitly programmed
- Tom Mitchell (1998): a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E
division
- supervised learning
- unsupervised learning
applications
- organizing computing clusters
- social network analysis
- market segmentation
- astronomical data analysis
- cocktail party problem
- autonomous helicopter
Linear Regression and Gradient Descent
- some preliminary definitions:
- \(\theta\): parameters (what the learning algorithm needs to learn)
- \(m\): the number of training examples
- \(n\): the number of features
- \(x\): input, features (define that \(x_0=1\))
- \(y\): output, target
- \((x,y)\): training example
- \((x^{(i)},y^{(i)})\): the i-th training example
- Learning algorithm:
- input: training set
- output: hypothesis \(h\) (the function used for classification/prediction)
- hypothesis \(h\):
- input: data \(x\)
- output: the prediction \(h(x)\)
- target: find \(\theta\) s.t. \(h(x) \approx y\) for training set
- transformation: find \(\argmin\limits_\theta J(\theta)\)
- \(J(\theta)\): cost function
- the representation of hypothesis:
- linear function:
- one \(x\): \(h(x)=\theta_0+\theta_1x\)
- multiple \(x\): \(h_\theta(x)=\sum\limits_{j=0}^n \theta_j x_j\)
- vector version: \(h_\theta(x)=\theta^Tx\)
- \(\theta=(\theta_0,\theta_1,\dots,\theta_n)^T \in \R^{n+1}\)
- \(x=(x_0,x_1,\dots,x_n)^T \in \R^{n+1}\)
- linear regression
- def of \(J(\theta)\): \(J(\theta)=\frac 12\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\)
- way to minimize \(J(\theta)\): gradient descent (batch gradient descent)
- start from some initial \(\theta\) (e.g., random or all zeros)
- compute the gradient of \(J(\theta)\) with respect to \(\theta\)
- update \(\theta_j := \theta_j-\alpha{\partial \over \partial\theta_j}J(\theta)\)
- \(\alpha\): learning rate (usually start with \(0.01\); try several values and use the one with the best performance)
- \(:=\): denotes assignment
- problem: slow on large datasets, since each update of \(\theta_j\) sums over all \(m\) examples
- alternative: stochastic gradient descent:
- principle: update every \(\theta_j\) in \(\theta\) using only a single training example's contribution to \(J(\theta)\)
- when to stop: when \(J(\theta)\) stops going down
- implementation:
```python
while True:                                   # stop when J(theta) stops decreasing
    for i in range(m):                        # one example at a time
        for j in range(n + 1):
            theta[j] -= alpha * (h(theta, x[i]) - y[i]) * x[i][j]
```
- the update equation in linear regression:
- \({\partial \over \partial \theta_j}J(\theta)=\sum\limits_{i=1}^m{\partial \over \partial \theta_j}\frac 12(h_\theta(x^{(i)})-y^{(i)})^2=\sum\limits_{i=1}^m\left[(h_\theta(x^{(i)})-y^{(i)})\cdot{\partial \over \partial \theta_j}(h_\theta(x^{(i)})-y^{(i)})\right]=\sum\limits_{i=1}^m x^{(i)}_j(h_\theta(x^{(i)})-y^{(i)})\)
- \(\theta_j := \theta_j-\alpha\sum\limits_{i=1}^m x_j^{(i)}(h_\theta(x^{(i)})-y^{(i)})\)
- feature of \(J(\theta)\): it is convex with no local optima (shaped like a big bowl) \(\Rightarrow\) gradient descent converges to the global optimum
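- a minimal NumPy sketch of batch gradient descent for linear regression as described above; the design matrix `X` (with a leading column of ones for \(x_0=1\)), the toy data, and the hyperparameters are illustrative assumptions:
```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Minimize J(theta) = 1/2 * sum((X @ theta - y) ** 2) by batch gradient descent."""
    theta = np.zeros(X.shape[1])                 # start from theta = 0
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y)             # sum_i (h(x^{(i)}) - y^{(i)}) x^{(i)}
        theta -= alpha * grad                    # simultaneous update of every theta_j
    return theta

# toy usage: y is roughly 2 + 3x; the column of ones plays the role of x_0 = 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x + 0.1 * rng.standard_normal(50)
print(batch_gradient_descent(X, y))              # roughly [2, 3]
```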
- closed-form solution of linear regression (normal equation):
- feature:
- only works for linear regression
- jumps to the global optimum in one step
- get the equation:
- some defs:
- \(\nabla_\theta J(\theta)=\left({\partial J \over \partial \theta_0},\dots,{\partial J \over \partial \theta_n}\right)^T\)
- \(f: \R^{m \times n} \rightarrow \R \Rightarrow \nabla_A f(A)=\left({\partial f \over \partial A_{ij}}\right)_{m \times n}\)
- \({\rm tr}A_n=\sum\limits_{i=1}^n A_{ii}\)
- \(\nabla_A {\rm tr}AB=B^T\)
- \({\rm tr}ABC={\rm tr}CAB={\rm tr}BCA\)
- \(\nabla_A {\rm tr}ABA^TC=CAB+C^TAB^T\)
- \(\nabla_{A^T} f(A)=(\nabla_A f(A))^T\)
- \(\nabla_A|A|=|A|(A^{-1})^T\)
- \(X=\big((x^{(1)})^T;\dots;(x^{(m)})^T\big) \in \R^{m \times (n+1)}\) (rows are the training inputs) \(\Rightarrow X\theta=\big(h_\theta(x^{(i)})\big)_{i=1}^m\)
- \(y=(y^{(1)},\dots,y^{(m)})^T\)
- \(J(\theta)=\frac 12(X\theta-y)^T(X\theta-y)\)
- proof: $$\begin{align}
\nabla_\theta J(\theta)&=\nabla_\theta \frac 12(X\theta-y)^T(X\theta-y)\\
&=\frac 12\nabla_\theta(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty)\\
&=\frac 12\nabla_\theta{\rm tr}(\theta^TX^TX\theta-2y^TX\theta+y^Ty)\\
&=\frac 12(\nabla_\theta{\rm tr}\,\theta\theta^TX^TX-2\nabla_\theta{\rm tr}\,y^TX\theta)\\
&=\frac 12(2X^TX\theta-2X^Ty)\\
&=X^TX\theta-X^Ty=\vec 0\\
\Rightarrow \theta&=(X^TX)^{-1}X^Ty
\end{align}$$
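- a one-line NumPy check of the normal equation just derived, \(\theta=(X^TX)^{-1}X^Ty\), using a linear solve rather than an explicit inverse; the toy `X`, `y` from the gradient-descent sketch above are assumed:
```python
import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^{-1} X^T y, via a linear solve for numerical stability
    return np.linalg.solve(X.T @ X, X.T @ y)

# e.g. normal_equation(X, y) on the toy data above should match the gradient-descent result
```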
- Non-linear regression: fit non-linear functions of the input by taking a linear combination of non-linear features
- representation: \(h_\theta(x)=\theta_0+\theta_1x+\theta_2\sqrt x+\theta_3\log x+\dots\)
- Local Weighted Regression
- terminology:
- parametric learning algorithm: fit a fixed set of parameters (\(\theta\)) to the data
- non-parametric learning algorithm: the amount of data/parameters you need to keep grows (linearly) with the size of the training set (not great for very large datasets)
- implementation: use the training data near the query point \(x\) to fit a regression locally and make the prediction (see the sketch after this block)
- formalize: fit \(\theta\) to minimize \(\sum\limits_{i=1}^m w^{(i)}(y^{(i)}-\theta^Tx^{(i)})^2\) where \(w^{(i)}=e^{-\frac{(x^{(i)}-x)^2}{2\tau^2}}\) (a Gaussian-shaped weight function)
- \(|x^{(i)}-x| \rightarrow 0\): \(w^{(i)} \rightarrow 1\)
- \(|x^{(i)}-x| \rightarrow \infty\): \(w^{(i)} \rightarrow 0\)
- \(\tau \rightarrow 0\): jagged fit
- \(\tau \rightarrow \infty\): over-smoothed fit
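- a sketch of locally weighted linear regression as formalized above: weight each training example with the Gaussian kernel around the query point, then solve the weighted least-squares problem in closed form (\(X^TWX\theta=X^TWy\)); the names and the bandwidth `tau` are illustrative:
```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression prediction at one query point.
    X: (m, n) design matrix (with the x_0 = 1 column), y: (m,), x_query: (n,)."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))   # Gaussian weights
    W = np.diag(w)
    # theta minimizes sum_i w_i (y_i - theta^T x_i)^2  =>  (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```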
- why least squares? (the assumptions below may not hold exactly, but they are accurate enough in practice)
- \(y^{(i)}=\theta^Tx^{(i)}+\varepsilon^{(i)}\) (thing assumed)
- \(\varepsilon^{(i)}\): error(unmodeled features, random noise, ...)
- \(\varepsilon^{(i)} \sim N(0,\sigma^2)\) (thing assumed)
- \(P(\varepsilon^{(i)})=\frac 1{\sqrt{2\pi}\sigma}e^{-{(\varepsilon^{(i)})^2 \over 2\sigma^2}}\)
- \(P(y^{(i)}|x^{(i)};\theta)=\frac 1{\sqrt{2\pi}\sigma}e^{-{(y^{(i)}-\theta^T x^{(i)})^2 \over 2\sigma^2}}\)
- "\(;\)": parametrized by
- \(y|x,\theta\): \(y\) is conditioned by \(x,\theta\)
- \(y|x;\theta\): \(y\) is conditioned by \(x\) and parametrized by \(\theta\)
- representation in another way: \((y^{(i)}|x^{(i)};\theta) \sim N(\theta^Tx^{(i)},\sigma^2)\)
- "\(;\)": parametrized by
- IID (independent and identically distributed): the error terms \(\varepsilon^{(i)}\) are independent of each other and all drawn from the same distribution (thing assumed)
- Likelihood of parameters: \(L(\theta)=P(y|x;\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)=\prod\limits_{i=1}^m \frac 1{\sqrt{2\pi}\sigma}e^{-{(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}}\)
- \(P(y|x;\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)\): requires the IID assumption across the training examples
- \(l(\theta)=\log L(\theta)=\log\prod\limits_{i=1}^m \frac 1{\sqrt{2\pi}\sigma}e^{-{(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}}=\sum\limits_{i=1}^m (\log\frac 1{\sqrt{2\pi}\sigma}+\log e^{-{(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}})=m\log\frac 1{\sqrt{2\pi}\sigma}-\sum\limits_{i=1}^m {(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}\)
- MLE: maximum likelihood estimation
- target: choose \(\theta\) to maximize \(l(\theta) \Rightarrow\) choose \(\theta\) to minimize \(\frac 12\sum\limits_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2=J(\theta)\)
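- a small numeric illustration of the equivalence just derived, on made-up 1-D data: since \(l(\theta)=m\log\frac 1{\sqrt{2\pi}\sigma}-J(\theta)/\sigma^2\), the grid point that maximizes \(l(\theta)\) is exactly the one that minimizes \(J(\theta)\) (the data, grid, and \(\sigma\) are illustrative):
```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=100)
y = 3 * x + 0.2 * rng.standard_normal(100)       # y = theta * x + Gaussian noise
sigma = 0.2

thetas = np.linspace(0, 6, 601)
J = np.array([0.5 * np.sum((t * x - y) ** 2) for t in thetas])
l = len(x) * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - J / sigma ** 2

# the grid point maximizing l(theta) is exactly the one minimizing J(theta)
assert np.argmax(l) == np.argmin(J)
```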
- classification:
- binary classification: dataset whose \(y \in \{0,1\}\)
- fitting it with linear regression works badly
- logistic regression:
- want: \(h_\theta(x) \in [0,1]\)
- define: \(g(z)=\frac 1{1+e^{-z}}\) (sigmoid function/logistic function)
- increasing
- values in \((0,1)\)
- \(h_\theta(x)=g(\theta^Tx)=\frac 1{1+e^{-\theta^Tx}}\)
- assume:
- \(P(y=1|x;\theta)=h_\theta(x)\)
- \(P(y=0|x;\theta)=1-h_\theta(x)\)
- combination: \(P(y|x;\theta)=h_\theta(x)^y(1-h_\theta(x))^{1-y}\)
- MLE:
- \(L(\theta)=P(y|x;\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)=\prod\limits_{i=1}^m h_\theta(x^{(i)})^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}\)
- \(l(\theta)=\log L(\theta)=\sum\limits_{i=1}^m (y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)})))\)
- batch gradient ascent: \(\theta_j := \theta_j+\alpha{\partial \over \partial\theta_j}l(\theta)\)
- no local optima: \(l(\theta)\) is concave, so there is only the global maximum
- difference from linear regression: maximize \(l(\theta)\) rather than minimize \(J(\theta)\) (hence the \(+\) instead of \(-\), and \(l\) instead of \(J\))
- result: \(\theta_j :=\theta_j+\alpha \sum\limits_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
- why it looks the same as linear regression: the definition of \(x_j^{(i)}\) doesn't change, but \(h_\theta(x)\) has changed (the sigmoid is hidden inside it), so only the surface form of the update is identical
- there is no closed-form solution analogous to the normal equation
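- a minimal sketch of batch gradient ascent for logistic regression using the update just derived; the data layout and hyperparameters are illustrative assumptions:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.01, n_iters=10000):
    """Batch gradient ascent on l(theta); X is (m, n) with an x_0 = 1 column, y in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)             # h_theta(x^{(i)}) for every example
        theta += alpha * X.T @ (y - h)     # theta_j += alpha * sum_i (y_i - h_i) x_j^{(i)}
    return theta                           # alpha may need tuning for a given dataset
```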
- Newton's Method
- advantage: sometimes much faster than gradient descent
- Has: \(f\)
- want: find \(\theta\ {\sf s.t.} f(\theta)=0\)
- also applied to maximization, by finding where \(l'(\theta)=0\)
- update: \(\theta^{(t+1)} := \theta^{(t)}-{f(\theta^{(t)}) \over f'(\theta^{(t)})}\)
- property: quadratic convergence (the error is roughly squared at every step)
- \(0.01\) error \(\rightarrow 0.0001\) error \(\rightarrow 0.00000001\) error(each arrow need one step)
- update when \(\theta \in \R^{n+1}\): \(\theta^{(t+1)} := \theta^{(t)}-H^{-1}\nabla_\theta l(\theta)\)
- \(H\): the Hessian matrix (\(\in \R^{(n+1) \times (n+1)}\))
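- a sketch of both forms of Newton's method above: scalar root finding, and the vector update applied to maximizing the logistic-regression log-likelihood (whose Hessian has the closed form \(-X^TSX\) with \(S=\mathrm{diag}(h(1-h))\)); function names and iteration counts are illustrative:
```python
import numpy as np

def newton_root(f, f_prime, theta0, n_iters=10):
    """Scalar Newton's method for solving f(theta) = 0."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - f(theta) / f_prime(theta)
    return theta

def newton_logistic(X, y, n_iters=10):
    """theta := theta - H^{-1} grad l(theta).  For the logistic log-likelihood,
    H = -X^T S X with S = diag(h(1-h)), so the step is theta += solve(X^T S X, grad)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (y - h)                        # gradient of l(theta)
        S = np.diag(h * (1 - h))
        theta += np.linalg.solve(X.T @ S @ X, grad)
    return theta
```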
Perceptron & Generalized Linear Models
- Perceptron algorithm (the same form of update rule as logistic regression, but with a hard-threshold \(g\)):
- \(g(z)=\begin{cases}1, z \geq 0\\0, z<0\end{cases}\)
- \(h_\theta(x)=g(\theta^Tx)\)
- update rule: \(\theta_j := \theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
- \(y^{(i)}-h_\theta(x^{(i)})\): a scalar
- \(0\): prediction is right
- \(\pm 1\): prediction is wrong
- \(1: y^{(i)}=1\)
- \(-1: y^{(i)}=0\)
- \(\theta\): the normal vector of the decision boundary (points on one side are classified as \(0\), points on the other side as \(1\))
- \(\Delta \theta_j = \alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\), i.e., \(\theta\) is nudged by \(\pm\alpha x^{(i)}\) on a mistake
- can't solve classification problems where the classes are not separable by a hyperplane through the origin
- when to stop is usually decided by hand, since the algorithm may never converge
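- a sketch of the perceptron update rule above; since convergence is not guaranteed, the loop just runs a fixed number of epochs (an arbitrary choice here):
```python
import numpy as np

def perceptron(X, y, alpha=1.0, n_epochs=20):
    """Perceptron with labels y in {0, 1}; the number of epochs is chosen by hand,
    since the algorithm may never converge on non-separable data."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            h = 1.0 if x_i @ theta >= 0 else 0.0   # hard-threshold g
            theta += alpha * (y_i - h) * x_i       # no change when the prediction is right
    return theta
```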
- exponential family: a class of probability distributions
- PDF (probability density function): \(P(y;\eta)=b(y)e^{\eta^TT(y)-a(\eta)}={b(y)e^{\eta^TT(y)} \over e^{a(\eta)}}\)
- \(y\): data
- \(\eta\): natural parameter
- \(T(y)\): sufficient statistic (\(T(y)=y\) in this lecture)
- \(b(y)\): base measure
- \(a(\eta)\): log-partition function
- some example distributions:
- Bernoulli Distribution (over binary data)
- \(\phi\): the probability that the event happens (\(P(y=1)=\phi\))
- \(P(y;\phi)=\phi^y(1-\phi)^{1-y}=e^{\log(\phi^y(1-\phi)^{1-y})}=e^{y\log\frac\phi{1-\phi}+\log(1-\phi)}\)
- \(b(y)=1\)
- \(\eta=\log\frac\phi{1-\phi} \Rightarrow \phi=\frac 1{1+e^{-\eta}}\)
- \(T(y)=y\)
- \(a(\eta)=-\log(1-\phi)=-\log(1-\frac 1{1+e^{-\eta}})=\log(1+e^\eta)\)
- Gaussian (with fixed variance) (over real data)
- Assume \(\sigma^2=1\)
- \(P(y;\mu)=\frac 1{\sqrt{2\pi}}e^{-\frac{(y-\mu)^2}2}=\frac 1{\sqrt{2\pi}}e^{-\frac{y^2}2}e^{\mu y-\frac 12\mu^2}\)
- \(b(y)=\frac 1{\sqrt{2\pi}}e^{-\frac{y^2}2}\)
- \(\eta=\mu\)
- \(T(y)=y\)
- \(a(\eta)=\frac{\mu^2}2=\frac{\eta^2}2\)
- Poisson (over counts, i.e., non-negative integers)
- Gamma, Exponential (over positive real numbers)
- Beta, Dirichlet (over probability distributions)
- the nice mathematical properties:
- the log-likelihood with respect to \(\eta\) is concave; equivalently the NLL (negative log likelihood) is convex
- \(E(y;\eta)=\frac\partial{\partial\eta}a(\eta), D(y;\eta)={\partial^2 \over \partial\eta^2}a(\eta)\)
- why this is good: most expectations and variances need integration, but here they only need differentiation
- \(\eta\) is a vector \(\Rightarrow\) partial becomes Hessian
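- a quick numeric check of this property for the Bernoulli member worked out above, where \(a(\eta)=\log(1+e^\eta)\): its first and second derivatives should equal the Bernoulli mean \(\phi\) and variance \(\phi(1-\phi)\) (the particular \(\eta\) is arbitrary):
```python
import numpy as np

eta = 0.7                                 # arbitrary natural parameter
phi = 1.0 / (1.0 + np.exp(-eta))          # canonical parameter of the Bernoulli

a1 = np.exp(eta) / (1.0 + np.exp(eta))    # a'(eta)  for a(eta) = log(1 + e^eta)
a2 = a1 * (1.0 - a1)                      # a''(eta)

# Bernoulli mean and variance are phi and phi(1 - phi)
assert np.isclose(a1, phi) and np.isclose(a2, phi * (1 - phi))
```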
- Generalized Linear Models: (GLM)
- Assumptions/Designed Choice:
- \(y|x;\theta \sim F(\eta)\), where \(F(\eta)\) is in exponential family
- \(\eta=\theta^Tx,\theta \in \R^n,x \in \R^n\)
- at test time: output is \(E(y|x;\theta)\) \(\big(h_\theta(x)=E(y|x;\theta)\big)\)
- use: choose \(b,a,T\) based on the data
- train: find \(\argmax\limits_\theta\ \sum\limits_{i=1}^m\log P(y^{(i)};\theta^Tx^{(i)})\)
- test: \(E(y;\eta)=E(y;\theta^Tx)\)
- learning update rule: \(\theta_j := \theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
- batch gradient ascent: \(\theta_j := \theta_j+\alpha\sum\limits_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\) (the Poisson sketch after this list uses this update)
- Newton's Method: practical when the number of parameters is not too large (roughly a thousand or less), since it inverts the Hessian
- some terminologies:
- canonical response function(CRF): \(\mu=g(\eta)=E(y;\eta)\)
- canonical link function(CLF): \(\eta=g^{-1}(\mu)\)
- parameterizations:
- model param: \(\theta\)
- canonical param:
- \(\phi\) for Bernoulli
- \(\mu,\sigma^2\) for Gaussian
- \(\lambda\) for Poisson
- natural param: \(\eta\)
- link with model param: \(\eta=\theta^Tx\)
- link with canonical param: \(g/g^{-1}\) (CRF/CLF)
- the distribution of regressions:
- linear regression: Gaussian
- logistic regression: Bernoulli
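- as an example of this recipe, a sketch of Poisson regression: with \(y|x\sim\mathrm{Poisson}(e^{\theta^Tx})\) the canonical response function gives \(h_\theta(x)=e^{\theta^Tx}\), and the generic GLM update applies; the averaging over \(m\) and the hyperparameters are practical assumptions, not part of the notes:
```python
import numpy as np

def poisson_glm(X, y, alpha=0.01, n_iters=2000):
    """GLM with Poisson response: h_theta(x) = E[y|x] = exp(theta^T x)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = np.exp(X @ theta)                       # canonical response applied to eta = theta^T x
        theta += alpha * X.T @ (y - h) / len(y)     # generic GLM update, averaged over m
    return theta
```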
- visualization of GLM:
- data generation: the data is generated from distributions laid out over the \(\eta\)-\(y\) plane (compute \(\eta=\theta^Tx\), then sample \(y\) from the member distribution with natural parameter \(\eta\))
- Gaussian: the value of \(\eta\) gives the position of the mean \(\mu\) of the distribution
- Bernoulli: \(O\) is the cross point of \(x\) axis and \(\eta\)
- Softmax Regression (the GLM for the multinomial distribution, a member of the exponential family) (cross entropy)
- defs:
- \(K\): the number of classes
- data: \(x^{(i)} \in \R^n\)
- labels: \(y^{(i)} \in \{0,1\}^K\) (exactly one component is \(1\), the rest are \(0\))
- \(c\): the position \(j\) where \(y_j=1\)
- param: \(\theta_{class} \in \R^n (class \in classes)\)
- \(classes\): the set of all possible class
- \(\theta=(\theta_i^T)_K \in \R^{K \times n}\)
- representation: a set of lines (one line for each class) (one side \(\Leftrightarrow\) in the class, the other side \(\Leftrightarrow\) not in the class)
- predicted distribution(hypothesis function): \(\hat p(y)={e^{\theta_y^Tx} \over \sum\limits_{i \in classes}e^{\theta_i^Tx}}\) (exp+normalization) (a distribution over \(K\) classes)
- why exp: \(\R\) to \(\R_+\)
- why normalization: \(\R_+\) to \([0,1]\)
- target distribution: \(p(y)=\begin{cases}1,y=c\\0,y \neq c\end{cases}\)
- cross entropy: a measure of the distance between \(\hat p(y)\) and \(p(y)\) (plays the role of \(J(\theta)\) in linear regression)
- update: gradient descent on the cross-entropy loss
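- a sketch of one batch gradient-descent step for softmax regression with the cross-entropy loss just described; `Theta` is the \(K\times n\) parameter matrix, `Y` holds one-hot labels, and the shapes and step size are assumptions:
```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)       # subtract the row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_regression_step(Theta, X, Y, alpha=0.1):
    """One batch gradient-descent step on the cross-entropy loss.
    Theta: (K, n) parameters, X: (m, n) inputs, Y: (m, K) one-hot targets."""
    P = softmax(X @ Theta.T)                   # predicted distribution p_hat over K classes
    grad = (P - Y).T @ X / len(X)              # gradient of the average cross entropy
    return Theta - alpha * grad
```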
GDA & Naive Bayes
- Generative Learning Algorithm:
- basic principle: build a model for each class, then classify an input as the class whose model assigns it the highest likelihood
- formalize:
- discriminative: learn \(P(y|x)\) (or learn a direct mapping \(h_\theta(x)=0/1\))
- generative: learn \(P(x|y)\)(\(P(x|y=0)\) and \(P(x|y=1)\)) and \(P(y)\)(class prior)
- Bayes rule: \(P(y|x)={P(x|y)P(y) \over P(x)}\)
- \(P(x)=P(x|y=1)P(y=1)+P(x|y=0)P(y=0)\)
- Gaussian Discriminant Analysis (GDA): (a generative learning algorithm)
- suppose \(x \in \R^n\) (drop \(x_0=1\) convention)
- Assume \(P(x|y)\) is Gaussian
- some prerequisites:
- Multivariate Gaussian Distribution: \(z \sim N(\mu,\Sigma)\)
- \(z \in \R^n\)
- \(\mu \in \R^n\)
- \(\Sigma \in \R^{n \times n}\)
- \(E(z)=\mu\)
- \(Cov(z)=E((z-\mu)(z-\mu)^T)=E(zz^T)-E(z)E^T(z)\)
- \(P(z)=\frac 1{(2\pi)^\frac n2|\Sigma|^\frac 12}e^{-\frac 12(z-\mu)^T\Sigma^{-1}(z-\mu)}\)
- indicator function: \([{\rm true}]=1,[{\rm false}]=0\)
- GDA model:
- \(P(x|y=0)=\frac 1{(2\pi)^\frac n2|\Sigma|^\frac 12}e^{-\frac 12(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)}\)
- \(P(x|y=1)=\frac 1{(2\pi)^\frac n2|\Sigma|^\frac 12}e^{-\frac 12(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)}\)
- \(P(y)=\phi^y(1-\phi)^{1-y}\) (\(P(y=1)=\phi\))
- parameters: \(\mu_0,\mu_1,\Sigma,\phi\)
- how to fit parameters:
- maximize joint likelihood
- likelihood: \(L(\phi,\mu_0,\mu_1,\Sigma)=\prod\limits_{i=1}^m P(x^{(i)},y^{(i)};\phi,\mu_0,\mu_1,\Sigma)=\prod\limits_{i=1}^m P(x^{(i)}|y^{(i)})P(y^{(i)})\)
- \(l(\phi,\mu_0,\mu_1,\Sigma)=\log L(\phi,\mu_0,\mu_1,\Sigma)\)
- in discriminative models: maximize the conditional likelihood:
- likelihood: \(L(\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)\)
- fit result:
- \(\phi=\frac{\sum\limits_{i=1}^m y^{(i)}}m=\frac{\sum\limits_{i=1}^m [y^{(i)}=1]}m\)
- \(\mu_0={\sum\limits_{i=1}^m[y^{(i)}=0]x^{(i)} \over \sum\limits_{i=1}^m [y^{(i)}=0]}\)
- \(\mu_1={\sum\limits_{i=1}^m[y^{(i)}=1]x^{(i)} \over \sum\limits_{i=1}^m [y^{(i)}=1]}\)
- \(\Sigma=\frac 1m\sum\limits_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T\)
- prediction: \(\argmax\limits_y P(y|x)=\argmax\limits_y {P(x|y)P(y) \over P(x)}=\argmax\limits_y P(x|y)P(y)\)
- pros: quick to fit, and works well when the dataset is small
- why a single \(\Sigma\): reduces the number of parameters and makes the decision boundary linear
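- a sketch of fitting the GDA parameters with the closed-form MLE expressions listed above; `X` is an (m, n) matrix of real features and `y` labels in {0, 1} (names are illustrative):
```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for the GDA parameters; X: (m, n) real features, y in {0, 1}."""
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where((y == 1)[:, None], mu1, mu0)   # x^{(i)} - mu_{y^{(i)}}
    Sigma = centered.T @ centered / len(y)                 # single shared covariance
    return phi, mu0, mu1, Sigma
```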
- Comparison with Logistic Regression:
- formal comparison:
- GDA: (generative)
- \(x|y=0 \sim N(\mu_0,\Sigma)\)
- \(x|y=1 \sim N(\mu_1,\Sigma)\)
- \(y \sim Ber(\phi)\)
- logistic regression:
- \(P(y=1|x)=\frac 1{1+e^{-\theta^Tx}}\)
- for fixed \(\phi,\mu_0,\mu_1,\Sigma\), plot \(P(y=1|x;\phi,\mu_0,\mu_1,\Sigma)\) as a function of \(x\)
- \(P(y=1|x;\phi,\mu_0,\mu_1,\Sigma)={P(x|y=1;\mu_1,\Sigma)P(y=1;\phi) \over P(x;\phi,\mu_0,\mu_1,\Sigma)}\)
- for \(x \in \R\) this turns out to be a sigmoid-shaped curve
- so the GDA assumptions imply that \(P(y=1|x)\) has the logistic form (the converse does not hold)
- GDA is a stronger assumption than logistic regression
- GDA does better than logistic regression if its assumptions are correct
- if \(P(x|y)\) is Poisson, \(P(y=1|x)\) also turns out to be logistic
- how to choose: (general)
- large number of data \(\Rightarrow\) logistic regression
- why still use GDA: computationally efficient (the parameters have a closed form)
- Naive Bayes:
- application field: text classification
- feature vector \(x \in \{0,1\}^n\)
- \(x_i=[\) word \(i\) appears in email \(]\)
- modeling \(P(x|y)\) over all \(2^n\) possible vectors directly would need on the order of \(2^n\) parameters (too many)
- Assume \(x\) is conditionally independent given \(y\)
- \(P(x_1,x_2,...,x_n|y)=\prod\limits_{i=1}^n P(x_i|x_1,...,x_{i-1},y)=\prod\limits_{i=1}^n P(x_i|y)\)
- the conditional-independence assumption may not hold exactly, but it is not so far off that the model must be given up
- parameters:
- \(\phi_{j|y=1}=P(x_j=1|y=1)\)
- \(\phi_{j|y=0}=P(x_j=1|y=0)\)
- \(\phi_y=P(y=1)\)
- joint likelihood: \(L(\phi_y,\phi_{j|y})=\prod\limits_{i=1}^m P(x^{(i)},y^{(i)};\phi_y,\phi_{j|y})\)
- MLE:
- \(\phi_y=\frac{\sum\limits_{i=1}^m [y^{(i)}=1]}m\)
- \(\phi_{j|y=1}={\sum\limits_{i=1}^m [x_j^{(i)}=1,y^{(i)}=1] \over \sum\limits_{i=1}^m[y^{(i)}=1]}\)
- \(\phi_{j|y=0}={\sum\limits_{i=1}^m [x_j^{(i)}=1,y^{(i)}=0] \over \sum\limits_{i=1}^m[y^{(i)}=0]}\)
- actually not so bad in practice (the counts can even be updated as new data arrives at test time)
- problem: if a word was never seen in training, its estimated probabilities are \(0\), and the posterior becomes \(\frac 00\)
- solution: Laplace smoothing
- for \(X\) taking \(k\) possible values \(\{1,\dots,k\}\), estimate \(P(X=j)={\sum\limits_{i=1}^m[x^{(i)}=j]+1 \over m+k}\)
- in naive Bayes (\(\phi_{j|y=0}\) for example): \(\phi_{j|y=0}={\sum\limits_{i=1}^m[x_j^{(i)}=1,y^{(i)}=0]+1 \over \sum\limits_{i=1}^m [y^{(i)}=0]+2}\)
- applied to the multinomial case: \(P(x|y)=\prod\limits_{j=1}^n P(x_j|y)\)
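- a sketch of multivariate Bernoulli naive Bayes with the Laplace-smoothed estimates above; `X` is an (m, n) 0/1 matrix of word-presence features and `y` labels in {0, 1}, and the names are illustrative:
```python
import numpy as np

def fit_naive_bayes(X, y):
    """phi_{j|y=c} = (#{x_j = 1 in class c} + 1) / (#{class c} + 2), with Laplace smoothing."""
    phi_y = np.mean(y == 1)
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_j_y0, phi_j_y1

def predict(x, phi_y, phi_j_y0, phi_j_y1):
    """Bayes rule in log space; P(x) cancels when comparing the two classes."""
    log_p1 = np.log(phi_y) + np.sum(x * np.log(phi_j_y1) + (1 - x) * np.log(1 - phi_j_y1))
    log_p0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_j_y0) + (1 - x) * np.log(1 - phi_j_y0))
    return int(log_p1 > log_p0)
```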
- a new representation for text feature: \(x=(x_i)_n\) (multinomial event model)
- the previous feature representation: multivariate Bernoulli event model
- \(n\): the length of the text
- \(x_i\): the i-th word's index in the dictionary
- parameters:
- \(\phi_y=P(y=1)\)
- \(\phi_{k|y=0}=P(x_j=k|y=0)\) (the probability that a given word position is word \(k\) when \(y=0\); the same for every position \(j\))
- \(\phi_{k|y=1}=P(x_j=k|y=1)\)
- MLE parameters:
- \(\phi_{k|y=0}={\sum\limits_{i=1}^m \left([y^{(i)}=0]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]\right) \over \sum\limits_{i=1}^m[y^{(i)}=0]n_i}\)
- \(\phi_{k|y=1}={\sum\limits_{i=1}^m \left([y^{(i)}=1]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]\right) \over \sum\limits_{i=1}^m[y^{(i)}=1]n_i}\)
- \(\phi_y={\sum\limits_{i=1}^m [y^{(i)}=1] \over m}\)
- MLE after smoothing:
- \(\phi_{k|y=0}={\sum\limits_{i=1}^m [y^{(i)}=0]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]+1 \over \sum\limits_{i=1}^m[y^{(i)}=0]n_i+10000}\) (\(10000\) is the vocabulary size, i.e., the number of possible values of each \(x_j\))
- \(\phi_{k|y=1}={\sum\limits_{i=1}^m [y^{(i)}=1]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]+1 \over \sum\limits_{i=1}^m[y^{(i)}=1]n_i+10000}\)
- \(\phi_y={\sum\limits_{i=1}^m [y^{(i)}=1] +1\over m+2}\)
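- a sketch of the smoothed MLE above for the multinomial event model; each document is a list of word indices and `V` is the vocabulary size (the \(10000\) in the formulas); names are illustrative:
```python
import numpy as np

def fit_multinomial_nb(docs, y, V):
    """docs: list of lists of word indices; y: labels in {0, 1}; V: vocabulary size."""
    phi_y = ((y == 1).sum() + 1) / (len(y) + 2)     # smoothed class prior
    counts = np.ones((2, V))                        # +1 in every numerator (Laplace)
    totals = np.full(2, float(V))                   # +V in every denominator
    for doc, label in zip(docs, y):
        for word in doc:
            counts[label, word] += 1
        totals[label] += len(doc)
    phi_k_y = counts / totals[:, None]              # phi_k_y[c, k] = P(word k | y = c)
    return phi_y, phi_k_y
```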
- extension: another word-expressing technique: word embedding
Support Vector Machine
- optimal margin classifier
- functional margin
- Assume \(h_\theta(x)=g(\theta^Tx),g(z)=\frac 1{1+e^{-z}}\)
- Predict \(1\) if \(\theta^Tx \geq 0\) and \(0\) otherwise
- If \(y^{(i)}=1\), hope that \(\theta^Tx \gg 0\)
- If \(y^{(i)}=0\), hope that \(\theta^Tx \ll 0\)
- geometric intuition: a separating line that lies farther from the training points separates the two classes better
- what the optimal margin classifier does: find the separating line that maximizes the minimum distance to the training points (the geometric margin)
- notations:
- \(y \in \{-1,+1\}\)
- have \(h\) output values in \(\{-1,1\}\)
- \(g(z)=[z \geq 0]-[z<0]\)
- \(h_\theta(x)=g(w^Tx+b)\)
- \(x \in \R^n\) (drop \(x_0=1\) convention)
- \(b \in \R\)
- \(\theta=(\theta_0,\theta_1,\dots,\theta_n)^T\)
- \(b=\theta_0\)
- \(w=(\theta_1,\dots,\theta_n)^T\)
- functional margin of the hyperplane defined by \((w,b)\): \(\hat\gamma\)
- \(\hat\gamma^{(i)}=y^{(i)}(w^Tx^{(i)}+b)\)
- If \(y^{(i)}=1\), then want \(w^Tx^{(i)}+b \gg 0\)
- If \(y^{(i)}=-1\), then want \(w^Tx^{(i)}+b \ll 0\)
- summary: hope \(\hat\gamma^{(i)} \gg 0\)
- If \(\hat\gamma^{(i)}>0\), that means \(h(x^{(i)})=y^{(i)}\)
- functional margin with respect to the training set: \(\hat\gamma=\min\limits_{i \in \{1,\dots,m\}} \hat\gamma^{(i)}\)
- a way to cheat the functional margin: scaling \(w\) and \(b\) by the same constant scales \(\hat\gamma\) without changing the decision boundary
- solution: \((w,b) \rightarrow (\frac w{\parallel w\parallel},\frac b{\parallel w\parallel})\)
- geometric margin: the distance between \((x^{(i)},y^{(i)})\) and line \(w^Tx+b=0\)
- formalize: geometric margin of plane \((w,b)\) with \((x^{(i)},y^{(i)})\)
- \(\gamma^{(i)}={w^Tx^{(i)}+b \over \parallel w\parallel}\)
- more generally (taking the label into account): \(\gamma^{(i)}={y^{(i)}(w^Tx^{(i)}+b) \over \parallel w\parallel}\)
- \(\gamma^{(i)}={\hat\gamma^{(i)} \over \parallel w\parallel}\)
- geometric margin with training set: \(\gamma=\min\limits_i \gamma^{(i)}\)
- optimal margin classifier: choose \(w,b\) to maximize \(\gamma\)
- one implement way: \(\max\limits_{\gamma,w,b}\gamma\ {\sf s.t.}\ {y^{(i)}(w^Tx^{(i)}+b) \over \parallel w\parallel} \geq \gamma \Rightarrow \min\limits_{w,b}\parallel w\parallel^2\ {\sf s.t.}\ y^{(i)}(w^Tx^{(i)}+b) \geq 1\)
(a convex optimization problem)
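- a small sketch that evaluates the functional and geometric margins of a given \((w,b)\) over a training set with labels in \(\{-1,+1\}\), following the definitions above; it does not solve the optimization problem itself:
```python
import numpy as np

def margins(w, b, X, y):
    """Functional margin gamma_hat = min_i y_i (w^T x_i + b);
    geometric margin gamma = gamma_hat / ||w||.  Labels y are in {-1, +1}."""
    functional = np.min(y * (X @ w + b))
    return functional, functional / np.linalg.norm(w)
```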
(to be continued...)