Chapter 4: Supervised Learning
Acknowledgment: Most of the knowledge comes from Yuan Yang's course "Machine Learning".
Linear Regression
This is just a classic statistical task.
By least squares, \(w^* = (X^\top X)^{-1}X^{\top}Y\). But the inverse is expensive to compute when the dimension is large.
We can use gradient descent. Define the square loss:
\[L(f, x_i, y_i) = \frac{1}{2}(f(x_i)-y_i)^2 \]
What if the task is not regression but classification?
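The least-squares setup above can be sketched in code; the toy data and step size below are assumptions for illustration. Gradient descent on the square loss recovers the closed-form \(w^*\):

```python
import numpy as np

# Toy data (assumed for illustration): y = 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = X @ np.array([2.0, -3.0]) + 0.01 * rng.normal(size=100)

# Closed-form least squares: w* = (X^T X)^{-1} X^T Y
# (solve the normal equations instead of forming the inverse)
w_star = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient descent on L = (1/2N) * sum_i (w^T x_i - y_i)^2
w = np.zeros(2)
eta = 0.1
for _ in range(1000):
    grad = X.T @ (X @ w - Y) / len(Y)   # gradient of the average square loss
    w -= eta * grad

print(w_star)   # close to [2, -3]
print(w)        # converges to w_star
```

Note that even here the code solves the normal equations rather than explicitly inverting \(X^\top X\), which is both cheaper and numerically safer.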
Consider binary classification first. A naive way is to use positive and negative outputs to represent the 2 classes. However, this hard threshold is piecewise constant, so its gradient carries no meaningful information.
Approach 1: with no usable derivative, design an algorithm that needs no gradient information: the perceptron.
Approach 2: soften the hard function so that a derivative exists: logistic regression.
Perceptron
Intuition: adding \(x\) to \(w\) makes \(w^\top x\) larger.
Limitation: it can only learn linear functions, so it does not converge if the data is not linearly separable.
Update rule: adjust \(w\) only when it makes a mistake. If \(y > 0\) and \(w^\top x<0\):
\[w = w + x \]
If \(y < 0\) and \(w^\top x>0\):
\[w = w - x \]
Combine these 2 cases into one: if \(y_i w^\top x_i < 0\) \((y_i \in \{-1, +1\})\),
\[w = w + y_ix_i \]
Convergence Proof:
Assume \(\|w^*\| = 1\) and \(\exists ~\gamma > 0 ~~ \text{s.t.} ~~ \forall ~ i ~~ y_i \langle w^*, x_i\rangle \ge \gamma\).
And \(\|x_i\| \le R\). Then the perceptron makes at most \(\frac{R^2}{\gamma^2}\) mistakes.
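Before the proof, a runnable sketch of the perceptron loop; the toy linearly separable data is an assumption for illustration:

```python
import numpy as np

# Toy linearly separable data (assumed): label = sign(w_true . x)
rng = np.random.default_rng(1)
w_true = np.array([1.0, -1.0])
X = rng.normal(size=(50, 2))
y = np.sign(X @ w_true)

w = np.zeros(2)
mistakes = 0
converged = False
while not converged:               # loop until a full pass makes no mistake
    converged = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:     # mistake (or exactly on the boundary)
            w += yi * xi           # perceptron update: w = w + y_i x_i
            mistakes += 1
            converged = False

print(mistakes)                    # finite, bounded by R^2 / gamma^2
```

The loop terminates because the data is separable; on non-separable data it would cycle forever, which is the limitation noted above.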
Start from \(w_0 = 0\). Each mistake adds \(y_t\langle w^*, x_t\rangle \ge \gamma\) to \(\langle w_t, w^*\rangle\), so telescoping over \(t\) mistakes gives
\[\langle w_{t+1}, w^*\rangle \ge t\gamma \]
Then, by Cauchy–Schwarz and \(\|w^*\| = 1\):
\[\|w_{t+1}\| = \|w_{t+1}\| \|w^*\| \ge\langle w_{t+1}, w^*\rangle\ge t\gamma \]
On the other hand:
\[\begin{align*} \|w_{t+1}\|^2 &= \|w_{t} + y_t x_t\|^2 \\ &= \|w_{t}\|^2 + \|y_t x_t\|^2 + 2\langle y_tx_t, w_t\rangle\\ &\le \|w_{t}\|^2 + \|y_t x_t\|^2 ~~\text{(update only when mistake)}\\ &\le \|w_{t}\|^2 + R^2 ~~ \text{(R-condition)} \end{align*} \]
Telescoping:
\[ \|w_{t+1}\|^2 \le tR^2 \]
So
\[\begin{align*} t^2\gamma^2 &\le \|w_{t+1}\|^2 \le tR^2\\ t &\le \frac{R^2}{\gamma^2} \end{align*} \]
Logistic Regression
Turn the classification problem into a regression problem over probabilities.
Instead of using a sign function, we can output a probability. Here comes an important idea we already used in matrix completion: relaxation!
Make the hard function \(\text{sign}(z)\) soft with the sigmoid:
\[\sigma(z) = \frac{1}{1+e^{-z}} \]
It remains to define a loss function. Neither the L1 nor the L2 loss works well here, so we use cross entropy:
\[L(y, p) = -\sum_i y_i \log p_i \]
Explanation:
- We already know the actual probability distribution \(y\).
- We measure the difference between \(p_i\) and \(y_i\).
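Putting the sigmoid and cross entropy together gives a trainable classifier. A minimal sketch for the binary case; the toy data and hyperparameters are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data (assumed), labels in {0, 1}
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)

# Gradient descent on the binary cross-entropy loss
# L = -(1/N) sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ],  p_i = sigmoid(w^T x_i)
w = np.zeros(2)
eta = 0.5
for _ in range(2000):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)   # same form as the linear-regression gradient
    w -= eta * grad

acc = np.mean((X @ w > 0) == (y == 1))
print(acc)   # close to 1.0 on this separable toy data
```

A pleasant coincidence worth noticing: the gradient \(\frac{1}{N}\sum_i (p_i - y_i)x_i\) has exactly the same shape as the linear-regression gradient, with the prediction passed through the sigmoid.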
Feature Learning
Linear regression/classification can learn everything!
- If the features are correct
In general, linear regression is good enough, but feature learning is hard.
Deep learning is also called “representation learning” because it learns features automatically.
- The last step of deep learning is always linear regression/classification!
Regularization
This is a trick to avoid overfitting.
Ridge regression
\[\begin{align*} \min L &= \frac{1} {2N} \sum_i (w^Tx_i - y_i)^2 + \frac{\lambda} {2} \|w \|_2^2\\ \nabla_w L &= \frac{1} {N} \sum_i (w^Tx_i - y_i)x_i + \lambda w\\ H &= \frac{1}{N}\sum_ix_ix_i^T + \lambda I \succeq \lambda I \end{align*} \]
This makes the loss \(\lambda\)-strongly convex.
An intuitive illustration of how this works: split each gradient descent step into two parts.
- Part 1: Same as linear regression
- Part 2: “Shrink” every coordinate by \((1-\eta \lambda)\) (weight decay is a very important trick today)
The two parts reach an equilibrium when they “cancel” out.
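This split can be checked numerically. A one-step sketch (toy data assumed): the direct ridge gradient step equals a plain linear-regression step applied to a \((1-\eta\lambda)\)-shrunk \(w\):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=100)

eta, lam = 0.1, 0.5
w = rng.normal(size=3)

# One ridge gradient step, written directly ...
grad_ridge = X.T @ (X @ w - Y) / len(Y) + lam * w
w_direct = w - eta * grad_ridge

# ... and the same step split into "shrink" + "linear regression step":
grad_plain = X.T @ (X @ w - Y) / len(Y)
w_split = (1 - eta * lam) * w - eta * grad_plain   # Part 2 (weight decay) + Part 1

print(np.allclose(w_direct, w_split))   # True: the two forms are identical
```

This identity is why L2 regularization and weight decay coincide for plain gradient descent (they differ for adaptive optimizers, but that is beyond these notes).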
However, ridge regression cannot select important features; it is essentially linear regression + weight decay.
Although the \(w\) vector will have a smaller norm, every feature may still get some (possibly very small) weight.
If we need to find important features, we need to optimize:
\[\min L = \frac{1} {2N} \sum_i (w^Tx_i - y_i)^2 \quad \text{while keeping } \|w\|_1 \text{ small} \]
This is LASSO: the L1 norm (a convex relaxation of counting nonzero entries) pushes unimportant coordinates of \(w\) to exactly zero.
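One standard solver for this L1-regularized objective is proximal gradient descent (ISTA) with soft-thresholding; the algorithm choice, toy data, and hyperparameters below are assumptions, not from the notes:

```python
import numpy as np

# Toy data (assumed): only features 0 and 3 actually matter
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
Y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink toward 0, clip small values to exactly 0
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# ISTA: gradient step on the square loss, then soft-threshold for lambda * ||w||_1
w = np.zeros(6)
eta, lam = 0.1, 0.2
for _ in range(2000):
    grad = X.T @ (X @ w - Y) / len(Y)
    w = soft_threshold(w - eta * grad, eta * lam)

print(np.round(w, 2))   # sparse: only coordinates 0 and 3 end up nonzero
```

Unlike ridge, the soft-threshold step sets small coordinates to exactly zero, which is precisely the feature-selection behavior ridge regression lacks.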