首页 > 其他分享 >机器学习第3章: 泛化

机器学习第3章: 泛化

时间:2024-01-22 13:34:58浏览次数:28  
标签:le frac 泛化 Big sum 学习 机器 sigma align

Chapter 3: Generalization Theory

泛化理论想解决一个什么样的问题呢?

已知\(L_{train} = \epsilon\), what can we say on \(L_D\) (population loss)?

  1. The traditional way is sampling from D again to get a test set and then get \(L_{test}\).
  2. We can use theory to get generalization bounds.

The latter is what generalization theory wants to do.

No free lunch theorem

It says that if we don’t have any prior knowledge of the data distribution, we can't design an algorithm that always works well. In other words, there is no universal learner. Well, based on this theorem, even human brains only work in certain scenarios. But this theorem considers a really weird setting(没有先验的学习label), which may not exist in practice.

Theorem 5.1 (no-free-lunch) Let \(A\) be any learning algorithm for the task of binary classification with respect to a loss within \([0, 1]\) over a domain \(X\). Let \(m\) be any number smaller than \(|X| /2\), representing a training set size. Then, there exists a distribution \(D\) over s.t.

  1. There exists a function \(f: X \rightarrow{} {0, 1}\) with \(L_D(f) = 0\)
  2. With probability of at least \(\frac{1}{7}\) over the choice of \(S \sim D^m\) (that is \(S\) is training set size \(m\)) we have that \(L_D(A(S)) \ge \frac{1}{8}\)

That is

\[P_{S \sim \mathcal D^m}\left(L_{\mathcal D}(A(S)) \geq \frac 18\right) \geq \frac 17 \]

首先我们需要明确一下这些符号的意思:

\(X\) 是原始数据的集合。 \(X \times \{0, 1\}\)是一个二分类的数据集空间,是数据集和标签集的直积。\(D\) 是对\(X \times \{0, 1\}\)这个数据集空间进行有概率地采样。也就是一个带标签的数据集。只有一些数据集是合法的,由条件1来控制。也就是至少要有一个\(f\)在这个数据集上能做到满分。

证明前我们先考虑这个条件,假设全集\(X\)大小为\(2m\),那么就有\(T=2^{2m}\) 种合法的函数(没有限制)能够把\(X\) 映射到 \(\{0, 1\}\) 这个二分类标签集。这些函数记作\(\{f_1, ..., f_T\}\), 由这些函数定义的\(D\) 记作\(\{D_1, ..., D_T\}\):

\[D_i(x_j, f_i(x_j)) = \frac{1}{2m}, \forall x_j \in X \]

也就是\(D_i\) 里面有\(2m\)条\(f_i\)标注的数据,每条等概率。

这样的\(T\) 个\(D\)上都满足条件(1),也就是存在一个全对的函数。

接下来的证明该怎么想呢?根据我们改写的这个概率形式,我们就会很自然的想到markov's inequality:

\[\Pr(x > a) \le \frac{E|x|}{a} \]

加上一个我们比较直观的intuition,就是一半的题目我见过我会做,一半的题目我只能蒙,二分类有一半概率蒙对,所以正确率是3/4

\[\begin{align*} E_{S\sim D^m} [1- L_D(A(S))] &\le \frac{3}{4} \\ \Pr (1-L_D(A(S)) > \frac{7}{8}) &\le \frac{E_{S\sim D^m} [1- L_D(A(S))]}{7/8} \le \frac{6}{7}\\ \Pr (L_D(A(S)) <\frac{1}{8}) &\le \frac{6}{7}\\ \Pr (L_D(A(S)) >\frac{1}{8}) &\ge \frac{1}{7}\\ \end{align*} \]

接下来的关键就在于证明\(E_{S\sim D^m} [1- L_D(A(S))]\le \frac{3}{4}\):

对于任意一个\(D_i\), 我都可以在其中取出\(k=(2m)^m\)次方个不同的训练集,记作\(\{S_1, ..., S_k\}\). 为了表示这个是从\(D_i\)中取出的,或者说是由\(f_i\) 标注的,加上上标\(i\), 记作\(\{S_1^i, ..., S_k^i\}\)

\[\begin{aligned} \max_{i \in [T]}\mathbb E_{S \sim \mathcal D_i^m} \Big(L_{\mathcal D_i} \big(A(S)\big) \Big) &= \max_{i \in [T]}\frac 1k \sum_{j=1}^k L_{\mathcal D_i}(A(S_j^i)) \\ &\geq \frac 1T \sum_{i=1}^T \frac 1k \sum_{j=1}^k L_{\mathcal D_i}(A(S_j^i)) \\ &\geq \min_{j \in [k]} \frac 1T \sum_{i=1}^T L_{\mathcal D_i}(A(S_j^i)) ~\text{max>均值>min} \end{aligned} \]

原本是对于不同的训练集求均值,现在是对于不同的标注方式求均值,事情就变得了简单了起来,因为训练元素固定,我们现在可以轻松的去定义,没有出现在训练集中的元素。定义这些元素为 \(\{v_1, ..., v_p\}\) 是没有出现在训练集中的元素. 因为有可能训练集多次取到了相同的元素,所以总数小于等于m, \(p \ge m\):

\[\begin{align*} L_{\mathcal D_i}(h) &= \frac 1{2m} \sum_{x \in C} [h(x) \neq f_i(x)]\\ & \geq \frac 1{2m}\sum_{r=1}^p \mathbb{1}[h(v_r) \neq f_i(v_r)] \\ & \geq \frac 1{2p} \sum_{r=1}^p \mathbb{1}[h(v_r) \neq f_i(v_r)] \end{align*} \]

So对所有\(D_i\) 求均值我们能得到:

\[\begin{align*} \frac 1T \sum_{i=1}^T L_{\mathcal D_i}(A(S_j^i)) & \ge \frac 1T \sum_{i=1}^T\frac 1{2p} \sum_{r=1}^p \mathbb{1}\Big[(A(S_j^i))(v_r) \neq f_i(v_r)\Big]\\ &= \frac 1{2} \frac{1}{p}\sum_{r=1}^p\frac 1T \sum_{i=1}^T \mathbb{1}\Big[(A(S_j^i))(v_r) \neq f_i(v_r)\Big]\\ &\ge \frac 1{2} \min_{r\in [p]} \frac 1T \sum_{i=1}^T \mathbb{1}\Big[(A(S_j^i))(v_r) \neq f_i(v_r)\Big]\\ \end{align*} \]

不管你学到的\((A(S_j^i))(v_r)\)是什么, 遍历所有 \(f_i(v_r)\), 肯定是一半概率错一半概率对,所以\(\forall r\) 我们有:

\[\frac 1T \sum_{i=1}^T \mathbb{1}\Big[(A(S_j^i))(v_r) \neq f_i(v_r)\Big] = \frac 12 \]

甚至都不用取\(\min\)就可以得到:

\[\begin{align*} \max_{i \in [T]}\mathbb E_{S \sim \mathcal D_i^m} \Big(L_{\mathcal D_i} \big(A(S)\big) \Big) &\ge \frac 1T \sum_{i=1}^T L_{\mathcal D_i}(A(S_j^i))\\ &\ge \frac 1{2} \frac 1T \sum_{i=1}^T \mathbb{1}\Big[(A(S_j^i))(v_r) \neq f_i(v_r)\Big] = \frac{1}{4} \end{align*} \]

Finite hypothesis class and ERM

Class of H, the class of functions can be finite/infinite.

\(ERM_H\): pick the hypothesis with the smallest training loss

\[ERM_H(S) \in \arg \min _{h \in H} L_S(h) \]

If H is a infinite class, \(ERM_H\) may not give the best solution because of Overfitting (memorization).

Realizability assumption

To show this \(ERM\) algorithm is powerful, we raise the Realizability assumption:

There exists \(h^∗ \in H\) s.t. \(L_{D,f} (h^∗)= 0\).

Note that this assumption implies that with probability 1 over random samples, \(S\), where the instances of \(S\) are sampled according to \(D\) and are labeled by \(f\), we have \(L_{S} (h^*) = 0\)

Later we will see that it is not always possible to pick a function that perfectly solves the problem.

corollary 2.3.

It means, that if the problem is realizable, then the ERM hypothesis is good enough. 也就是达到的采样数量,ERM学到的函数就足够好。

Let \(H\) be a finite hypothesis class. Let \(\delta \in \{0,1\}\)and \(\epsilon>0\) and let \(m\) be an integer that satisfies \(m \ge \log(\frac{|H|}{\delta})/\epsilon\). Then, for any labeling function \(f\) and for any distribution \(D\), for which the realizability assumption holds (that is, for some \(h \in H\), \(L_{D,f}(h) = 0\)), with probability of at least \(1 − \delta\) over the choice of an iid sample \(S\) of size \(m\), we have them for every ERM hypothesis, \(h_S\), it holds that:

\[L_{D,f} (h_S) ≤ \epsilon \]

Proof:

We would like to bound the probability of the event \(L_{D, f}(h_S) > \epsilon\)

Let \(H_B\) be the bad hypothesis, that is:

\[H_B = \{h \in H: L_{D, f}(h) > \epsilon \} \]

But, since the realizability assumption implies that \(L_S(h_S) = 0\), 所以ERM取到的bad hypothesis 也要在S上拿到满分。

Let \(M\) be the training set that can train a bad hypothesis:

\[M = \{S: \exist h \in H_B,L_S(h) = 0\} \]

只有取到\(M\) 中的\(S\), 才有可能训出这样的bad hypothesis

\[\begin{align*} D^m(\{S: L_{D, f}(h_S) &> \epsilon\}) \le D^m(M) \\ &= D^m(\cup_{h\in H_B}\{S: L_S(h) = 0\}) ~~\text{(定义)}\\ &\le \sum_{h\in H_B}D^m(\{S: L_S(h) = 0\})~~\text{(union bound)} \end{align*} \]

第一个不等式是因为,如果这个S上训出来的\(h_S\)不好,那起码这要是一个能训出来bad hypothesis的S,小于号因为是必要条件。

好现在对于一个fix的\(h \in H_B\), 我们来选\(S\) 中的元素。\(D\)中有大于\(\epsilon\)的元素是\(h\)会预测错的,所以不能选。只能选剩下的的那$1- \epsilon $

\[D(\{x_i: h(x_i) = y_i\}) = 1-L_{D, f}(h) \le 1- \epsilon \]

\[\begin{align*} D^m(\{S: h_S(h) = 0\}) &= D^m(\{\forall x_i \in S: h(x_i) = y_i\}) (\text{选中一些他会的例子})\\ &= \prod_{i=1}^m D(\{x_i: h(x_i) = y_i\})(\text{独立事件})\\ &\le (1-\epsilon)^m \le e^{-m} \end{align*} \]

最后:

\[\sum_{h\in H_B}D^m(\{S: h_S(h) = 0\}) \le |H_B|e^{-m} \le |H|e^{-m}(\text{这也就是为什么要finite}) \]

要让\(|H|e^{-m} \le \delta\), 解得:

\[m \ge \frac{\log(\frac{|H|}{\delta})}{\epsilon} \]

PAC learnable

It’s one kind of learnability. Essentially it tells us whether it is possible to learn a hypothesis class \(H\)

The lower bound function \(m_H\), tells you how many samples you need, and then if you have this many samples, you can find some algorithm to train on these samples, and the outcome, is good on the population distribution.

definition

A hypothesis class \(H\) is PAC learnable if there exist a function \(m_H(\epsilon, \delta)\) and a learning algorithm(not essentially ERM) with the following property:

For every \(\epsilon, \delta \in (0,1)\), for every distribution \(D\) over \(X\), and for every labeling function \(f: X → \{0,1\}\), if the realizable assumption holds with respect to \(H, D, f\), then when running the learning algorithm on \(m \ge m_H(\epsilon, \delta)\) iid examples generated by \(D\) and labeled by \(f\), the algorithm returns a hypothesis \(h\) such that, with probability of at least \(1 - \delta\) (over the choice of the examples):

\[L_{D,f}(h) \le \epsilon \]

Finite H is PAC learnable by ERM.

Agnostic PAC learnable

But don’t forget you need a realizable assumption to hold. Realizable is too strong!

The Bayes optimal predictor:

\[f(x)=\left\{ \begin{aligned} &1 ~~~~P(y = 1|x) \ge \frac{1}{2}\\ &0~~~~\text{otherwise } \end{aligned} \right. \]

Given any probability distribution, Bayes optimal predictor is the best label predicting function. No other classifier has a lower error.

So realizable assumption fails to hold if the distribution is noisy or weird. If it does not hold, it might be hopeless to have \(L_{D,f} (h) ≤ \epsilon\). But that’s fine, we may have the adjusted version of PAC learning.

It’s more general than PAC learnable. 引入了一个数据集本身的的一个loss,就是最高分也达不到满分。

Definition of Agnostic PAC learnable:

A hypothesis class \(H\) is Agnostic PAC learnable if there exist a function \(m_H(\epsilon, \delta)\) and a learning algorithm(not essentially ERM) with the following property:

For every \(\epsilon, \delta \in (0,1)\), for every distribution \(D\) over \(X\), then when running the learning algorithm on \(m \ge m_H(\epsilon, \delta)\) iid examples generated by \(D\) and labeled by \(f\), the algorithm returns a hypothesis \(h\) such that, with probability of at least \(1 - \delta\) (over the choice of the examples):

\[L_{D}(h) \le \min_{h^\star \in H}L_D(h^*) + \epsilon \]

It is more general than PAC learnable.

VC dimension

Is an infinite hypothesis class \(H\) learnable? The answer can be yes. But how can we distinguish the infinite learnable \(H\) and infinite unlearnable one? We need to introduce a VC dimension to help us.

Restriction of \(H\) to \(C\)

Let \(H\) be a class of functions from \(X\) to \(\{0,1\}\), and let \(C = \{c_1, \ldots, c_m\} \subset X\). The restriction of \(H\) to \(C\) is the set of functions from \(C\) to \(\{0,1\}\) that can be derived from \(H\). That is,

\[H_C = \{ h_{c_1}, \ldots, h_{c_m} : h \in H \} \]

where we represent each function from \(C\) to \(\{0,1\}\) as a vector in \(\{0,1\}^{|C|}\).

Shattering
A hypothesis class \(H\) shatters a finite set \(C \subset X\) if the restriction of \(H\) to \(C\) is the set of all functions from \(C\) to \(\{0,1\}\). That is \(|H_C| = 2^{|C|}\)

Corollary 6.4. like no free lunch theorem but about shattering

Corollary 6.4 tells us that if H shatters some set \(C\) of size \(2m\) then we cannot learn \(H\) using m examples. Intuitively, if a set \(C\) is shattered by \(H\), we only contain half of the instances, the labels of these instances give us no information about the labels of the rest of the instances in \(C\)

VC-dimension

The VC dimension of a hypothesis class \(H\) is the maximal size of a set C that can be shattered by \(H\). If \(H\) can shatter sets of arbitrarily large size we say that H has infinite VC-dimension.

Theorem 6.6. Let \(H\) be a class of infinite VC-dimension. Then, \(H\) is not PAC learnable.

Theorem 6.8 (the fundamental theorem of statistical learning – quantitative version).

omit here, give a coarse range of \(m_H(\epsilon, \delta)\)

VC dimension is one way to measure sample complexity. But it also has a few drawbacks.

One main problem of the VC dimension is that It is mainly used for classification (due to the shattering notion).

what shall we do for regression? We may use Rademacher complexity

Rademacher complexity

\(\epsilon\)-representative sample

A training set \(S\) is called \(\epsilon\) - representative s.t.

\[\sup_{h \in H}|L_D(h) - L_S(h)| \le \epsilon \]

We define the representativeness of \(S\) with respect to \(F\) as the largest gap between the true error of a function \(f\) and its empirical error:

\[Rep_D(F, S) = \sup_{f\in F}(L_D(f)-L_S(f)) \]

Now, suppose we would like to estimate the representativeness of \(S\) using the sample \(S\) only. One simple idea is to split \(S\) into two disjoint sets, \(S = S_1 \cup S_2\), refer to \(S_1\) as a validation set and to \(S_2\) as a training set. We can then estimate the representativeness of \(S\) by:

\[\sup_{f\in F}(L_{S_1}(f)-L_{S_2}(f)) \]

为了使这个式子变得更简单,我们于是引入的rademacher random variable:
\((\sigma_1, \sigma_2, ..., \sigma_m) \in \{-1, +1\}^m\) to be a vector s.t. \(S_1 = \{z_i = (x_i, y_i): \sigma_i = 1\}\) and \(S_2 = \{z_i = (x_i, y_i): \sigma_i = -1\}\). If we assume \(E[\sigma_i] = 0\), that is \(|S_1|=|S_2|\) then:

\[\sup_{f\in F}(L_{S_1}(f)-L_{S_2}(f)) = \frac{2}{m}\sup_{f\in F} \sum_{i=1}^m\sigma_i f(z_i) \]

Rademacher complexity captures this idea by considering the expectation of the above with respect to a random choice of \(\sigma\).
So Rademacher complexity of \(

标签:le,frac,泛化,Big,sum,学习,机器,sigma,align
From: https://www.cnblogs.com/yzc5827/p/17979842

相关文章

  • 机器学习第1章: 概述
    Chapter1:GeneralIntroductionAcknowledgment:MostoftheknowledgecomesfromYuanYang'scourse"MachineLearning".监督学习概况Supervisedlearningisanimportantsub-areaofmachinelearning.Input:\(X=(x_1,x_2,\ldots,x_N)\)Outpu......
  • 小样本学习One-shot
    2024/1/141.什么是One-shot单样本学习(One-shotlearning)是机器学习领域的一个研究方向,重点是让模型能够仅通过一个训练样本来学习信息。什么是一个训练样本:指的是模型训练过程中只使用一个或少量例子或数据点来学习一个特定类别或任务。如果实在难以理解可以找一篇论文直接......
  • Queue-Linked List Implementation【1月22日学习笔记】
    点击查看代码//Queue-LinkedListImplementation#include<iostream>usingnamespacestd;structnode{ intdata; node*next;};node*front=NULL;node*rear=NULL;//末指针·,则不用遍历整个链表,constanttimevoidEnqueue(intx){ node*temp=newnode; ......
  • 初学者如何学习编程(从2014到2023年十年编程工作总结)
    今天给大家分享一个话题,如何有效的学习编程,大家都知道,我是计算机专业毕业的,2008年开始学习编程,2014年研究生毕业后一直从事软件开发工作,先后在京东、爱奇艺、完美世界从事过软件开发工程师工作,具有十多年编程经验积累,所以我来讲这个话题,我是有发言权的,也具有一定的权威性。好的......
  • elasticsearch学习笔记1 - 安装
    本次编写es笔记是为了记录学习到的es知识点,给大家一个快速理解和方便查找的地方。一、了解一下es是什么?为什么要使用es?   因为系统一步一步运行,数据越来越多,每天产生的订单差不动2,3w的数据量,MYSQL数据的查询越来越吃力,然后领导要求能不能先办法解决一下。 然后呢,在网......
  • Power BI - 5分钟学习新增度量值
    每天5分钟,今天介绍PowerBI新增度量值在PowerBIDesktop中,你可以创建度量值。度量值用于计算表达式的结果。在创建自己的度量值时,需要使用DAX语言。DAX包括超过200个函数、运算符等,几乎可以计算任何数据分析所需的结果。下面以计算产品销售举例创建TotalSales的度量值,带大家......
  • 神经网络优化篇:详解学习率衰减(Learning rate decay)
    学习率衰减加快学习算法的一个办法就是随时间慢慢减少学习率,将之称为学习率衰减,来看看如何做到,首先通过一个例子看看,为什么要计算学习率衰减。假设要使用mini-batch梯度下降法,mini-batch数量不大,大概64或者128个样本,在迭代过程中会有噪音(蓝色线),下降朝向这里的最小值,但是不会精......
  • 一起学习Avalonia
    一起学习Avalonia(一)一起学习Avalonia(二)一起学习Avalonia(三)一起学习Avalonia(四)一起学习Avalonia补充(Linux下的使用开发)一起学习Avalonia(五)一起学习Avalonia补充(deepin下的使用开发t调试)一起学习Avalonia(六)一起学习Avalonia(七)一起学习Avalonia(八)一起学习Avalonia(九)一起学习A......
  • FastAPI学习-29 uvicorn 使用 log_config 参数设置 logger 日志格式
    前言FastAPI服务是通过uvicorn来提供的,日志都是uvicorn里配置的。官方文档地址:https://www.uvicorn.org/settings/#logginguvicorn的logging日志我们可以通过uvicorn.run()方式启动服务uvicorn.run("example:app",port=5000,reload=True,access_log=False)于是可以加......
  • FastAPI学习-30 项目代码中添加自己的日志内容
    前言前面一篇【FastAPI学习-29uvicorn使用log_config参数设置logger日志格式】已经学会了配置uvicorn的日志。如何在fastapi项目代码中添加自己的日志呢?添加日志创建一个logger实例,名称为"fast"fromfastapiimportFastAPIimportlogginglogger=logging.getLogger(......