Chapter 3: Generalization Theory
泛化理论想解决一个什么样的问题呢?
已知\(L_{train} = \epsilon\), what can we say on \(L_D\) (population loss)?
- The traditional way is sampling from D again to get a test set and then get \(L_{test}\).
- We can use theory to get generalization bounds.
The latter is what generalization theory wants to do.
No free lunch theorem
It says that if we don’t have any prior knowledge of the data distribution, we can't design an algorithm that always works well. In other words, there is no universal learner. Well, based on this theorem, even human brains only work in certain scenarios. But this theorem considers a really weird setting(没有先验的学习label), which may not exist in practice.
Theorem 5.1 (no-free-lunch) Let \(A\) be any learning algorithm for the task of binary classification with respect to a loss within \([0, 1]\) over a domain \(X\). Let \(m\) be any number smaller than \(|X| /2\), representing a training set size. Then, there exists a distribution \(D\) over s.t.
- There exists a function \(f: X \rightarrow{} {0, 1}\) with \(L_D(f) = 0\)
- With probability of at least \(\frac{1}{7}\) over the choice of \(S \sim D^m\) (that is \(S\) is training set size \(m\)) we have that \(L_D(A(S)) \ge \frac{1}{8}\)
That is
\[P_{S \sim \mathcal D^m}\left(L_{\mathcal D}(A(S)) \geq \frac 18\right) \geq \frac 17 \]首先我们需要明确一下这些符号的意思:
\(X\) 是原始数据的集合。 \(X \times \{0, 1\}\)是一个二分类的数据集空间,是数据集和标签集的直积。\(D\) 是对\(X \times \{0, 1\}\)这个数据集空间进行有概率地采样。也就是一个带标签的数据集。只有一些数据集是合法的,由条件1来控制。也就是至少要有一个\(f\)在这个数据集上能做到满分。
证明前我们先考虑这个条件,假设全集\(X\)大小为\(2m\),那么就有\(T=2^{2m}\) 种合法的函数(没有限制)能够把\(X\) 映射到 \(\{0, 1\}\) 这个二分类标签集。这些函数记作\(\{f_1, ..., f_T\}\), 由这些函数定义的\(D\) 记作\(\{D_1, ..., D_T\}\):
\[D_i(x_j, f_i(x_j)) = \frac{1}{2m}, \forall x_j \in X \]也就是\(D_i\) 里面有\(2m\)条\(f_i\)标注的数据,每条等概率。
这样的\(T\) 个\(D\)上都满足条件(1),也就是存在一个全对的函数。
接下来的证明该怎么想呢?根据我们改写的这个概率形式,我们就会很自然的想到markov's inequality:
\[\Pr(x > a) \le \frac{E|x|}{a} \]加上一个我们比较直观的intuition,就是一半的题目我见过我会做,一半的题目我只能蒙,二分类有一半概率蒙对,所以正确率是3/4
\[\begin{align*} E_{S\sim D^m} [1- L_D(A(S))] &\le \frac{3}{4} \\ \Pr (1-L_D(A(S)) > \frac{7}{8}) &\le \frac{E_{S\sim D^m} [1- L_D(A(S))]}{7/8} \le \frac{6}{7}\\ \Pr (L_D(A(S)) <\frac{1}{8}) &\le \frac{6}{7}\\ \Pr (L_D(A(S)) >\frac{1}{8}) &\ge \frac{1}{7}\\ \end{align*} \]接下来的关键就在于证明\(E_{S\sim D^m} [1- L_D(A(S))]\le \frac{3}{4}\):
对于任意一个\(D_i\), 我都可以在其中取出\(k=(2m)^m\)次方个不同的训练集,记作\(\{S_1, ..., S_k\}\). 为了表示这个是从\(D_i\)中取出的,或者说是由\(f_i\) 标注的,加上上标\(i\), 记作\(\{S_1^i, ..., S_k^i\}\)
\[\begin{aligned} \max_{i \in [T]}\mathbb E_{S \sim \mathcal D_i^m} \Big(L_{\mathcal D_i} \big(A(S)\big) \Big) &= \max_{i \in [T]}\frac 1k \sum_{j=1}^k L_{\mathcal D_i}(A(S_j^i)) \\ &\geq \frac 1T \sum_{i=1}^T \frac 1k \sum_{j=1}^k L_{\mathcal D_i}(A(S_j^i)) \\ &\geq \min_{j \in [k]} \frac 1T \sum_{i=1}^T L_{\mathcal D_i}(A(S_j^i)) ~\text{max>均值>min} \end{aligned} \]原本是对于不同的训练集求均值,现在是对于不同的标注方式求均值,事情就变得了简单了起来,因为训练元素固定,我们现在可以轻松的去定义,没有出现在训练集中的元素。定义这些元素为 \(\{v_1, ..., v_p\}\) 是没有出现在训练集中的元素. 因为有可能训练集多次取到了相同的元素,所以总数小于等于m, \(p \ge m\):
\[\begin{align*} L_{\mathcal D_i}(h) &= \frac 1{2m} \sum_{x \in C} [h(x) \neq f_i(x)]\\ & \geq \frac 1{2m}\sum_{r=1}^p \mathbb{1}[h(v_r) \neq f_i(v_r)] \\ & \geq \frac 1{2p} \sum_{r=1}^p \mathbb{1}[h(v_r) \neq f_i(v_r)] \end{align*} \]So对所有\(D_i\) 求均值我们能得到:
\[\begin{align*} \frac 1T \sum_{i=1}^T L_{\mathcal D_i}(A(S_j^i)) & \ge \frac 1T \sum_{i=1}^T\frac 1{2p} \sum_{r=1}^p \mathbb{1}\Big[(A(S_j^i))(v_r) \neq f_i(v_r)\Big]\\ &= \frac 1{2} \frac{1}{p}\sum_{r=1}^p\frac 1T \sum_{i=1}^T \mathbb{1}\Big[(A(S_j^i))(v_r) \neq f_i(v_r)\Big]\\ &\ge \frac 1{2} \min_{r\in [p]} \frac 1T \sum_{i=1}^T \mathbb{1}\Big[(A(S_j^i))(v_r) \neq f_i(v_r)\Big]\\ \end{align*} \]不管你学到的\((A(S_j^i))(v_r)\)是什么, 遍历所有 \(f_i(v_r)\), 肯定是一半概率错一半概率对,所以\(\forall r\) 我们有:
\[\frac 1T \sum_{i=1}^T \mathbb{1}\Big[(A(S_j^i))(v_r) \neq f_i(v_r)\Big] = \frac 12 \]甚至都不用取\(\min\)就可以得到:
\[\begin{align*} \max_{i \in [T]}\mathbb E_{S \sim \mathcal D_i^m} \Big(L_{\mathcal D_i} \big(A(S)\big) \Big) &\ge \frac 1T \sum_{i=1}^T L_{\mathcal D_i}(A(S_j^i))\\ &\ge \frac 1{2} \frac 1T \sum_{i=1}^T \mathbb{1}\Big[(A(S_j^i))(v_r) \neq f_i(v_r)\Big] = \frac{1}{4} \end{align*} \]Finite hypothesis class and ERM
Class of H, the class of functions can be finite/infinite.
\(ERM_H\): pick the hypothesis with the smallest training loss
\[ERM_H(S) \in \arg \min _{h \in H} L_S(h) \]If H is a infinite class, \(ERM_H\) may not give the best solution because of Overfitting (memorization).
Realizability assumption
To show this \(ERM\) algorithm is powerful, we raise the Realizability assumption:
There exists \(h^∗ \in H\) s.t. \(L_{D,f} (h^∗)= 0\).
Note that this assumption implies that with probability 1 over random samples, \(S\), where the instances of \(S\) are sampled according to \(D\) and are labeled by \(f\), we have \(L_{S} (h^*) = 0\)
Later we will see that it is not always possible to pick a function that perfectly solves the problem.
corollary 2.3.
It means, that if the problem is realizable, then the ERM hypothesis is good enough. 也就是达到的采样数量,ERM学到的函数就足够好。
Let \(H\) be a finite hypothesis class. Let \(\delta \in \{0,1\}\)and \(\epsilon>0\) and let \(m\) be an integer that satisfies \(m \ge \log(\frac{|H|}{\delta})/\epsilon\). Then, for any labeling function \(f\) and for any distribution \(D\), for which the realizability assumption holds (that is, for some \(h \in H\), \(L_{D,f}(h) = 0\)), with probability of at least \(1 − \delta\) over the choice of an iid sample \(S\) of size \(m\), we have them for every ERM hypothesis, \(h_S\), it holds that:
\[L_{D,f} (h_S) ≤ \epsilon \]Proof:
We would like to bound the probability of the event \(L_{D, f}(h_S) > \epsilon\)
Let \(H_B\) be the bad hypothesis, that is:
\[H_B = \{h \in H: L_{D, f}(h) > \epsilon \} \]But, since the realizability assumption implies that \(L_S(h_S) = 0\), 所以ERM取到的bad hypothesis 也要在S上拿到满分。
Let \(M\) be the training set that can train a bad hypothesis:
\[M = \{S: \exist h \in H_B,L_S(h) = 0\} \]只有取到\(M\) 中的\(S\), 才有可能训出这样的bad hypothesis
\[\begin{align*} D^m(\{S: L_{D, f}(h_S) &> \epsilon\}) \le D^m(M) \\ &= D^m(\cup_{h\in H_B}\{S: L_S(h) = 0\}) ~~\text{(定义)}\\ &\le \sum_{h\in H_B}D^m(\{S: L_S(h) = 0\})~~\text{(union bound)} \end{align*} \]第一个不等式是因为,如果这个S上训出来的\(h_S\)不好,那起码这要是一个能训出来bad hypothesis的S,小于号因为是必要条件。
好现在对于一个fix的\(h \in H_B\), 我们来选\(S\) 中的元素。\(D\)中有大于\(\epsilon\)的元素是\(h\)会预测错的,所以不能选。只能选剩下的的那$1- \epsilon $
\[D(\{x_i: h(x_i) = y_i\}) = 1-L_{D, f}(h) \le 1- \epsilon \]\[\begin{align*} D^m(\{S: h_S(h) = 0\}) &= D^m(\{\forall x_i \in S: h(x_i) = y_i\}) (\text{选中一些他会的例子})\\ &= \prod_{i=1}^m D(\{x_i: h(x_i) = y_i\})(\text{独立事件})\\ &\le (1-\epsilon)^m \le e^{-m} \end{align*} \]最后:
\[\sum_{h\in H_B}D^m(\{S: h_S(h) = 0\}) \le |H_B|e^{-m} \le |H|e^{-m}(\text{这也就是为什么要finite}) \]要让\(|H|e^{-m} \le \delta\), 解得:
\[m \ge \frac{\log(\frac{|H|}{\delta})}{\epsilon} \]PAC learnable
It’s one kind of learnability. Essentially it tells us whether it is possible to learn a hypothesis class \(H\)
The lower bound function \(m_H\), tells you how many samples you need, and then if you have this many samples, you can find some algorithm to train on these samples, and the outcome, is good on the population distribution.
definition
A hypothesis class \(H\) is PAC learnable if there exist a function \(m_H(\epsilon, \delta)\) and a learning algorithm(not essentially ERM) with the following property:
For every \(\epsilon, \delta \in (0,1)\), for every distribution \(D\) over \(X\), and for every labeling function \(f: X → \{0,1\}\), if the realizable assumption holds with respect to \(H, D, f\), then when running the learning algorithm on \(m \ge m_H(\epsilon, \delta)\) iid examples generated by \(D\) and labeled by \(f\), the algorithm returns a hypothesis \(h\) such that, with probability of at least \(1 - \delta\) (over the choice of the examples):
\[L_{D,f}(h) \le \epsilon \]Finite H is PAC learnable by ERM.
Agnostic PAC learnable
But don’t forget you need a realizable assumption to hold. Realizable is too strong!
The Bayes optimal predictor:
\[f(x)=\left\{ \begin{aligned} &1 ~~~~P(y = 1|x) \ge \frac{1}{2}\\ &0~~~~\text{otherwise } \end{aligned} \right. \]Given any probability distribution, Bayes optimal predictor is the best label predicting function. No other classifier has a lower error.
So realizable assumption fails to hold if the distribution is noisy or weird. If it does not hold, it might be hopeless to have \(L_{D,f} (h) ≤ \epsilon\). But that’s fine, we may have the adjusted version of PAC learning.
It’s more general than PAC learnable. 引入了一个数据集本身的的一个loss,就是最高分也达不到满分。
Definition of Agnostic PAC learnable:
A hypothesis class \(H\) is Agnostic PAC learnable if there exist a function \(m_H(\epsilon, \delta)\) and a learning algorithm(not essentially ERM) with the following property:
For every \(\epsilon, \delta \in (0,1)\), for every distribution \(D\) over \(X\), then when running the learning algorithm on \(m \ge m_H(\epsilon, \delta)\) iid examples generated by \(D\) and labeled by \(f\), the algorithm returns a hypothesis \(h\) such that, with probability of at least \(1 - \delta\) (over the choice of the examples):
\[L_{D}(h) \le \min_{h^\star \in H}L_D(h^*) + \epsilon \]It is more general than PAC learnable.
VC dimension
Is an infinite hypothesis class \(H\) learnable? The answer can be yes. But how can we distinguish the infinite learnable \(H\) and infinite unlearnable one? We need to introduce a VC dimension to help us.
Restriction of \(H\) to \(C\)
Let \(H\) be a class of functions from \(X\) to \(\{0,1\}\), and let \(C = \{c_1, \ldots, c_m\} \subset X\). The restriction of \(H\) to \(C\) is the set of functions from \(C\) to \(\{0,1\}\) that can be derived from \(H\). That is,
\[H_C = \{ h_{c_1}, \ldots, h_{c_m} : h \in H \} \]where we represent each function from \(C\) to \(\{0,1\}\) as a vector in \(\{0,1\}^{|C|}\).
Shattering
A hypothesis class \(H\) shatters a finite set \(C \subset X\) if the restriction of \(H\) to \(C\) is the set of all functions from \(C\) to \(\{0,1\}\). That is \(|H_C| = 2^{|C|}\)
Corollary 6.4. like no free lunch theorem but about shattering
Corollary 6.4 tells us that if H shatters some set \(C\) of size \(2m\) then we cannot learn \(H\) using m examples. Intuitively, if a set \(C\) is shattered by \(H\), we only contain half of the instances, the labels of these instances give us no information about the labels of the rest of the instances in \(C\)
VC-dimension
The VC dimension of a hypothesis class \(H\) is the maximal size of a set C that can be shattered by \(H\). If \(H\) can shatter sets of arbitrarily large size we say that H has infinite VC-dimension.
Theorem 6.6. Let \(H\) be a class of infinite VC-dimension. Then, \(H\) is not PAC learnable.
Theorem 6.8 (the fundamental theorem of statistical learning – quantitative version).
omit here, give a coarse range of \(m_H(\epsilon, \delta)\)
VC dimension is one way to measure sample complexity. But it also has a few drawbacks.
One main problem of the VC dimension is that It is mainly used for classification (due to the shattering notion).
what shall we do for regression? We may use Rademacher complexity
Rademacher complexity
\(\epsilon\)-representative sample
A training set \(S\) is called \(\epsilon\) - representative s.t.
\[\sup_{h \in H}|L_D(h) - L_S(h)| \le \epsilon \]We define the representativeness of \(S\) with respect to \(F\) as the largest gap between the true error of a function \(f\) and its empirical error:
\[Rep_D(F, S) = \sup_{f\in F}(L_D(f)-L_S(f)) \]Now, suppose we would like to estimate the representativeness of \(S\) using the sample \(S\) only. One simple idea is to split \(S\) into two disjoint sets, \(S = S_1 \cup S_2\), refer to \(S_1\) as a validation set and to \(S_2\) as a training set. We can then estimate the representativeness of \(S\) by:
\[\sup_{f\in F}(L_{S_1}(f)-L_{S_2}(f)) \]为了使这个式子变得更简单,我们于是引入的rademacher random variable:
\((\sigma_1, \sigma_2, ..., \sigma_m) \in \{-1, +1\}^m\) to be a vector s.t. \(S_1 = \{z_i = (x_i, y_i): \sigma_i = 1\}\) and \(S_2 = \{z_i = (x_i, y_i): \sigma_i = -1\}\). If we assume \(E[\sigma_i] = 0\), that is \(|S_1|=|S_2|\) then:
Rademacher complexity captures this idea by considering the expectation of the above with respect to a random choice of \(\sigma\).
So Rademacher complexity of \(