Mechanism of Machine Learning

1. What is Machine Learning?

Machine learning is essentially statistical learning: there is an unknown joint distribution over (x1, x2, ..., xm, y), and the task is to infer y when (x1, x2, ..., xm) is known. Since the distribution is unknown, the relation between (x1, x2, ..., xm) and y is also unknown. However, we can draw a sample from the distribution, in which every (x1, x2, ..., xm, y) is observed, obtain the relation between (x1, x2, ..., xm) and y on the sample, and then use that relation to estimate the relation between (x1, x2, ..., xm) and y for the distribution, justified by certain statistical rules.

2. So, with (X, y) known in the sample, how do we get f: X -> y of the sample?

Note: X = vector {x1, x2, ..., xm}

The answer is a universal approximator, namely a function with parameters that can fit any function to any precision by combining enough units.
Some universal approximators are:
neural network: its unit is one node in one layer, whose mathematical form is
activation(wx + b), where activation is a non-linear function such as relu or sigmoid; its purpose is to introduce non-linearity into the network. A neural network with a single layer and enough units can also fit any function, but experience shows it needs more units than a network with multiple layers; more units mean more parameters, and more parameters mean a larger estimation error (the reason will be illustrated later). So a neural network with multiple layers is usually used; this is also called deep learning.
decision forest: its unit is one node in one decision tree. Although a single decision tree with enough nodes can also fit any function, experience shows that using multiple decision trees (namely a decision forest) is better at relieving estimation error.
polynomial: its unit is $ x^k $.
With one universal approximator with parameters (this is also called a model), compute the parameters by fitting it to the sample. Namely, assuming the model is $ y_{infer} = f(x; w) $, where $ w $ is the parameter, take $ w = \underset{w}{argmin} \, |y_{true} - y_{infer}| $. So, to get the parameters, we just need to minimize $ |y_{true} - y_{infer}| $; this is also called fitting.
Minimization is also called optimization in mathematics. A smooth function, namely one that has a gradient everywhere, is easy to deal with. $ f(y_{infer}) = |y_{true} - y_{infer}| $ is not a smooth function, while $ f(y_{infer}) = (y_{true} - y_{infer})^2 $ is a smooth function, and both reach their minimum at the same $ y_{infer} $; since $ y_{infer} = f(x; w) $, they also reach it at the same $ w $. So the objective function to optimize is usually $ f(w) = (y_{true} - y_{infer})^2 $. Note that y is assumed numeric here, not categorical; the objective function to optimize for categorical y will be illustrated later. The objective function to optimize is also called the loss function in machine learning.
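As a concrete illustration of this fitting step, here is a minimal sketch (my own example, not code from the original post; the data, network size, learning rate, and number of steps are arbitrary choices): a one-hidden-layer network whose units compute activation(wx + b) is fitted to a sample by gradient descent on the squared loss $ (y_{true} - y_{infer})^2 $.

```python
import numpy as np

rng = np.random.default_rng(0)

# A sample: the learner only sees (x, y_true), not the underlying relation.
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y_true = np.sin(x)

# One hidden layer with `units` nodes; each node computes activation(w*x + b).
units = 32
W1 = rng.normal(0.0, 1.0, (1, units)); b1 = np.zeros(units)
W2 = rng.normal(0.0, 1.0, (units, 1)); b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)                     # non-linear activation introduces non-linearity
    return h @ W2 + b2, h

lr = 0.05
for step in range(5000):
    y_infer, h = forward(x)
    err = y_infer - y_true                       # d(loss)/d(y_infer), up to a constant
    gW2 = h.T @ err / len(x);  gb2 = err.mean(0) # backpropagate the squared-loss gradient
    dh = (err @ W2.T) * (1.0 - h ** 2)           # tanh'(z) = 1 - tanh(z)^2
    gW1 = x.T @ dh / len(x);   gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print("final squared loss:", float(np.mean((forward(x)[0] - y_true) ** 2)))
```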

3. With $ f: X \rightarrow y $ known for the sample, how do we estimate $ f: X \rightarrow y $ for the distribution?

3.1 PAC (Probably Approximately Correct) learnable

A model H is defined as PAC learnable if there exists a function $ m_H: (0, 1)^2 \rightarrow \mathbb{N} $ and a learning algorithm with the following property:
for every $ \epsilon, \delta \in (0, 1) $ and every distribution D over (X, y), when running the learning algorithm on $ m \geq m_H(\epsilon, \delta) $ i.i.d. examples generated by D, the algorithm returns an $ h: X \rightarrow y $ such that, with probability of at least $ 1 - \delta $ (over the choice of the m training examples), $ L_{D}(h) \leq \underset{h' \in H}{\min} L_{D}(h') + \epsilon $.

Note:
$ L_D $ : the loss over the distribution D.

3.2 Neural networks and decision forests are PAC learnable.

3.2.1 Finite VC dimension $ \Leftrightarrow $ PAC learnable in binary classification.

definition of VC dimension: the VC dimension of a model H is the maximum size of a set $ C \subset X $ that can be shattered by H.

definition of shattering: a model H shatters a finite set $ C \subset X $ if the restriction of H to C is the set of all functions from C to {0, 1}.

definition of restriction of H to C: the restriction of H to C is the set of functions from C to {0, 1} that can be derived from H, namely,
$ H_C = \{ (h(c_1), h(c_2), \ldots, h(c_m)) : h \in H \} $.
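To make these definitions concrete, here is a small illustrative sketch (my own example, not from the original post): for the class of threshold functions $ h_a(x) = 1[x \geq a] $ on the real line, the restriction to any single point contains both labelings, but the restriction to any pair of points never contains the labeling (1, 0), so the VC dimension of this class is 1.

```python
from itertools import product

def restriction_to_C(C, thresholds):
    """H_C: all labelings (h(c1), ..., h(cm)) obtainable from the threshold class."""
    return {tuple(int(c >= a) for c in C) for a in thresholds}

def is_shattered(C, thresholds):
    return restriction_to_C(C, thresholds) == set(product([0, 1], repeat=len(C)))

thresholds = [t / 10 for t in range(-50, 51)]   # a dense grid of candidate thresholds a
print(is_shattered([0.3], thresholds))          # True: any single point is shattered
print(is_shattered([0.3, 0.7], thresholds))     # False: the labeling (1, 0) is unreachable
# Hence the VC dimension of the threshold class is 1.
```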

3.2.2 The fundamental theorem of statistical learning:
Let H be a model from a domain X to {0, 1}, and let the loss function be the 0-1 loss. Assume that $ VCdim(H) = d < \infty $. Then there are absolute constants $ C_1, C_2 $ such that:
H is agnostic PAC learnable with sample complexity:
$ C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \leq m_H(\epsilon, \delta) \leq C_2 \frac{d + \log(1/\delta)}{\epsilon^2} $

Note: Though the theorem above is discussed in the book for binary classification, based on my experience it is fine to generalize it to the case where y is multi-categorical or numeric.

3.2.3 From the fundamental theorem of statistical learning, it can be seen that the sample complexity increases as the VC dimension increases. And for neural networks and decision forests, the VC dimension increases as the number of parameters increases.

For a neural network, assume there are m training examples. If y is numeric, it is known that m points can determine at most m parameters; if y is categorical, m points can put restrictions on at most m parameters. So if there are more parameters than m, say m+1, the m training examples cannot put any restriction on the (m+1)-th parameter: it can take any value without affecting how well the training examples are fitted. But fitting the distribution does place a restriction on the (m+1)-th parameter. Therefore the number of parameters should not exceed the number of examples, otherwise the learned model may not fit the distribution well at all, since the (m+1)-th parameter is effectively set arbitrarily.
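The following minimal sketch (illustrative; the sample points and the probe point are arbitrary choices) shows this: with m sample points, a polynomial with m parameters is determined uniquely, while a polynomial with m+1 parameters has a direction the sample cannot constrain, so two fits that are both perfect on the sample can disagree elsewhere.

```python
import numpy as np

m = 5
x = np.linspace(-1, 1, m)
y = np.sin(2 * x)                                  # the "true" relation on the sample points

def design_matrix(x, n_params):
    """Columns are x^0, x^1, ..., x^(n_params - 1)."""
    return np.vstack([x ** k for k in range(n_params)]).T

# m parameters (degree m-1 polynomial): the sample determines them uniquely.
w_exact = np.linalg.solve(design_matrix(x, m), y)

# m+1 parameters: the system is underdetermined; build two different solutions
# that both fit the sample perfectly.
A = design_matrix(x, m + 1)
w_min_norm, *_ = np.linalg.lstsq(A, y, rcond=None)
null_dir = np.linalg.svd(A)[2][-1]                 # a direction the sample cannot "see"
w_other = w_min_norm + 10.0 * null_dir

print(np.allclose(A @ w_min_norm, y), np.allclose(A @ w_other, y))  # True True: both fit the sample
phi = 0.37 ** np.arange(m + 1)                     # features of a point not in the sample
print(phi @ w_min_norm, phi @ w_other)             # yet the two fits disagree here
```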

3.2.4 tradeoff between larger model capacity to reduce approximation error and smaller model capacity to reduce estimation error.

$ L_D(A(S)) = [L_D(A(S)) - L_D(h^*)] + L_D(h^*) = [L_D(A(S)) - L_V(A(S))] + [L_V(A(S)) - L_S(A(S))] + L_S(A(S)) $

$ L_D(h^*) = \underset{h \in H}{\min} L_D(h) $ is defined as the approximation error. $ L_D(A(S)) - L_D(h^*) $ is defined as the estimation error.
A(S) is the model learned by the algorithm A from the sample S, within the model class H.

The goal of machine learning is to get a small enough $ L_D(A(S)) $, namely both $ L_D(h^*) $ and $ L_D(A(S)) - L_D(h^*) $ need to be small enough.
So when will $ L_D(h^*) $ be small? $ L_D(h^*) $ will be small if H contains a function that can fit the sample S well.
Of course, the larger the capacity of H, the higher the probability that it contains a function that fits the sample S well. But a larger capacity of H means more parameters, a larger VC dimension, and thus a larger sample complexity; so with a fixed sample size and fixed $ \epsilon, \delta $, the estimation error $ L_D(A(S)) - L_D(h^*) $ is larger. So we need a good tradeoff between a small $ L_D(h^*) $ and a small $ L_D(A(S)) - L_D(h^*) $, namely we need to choose a suitable capacity for H. A good way is to make $ L_D(h^*) $ small enough first; if at that point $ L_D(A(S)) - L_D(h^*) $ is large, the capacity of H is too large, so reduce the capacity little by little and see at which capacity both $ L_D(h^*) $ and $ L_D(A(S)) - L_D(h^*) $ can be small enough (a sketch of this procedure is shown below). If the best tradeoff obtainable is still not good enough, it means the sample size is insufficient: this sample size can give at most this result, and to do better we need more training examples.
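A minimal sketch of this capacity-selection procedure (illustrative: polynomial degree stands in for the capacity of H, a held-out validation set stands in as a proxy for $ L_D $, and the data and degrees are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + 0.2 * rng.normal(size=x.size)

x_tr, y_tr = x[:20], y[:20]                        # training sample S
x_va, y_va = x[20:], y[20:]                        # validation set V

for degree in [1, 3, 5, 9, 15]:                    # sweep model capacity
    w = np.polyfit(x_tr, y_tr, degree)
    loss_tr = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)   # L_S(A(S))
    loss_va = np.mean((np.polyval(w, x_va) - y_va) ** 2)   # proxy for L_D(A(S))
    print(f"degree {degree:2d}  train {loss_tr:.4f}  validation {loss_va:.4f}")

# Typically: small degree -> both losses large (underfitting); large degree -> tiny
# training loss but growing validation loss (overfitting); pick a degree in between.
```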

In general, as long as the approximation error is greater than zero, we expect the training error to grow with the sample size, as a larger amount of data points makes it harder to provide an explanation for all of them. On the other hand, the validation error tends to decrease as the sample size increases. If the VC dimension is finite, when the sample size goes to infinity, the validation and training errors converge to the approximation error. Therefore, by extrapolating the training and validation curves we can try to guess the value of the approximation error, or at least get a rough estimate of an interval in which the approximation error resides.
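A rough numerical sketch of these learning curves (my own illustration; the data-generating process, the fixed capacity, and the sample sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(m):
    x = rng.uniform(-1, 1, m)
    return x, np.sin(3 * x) + 0.2 * rng.normal(size=m)

x_va, y_va = draw(2000)                            # large held-out set, a stand-in for L_D
degree = 3                                         # fixed (limited) model capacity

for m in [10, 30, 100, 300, 1000]:
    x_tr, y_tr = draw(m)
    w = np.polyfit(x_tr, y_tr, degree)
    tr = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    va = np.mean((np.polyval(w, x_va) - y_va) ** 2)
    print(f"m = {m:4d}  train {tr:.3f}  validation {va:.3f}")

# Training error tends to creep up and validation error to come down as m grows;
# both approach the same plateau, which is roughly the approximation error here.
```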

$ L_D(A(S)) - L_V(A(S)) $ can be tightly bounded according to Hoeffding's inequality:

Let $ Z_1, \ldots, Z_m $ be a sequence of i.i.d. random variables. Assume that $ E[Z_i] = \mu $ and $ P[a \leq Z_i \leq b] = 1 $ for every i. Then, for any $ \epsilon > 0 $:
$ P\left[ \left| \frac{1}{m} \sum_{i=1}^m Z_i - \mu \right| > \epsilon \right] \leq 2\exp(-2m\epsilon^2/(b - a)^2) $.
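A quick Monte Carlo sanity check of the inequality (illustrative; the Bernoulli distribution and the values of m and $ \epsilon $ are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, mu, eps, trials = 300, 0.3, 0.1, 100_000

Z = rng.random((trials, m)) < mu                   # i.i.d. Bernoulli(mu), so a = 0, b = 1
deviation = np.abs(Z.mean(axis=1) - mu)
empirical = np.mean(deviation > eps)               # estimated P[|mean - mu| > eps]
bound = 2 * np.exp(-2 * m * eps ** 2)              # Hoeffding bound with (b - a) = 1

print(f"empirical {empirical:.5f} <= bound {bound:.5f}")
```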

Applied to the validation set, this is a tighter bound for $ L_D $ than the bound for $ L_D $ in the definition of agnostic PAC learnability obtained via the sample complexity, which can be calculated from the fundamental theorem of statistical learning.

For example, assume $ loss \in [0, 1] $, VC dimension d = 100, $ \epsilon = 0.1 $, $ \delta = 0.1 $. According to the fundamental theorem of statistical learning:
$ C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \leq m_H(\epsilon, \delta) \leq C_2 \frac{d + \log(1/\delta)}{\epsilon^2} $

$ C_1 \times 10230.3 \leq m_H^{agnostic\,PAC}(\epsilon, \delta) \leq C_2 \times 10230.3 $

$ P[L_D(A(S)) \leq L_D(h^*) + \epsilon] \geq 1 - \delta $

$ P[|L_V(A(S)) - L_D(A(S))| \leq \epsilon] \geq 1 - 2\exp(-2 m_V \epsilon^2/(b-a)^2) $,

if $ m_V = 300 $ and $ \epsilon = 0.1 $, then:

$ P[|L_V(A(S)) - L_D(A(S))| \leq \epsilon] \geq 0.995 $
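The numbers above can be reproduced with a few lines (assuming the natural logarithm and loss bounded in [0, 1]):

```python
import math

d, eps, delta = 100, 0.1, 0.1
complexity = (d + math.log(1 / delta)) / eps ** 2
print(complexity)                                  # ~10230.3, up to the constants C1, C2

m_V = 300
hoeffding = 2 * math.exp(-2 * m_V * eps ** 2)      # (b - a) = 1 since loss is in [0, 1]
print(1 - hoeffding)                               # ~0.995
```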

Note:
The case of a large $ L_D(h^*) $ is called underfitting; the case of a large $ L_D(A(S)) - L_D(h^*) $ is called overfitting.

3.2.5 Methods to relieve overfitting for neural networks
3.2.5.1 Regularization

3.2.5.2 Dropout

3.2.5.3 Early Stopping
Gradient descent optimization updates w in the direction of the negative gradient, since this is the direction in which the objective function decreases fastest: if $ ||\Delta w||_2 $ is fixed, the objective function decreases most in this direction. The optimization fits the main trend of the sample first, because the main trend is shared by most examples, so fitting it decreases the loss (namely the objective function of machine learning) the most; and the main trend is consistent with the distribution, since the examples in the sample are generated from the distribution. Afterwards it starts to fit the tiny details of individual examples, which are specific to the sample and not general for the distribution; fitting these leads to overfitting. So when it starts to fit these details, namely when the loss on the validation set stops improving, it is good to stop the optimization; this is called early stopping.
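A minimal sketch of such an early-stopping loop (illustrative Python; `train_step` and `validation_loss` are hypothetical callables standing in for one optimization pass on the sample and for the validation loss $ L_V $):

```python
import copy

def train_with_early_stopping(model, train_step, validation_loss, max_epochs=1000, patience=10):
    """Keep training while the validation loss keeps improving; stop after
    `patience` epochs without improvement and return the best weights seen."""
    best_loss = float("inf")
    best_weights = copy.deepcopy(model)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_step(model)                          # one pass of gradient descent on the sample
        loss_v = validation_loss(model)            # L_V, a proxy for L_D

        if loss_v < best_loss:
            best_loss = loss_v
            best_weights = copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                              # validation loss stopped improving

    return best_weights, best_loss
```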

