
Artificial Intelligence: Principles and Techniques (Study Notes)


Lecture 2

  • Supervised learning: regression, classification, ...

  • Unsupervised learning: clustering, dimensionality reduction, ...

  • The canonical machine learning problem: Given a set of training data \(\{(x_i,y_i)\}_{i=1}^m\) and a loss function \(\ell\), find the parameters \(\theta\) that minimize the sum of losses \(f(\theta)=\sum\limits_{i=1}^m\ell(h_{\theta}(x_i),y_i)\).

  • Linear least squares problem: linear hypothesis function \(h_{\theta}(x)=\theta^Tx\), and \(\ell(h_{\theta}(x),y)=(h_{\theta}(x)-y)^2\).

    Solutions to the linear least squares problem:

    • Gradient descent: Repeat \(\theta\leftarrow\theta-\alpha\nabla_{\theta}f(\theta)\).
    • Analytical method: Let \(X=\begin{bmatrix}x_1^T\\x_2^T\\\vdots\\x_m^T\end{bmatrix},y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_m\end{bmatrix}\); then \(f(\theta)=\lVert X\theta-y\rVert^2\) and \(\nabla_{\theta}f=2X^T(X\theta-y)\). Solving \(2X^T(X\theta-y)=0\) gives \(\theta=(X^TX)^{-1}X^Ty\).
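
    A minimal NumPy sketch comparing the two solutions on synthetic data (the data, step size, and iteration count are my own choices for illustration):

    ```python
    import numpy as np

    # Synthetic data: m examples, n features (illustrative values, not from the notes).
    rng = np.random.default_rng(0)
    m, n = 100, 3
    X = rng.normal(size=(m, n))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)

    # Analytical solution: theta = (X^T X)^{-1} X^T y (solve the normal equations).
    theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

    # Gradient descent on f(theta) = ||X theta - y||^2 with gradient 2 X^T (X theta - y).
    theta = np.zeros(n)
    alpha = 1e-3                       # step size, chosen by hand
    for _ in range(5000):
        theta -= alpha * 2 * X.T @ (X @ theta - y)

    print(theta_closed, theta)         # the two estimates should nearly coincide
    ```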
  • Linear regression: the hypothesis function used to fit the training data is linear in the parameters, for a given choice of input features.

  • Linear classification (for \(2\) classes): the hypothesis function is linear in the parameters, for a given choice of input features, and the predicted class is the sign of the hypothesis function: \(\hat{y}=\text{sign}(h_{\theta}(x))\).

  • Loss functions for classification:

    • \(\ell_{0/1}:\ell(h_{\theta}(x),y)=[\text{sign}(h_{\theta}(x))\neq y]\) (exact minimization is NP-hard)
    • \(\ell_{\text{logistic}}:\ell(h_{\theta}(x),y)=\log(1+\exp(-y·h_{\theta}(x)))\)
    • \(\ell_{\text{hinge}}:\ell(h_{\theta}(x),y)=\max(1-y·h_{\theta}(x),0)\)
    • \(\ell_{\text{exp}}:\ell(h_{\theta}(x),y)=\exp(-y·h_{\theta}(x))\).

    Typically there is no closed-form solution, but these losses can be minimized by gradient descent.

  • Support vector machine: solves the canonical machine learning optimization problem using hinge loss and linear hypothesis, i.e. \(f(\theta)=\sum\limits_{i=1}^m\max(1-y_i·\theta^Tx_i,0)\).

  • Logistic regression: solves the canonical machine learning optimization problem using logistic loss and linear hypothesis, i.e. \(f(\theta)=\sum\limits_{i=1}^m\log(1+\exp(-y_i·\theta^Tx_i))\).
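
    A minimal sketch of logistic regression trained by gradient descent on the logistic loss above, with labels \(y\in\{-1,+1\}\) (the toy data and step size are assumptions):

    ```python
    import numpy as np

    # Toy binary data with labels in {-1, +1} (illustrative, not from the notes).
    rng = np.random.default_rng(1)
    m, n = 200, 2
    X = rng.normal(size=(m, n))
    y = np.sign(X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=m))

    def logistic_loss_grad(theta, X, y):
        """Gradient of f(theta) = sum_i log(1 + exp(-y_i * theta^T x_i))."""
        margins = y * (X @ theta)
        coeff = -y / (1.0 + np.exp(margins))   # d/dz log(1+exp(-z)) = -1/(1+exp(z))
        return X.T @ coeff

    theta = np.zeros(n)
    alpha = 0.01                               # step size, chosen by hand
    for _ in range(2000):
        theta -= alpha * logistic_loss_grad(theta, X, y)

    print(theta, np.mean(np.sign(X @ theta) == y))   # parameters and training accuracy
    ```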

  • Multiclass classification: Build \(k\) different classifiers \(h_{\theta_i}\) and output the prediction \(\operatorname{argmax}_ih_{\theta_i}(x)\). The loss function is defined as \(\ell(h_{\theta}(x),y)=-\log \dfrac{e^{h_{\theta_y}(x)}}{\sum_{i=1}^ke^{h_{\theta_i}(x)}}=\log \sum_{i=1}^ke^{h_{\theta_i}(x)}-h_{\theta_y}(x)\) (called the softmax loss or cross-entropy loss).
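
    A small sketch of this loss for linear per-class scores \(h_{\theta_i}(x)=\theta_i^Tx\) (the shapes and values are made up; the max-subtraction is only for numerical stability):

    ```python
    import numpy as np

    def softmax_cross_entropy(scores, y):
        """Loss log(sum_i exp(scores_i)) - scores_y."""
        shifted = scores - scores.max()        # subtracting the max changes nothing mathematically
        return np.log(np.exp(shifted).sum()) - shifted[y]

    # Toy example: k = 3 classes, 4 input features.
    Theta = np.zeros((3, 4))                   # one parameter vector theta_i per class
    x = np.array([1.0, 0.5, -0.2, 0.3])
    scores = Theta @ x                         # h_{theta_i}(x) for each class i
    print(softmax_cross_entropy(scores, y=1))  # equals log(3) when all scores are equal
    ```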

  • Overfitting: As the model becomes more complex, the training loss always decreases, while the generalization loss decreases up to a point and then starts to increase.

  • Cross-validation: Divide the data set into a training set and a holdout set; use the training set to determine the parameters and the holdout/validation set to determine the hyperparameters (degree of the polynomial, \(\lambda\) in regularization, ...).

  • Regularization: add a term \(\dfrac{\lambda}{2}\lVert\theta\rVert_2^2\) to the loss function \(f(\theta)\), where \(\lambda\) is a hyperparameter (as the model becomes more complex, the parameters tend to grow large in order to overfit the training data).
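
    With this penalty added to the linear least squares objective, setting the gradient \(2X^T(X\theta-y)+\lambda\theta\) to zero gives the closed form \(\theta=(X^TX+\tfrac{\lambda}{2}I)^{-1}X^Ty\). A one-function sketch (the helper name is mine):

    ```python
    import numpy as np

    def ridge_least_squares(X, y, lam):
        """Minimizer of ||X theta - y||^2 + (lam / 2) * ||theta||_2^2."""
        n = X.shape[1]
        return np.linalg.solve(X.T @ X + (lam / 2) * np.eye(n), X.T @ y)
    ```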

  • Even though we use a training/holdout split to fit the parameters, we are still effectively fitting the hyperparameters to the holdout set. Use a separate test set to evaluate the final performance. The best remedies are: evaluate your system “in the wild” as often as possible, recollect data if you suspect overfitting to the present data, ...

Lecture 3

  • Neural network: a composition of non-linear functions.
  • Elements of a neural network: weights \(w\), biases \(b\), activation functions \(f\) (\(z_i=f_i(W_iz_{i-1}+b_i)\)).
  • Common activation functions:
    • Sigmoid: \(f(x)=\dfrac{1}{1+e^{-x}}\).
    • Rectified linear unit (ReLU): \(f(x)=\max(x,0)\).
    • Hyperbolic tangent: \(f(x)=\tanh(x)=\dfrac{e^{2x}-1}{e^{2x}+1}\).
  • Stochastic Gradient Descent (SGD): adjust parameters based upon just one random sample (or a small random collection of samples, called batch), i.e. \(\theta\gets \theta-\alpha\nabla_{\theta}\ell(h_\theta(x_i),y_i)\) for some random \(i\).
  • Backpropagation algorithm: Use the chain rule to recursively compute the partial derivative of the loss function with respect to every parameter, from the last layer to the first.
  • Momentum: \(v_t=\beta v_{t-1}+(1-\beta)\nabla_{\theta}f(\theta)\) and \(\theta\gets \theta-\eta v_t\). Usually, \(\beta=0.9\).
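
    A minimal sketch of the momentum update above on a placeholder quadratic loss (the loss, step size, and iteration count are assumptions for illustration):

    ```python
    import numpy as np

    def momentum_step(theta, v, grad, beta=0.9, eta=0.1):
        """v <- beta * v + (1 - beta) * grad;  theta <- theta - eta * v."""
        v = beta * v + (1 - beta) * grad
        return theta - eta * v, v

    theta = np.array([1.0, -2.0])
    v = np.zeros_like(theta)
    for _ in range(500):
        grad = theta                   # gradient of the placeholder loss ||theta||^2 / 2
        theta, v = momentum_step(theta, v, grad)
    print(theta)                       # approaches the minimizer at the origin
    ```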

Lecture 4

  • Problems with fully connected networks: they need a very large number of parameters and are very likely to overfit the data; a generic deep network also does not capture the “natural” invariances we expect in images (translation, scale).

  • Convolutional Neural Networks have 4 types of layers:

    1. Convolution: require that activations between layers only occur in a “local” manner; require that all activations share the same weights.
    2. Non-linearity: Rectified Linear Unit (ReLU). Advantages: 1. fast to compute; 2. no cancellation problem; 3. a sparser activation volume; 4. mitigates the vanishing gradient problem.
    3. Pooling (or downsampling):
      • Pick a window size.
      • Pick a stride.
      • Walk the window across the image.
      • For each window, take the maximum value.
    4. Fully Connected Layer
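
    A small sketch of the max-pooling procedure described in item 3 above (window size, stride, and the toy input are illustrative):

    ```python
    import numpy as np

    def max_pool2d(image, window, stride):
        """Walk a window across the image and keep the maximum value in each window."""
        h, w = image.shape
        out_h = (h - window) // stride + 1
        out_w = (w - window) // stride + 1
        out = np.empty((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i * stride:i * stride + window, j * stride:j * stride + window]
                out[i, j] = patch.max()
        return out

    x = np.arange(16.0).reshape(4, 4)             # toy 4x4 "image"
    print(max_pool2d(x, window=2, stride=2))      # [[5, 7], [13, 15]]
    ```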
  • Recurrent Neural Network: a type of neural network with memory; the outputs of the hidden layer are stored in the memory, and the memory can be treated as another input. Used to deal with sequential data.

  • Problems with Vanilla RNNs:

    • exploding gradient \(\to\) gradient clipping
    • vanishing gradient \(\to\) change RNN architecture \(\to\) LSTM

Lecture 5

  • A search problem consists of:

    • State space \(S\)
    • Start state \(s_{start}\in S\)
    • Possible actions \(\text{Actions}(s)\)
    • Successor: \(\text{Succ}(s,a)\)
    • Action cost \(\text{Cost}(s,a)\geq 0\)
    • Goal test \(\text{IsEnd}(s)\).

    A solution is a sequence of actions (a plan) which transforms the start state to a goal state.

  • Search heuristic: A function \(h(x)\) that estimates how close a state is to a goal.

  • Uniform cost search (uninformed): \(h(x)=0\). An informed search can tell whether one non-goal state is more promising than another.

  • Admissible heuristic: \(h(x)\leq \text{the true cost to a nearest goal}\). If \(h\) is admissible, A* is optimal (it finds a min-cost solution) and also optimally efficient (it expands the minimum number of nodes: any algorithm that expands fewer nodes than A* cannot guarantee that its solution is optimal).

  • A* tree search: Expand frontier nodes in increasing order of \(f(x)+h(x)\), where \(f(x)\) is the least cost found from the start state to \(x\). The algorithm is optimal if the heuristic is admissible, but the time complexity of A* tree search can be exponential.

  • Consistent heuristic: \(h(x)-h(y)\leq \text{cost}(x\to y)\) for every edge \(x\to y\). If \(h\) is consistent, then the first time a state is expanded we have already found a shortest path to it, so every state is expanded at most once.

  • A* graph search: Similar to A* tree search, but it guarantees that each node is expanded at most once. The algorithm is optimal if the heuristic is consistent.
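
    A compact sketch of A* graph search over the search-problem interface from this lecture (\(\text{Actions}\), \(\text{Succ}\), \(\text{Cost}\), \(\text{IsEnd}\), and a heuristic \(h\)); the toy line-world at the end is my own example:

    ```python
    import heapq
    import itertools

    def a_star(start, is_end, actions, succ, cost, h):
        """A* graph search: pop states in increasing order of (cost so far) + h(state)."""
        counter = itertools.count()                 # tie-breaker so the heap never compares states
        frontier = [(h(start), 0, next(counter), start, [])]
        expanded = set()
        while frontier:
            _, g, _, s, plan = heapq.heappop(frontier)
            if s in expanded:
                continue                            # each state is expanded at most once
            expanded.add(s)
            if is_end(s):
                return g, plan                      # total cost and the plan (sequence of actions)
            for a in actions(s):
                s2, c = succ(s, a), cost(s, a)
                if s2 not in expanded:
                    heapq.heappush(frontier, (g + c + h(s2), g + c, next(counter), s2, plan + [a]))
        return None                                 # no solution

    # Toy line world: move from 0 to 5 with steps of +1/-1, unit cost, h = remaining distance.
    print(a_star(0,
                 is_end=lambda s: s == 5,
                 actions=lambda s: [+1, -1],
                 succ=lambda s, a: s + a,
                 cost=lambda s, a: 1,
                 h=lambda s: abs(5 - s)))
    ```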

Lecture 6

  • A Markov Decision Process (MDP) consists of: a state space \(S\), a start state, actions \(a\), a transition function \(T(s,a,s')=P(s'|s,a)\), a reward function \(R(s)\) (sometimes \(R(s,a)\) or \(R(s,a,s')\)), and possibly terminal states.

  • A solution to an MDP is a policy:

    • Non-stationary policy: function from states and times to actions.
    • Stationary policy: mapping from states to actions.
    • Both non-stationary and stationary policies satisfy the following properties:
      • Full observability of the state
      • History-independence
      • Deterministic action choice
  • Infinite Horizon Discounted Value: \(V^{\pi}(s)=E\left[\sum_{t=0}^{\infty}\gamma^tR(s_t)\right]\). It satisfies the Bellman equation: \(V^{\pi}(s)=R(s)+\gamma\sum_{s'\in S}P(s'|s,\pi(s))V^{\pi}(s')\).

  • Value iteration: repeatedly apply the Bellman backup operator \(B:\mathbb R^{|S|}\to\mathbb R^{|S|}\), \(BV(s)=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V(s')\). The operator is a contraction: for any \(V_1,V_2\), \(\lVert BV_1-BV_2\rVert_\infty\le\gamma\lVert V_1-V_2\rVert_\infty\) (derivation below), so it has a unique fixed point \(V^*\) with \(V^*=BV^*\).

    \[\begin{aligned} \lVert BV_1-BV_2\rVert_\infty&=\gamma\max_{s\in S}\left|\left(\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V_1(s')\right)-\left(\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V_2(s')\right)\right|\\ &\leq\gamma\max_{s\in S}\max_{a\in A}\left|\sum_{s'\in S}P(s'|s,a)\left(V_1(s')-V_2(s')\right)\right|\\ &\leq\gamma\max_{s\in S}\max_{a\in A}\sum_{s'\in S}P(s'|s,a)\left|V_1(s')-V_2(s')\right|\\ &\leq\gamma\max_{s\in S}\max_{a\in A}\max_{s'\in S}|V_1(s')-V_2(s')|\\ &=\gamma\lVert V_1-V_2\rVert_\infty \end{aligned} \]

    So \(V^*(s)=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s'|s,a)V^*(s')\). The fixed point is optimal: for any policy \(\pi\), \(BV^{\pi}\geq V^{\pi}\), and since \(B\) is monotone, by induction \(B^kV^{\pi}\geq V^{\pi}\); letting \(k\to\infty\) gives \(V^*\geq V^{\pi}\).

  • The convergence rate of value iteration: Assume the rewards lie in \([0,R_{max}]\); then \(V^*(s)\le\sum\limits_{t=0}^{\infty}\gamma^tR_{max}=\dfrac{R_{max}}{1-\gamma}\), and we can prove \(\max\limits_{s\in S}|V^k(s)-V^*(s)|\le\dfrac{\gamma^k R_{max}}{1-\gamma}\) by induction. In other words, we have linear convergence to the optimal value function.

  • Stopping Condition: Since \(\lVert V^k-V^{k-1}\rVert_{\infty}\le\epsilon\Rightarrow\lVert V^k-V^*\rVert_{\infty}\le\epsilon\dfrac{\gamma}{1-\gamma}\), we can continue iteration until \(\lVert V^k-V^{k-1}\rVert_{\infty}\le\epsilon\) for some small constant \(\epsilon\).
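
    A minimal value-iteration sketch for a small tabular MDP, using the stopping condition above (the transition and reward arrays are assumed inputs in the format described in the comments):

    ```python
    import numpy as np

    def value_iteration(P, R, gamma, eps=1e-6):
        """P has shape (S, A, S) with P[s, a, s'] = P(s'|s,a); R has shape (S,)."""
        V = np.zeros(R.shape[0])
        while True:
            # Bellman backup: BV(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) V(s')
            V_new = R + gamma * (P @ V).max(axis=1)
            if np.max(np.abs(V_new - V)) <= eps:    # ||V^k - V^{k-1}||_inf <= eps
                return V_new
            V = V_new

    # Toy 2-state, 2-action MDP (made-up numbers for illustration).
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])
    R = np.array([1.0, 0.0])
    print(value_iteration(P, R, gamma=0.9))
    ```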

  • Policy iteration: \(\pi^{k+1}(s)\gets \operatorname{argmax}_a\sum_{s'\in S}P(s'|s,a)V^{\pi^k}(s')\), where \(V^{\pi^k}\) can be computed by solving a linear system.

  • Modified Policy Iteration: for policy evaluation, apply the Bellman update repeatedly (\(k\) times) instead of solving a linear system.
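
    A sketch of policy iteration where policy evaluation solves the linear system \((I-\gamma P^{\pi})V^{\pi}=R\) exactly (same assumed tabular format as the value-iteration sketch above):

    ```python
    import numpy as np

    def policy_iteration(P, R, gamma):
        """P has shape (S, A, S); R has shape (S,). Returns a policy and its value function."""
        S = R.shape[0]
        pi = np.zeros(S, dtype=int)
        while True:
            # Policy evaluation: solve (I - gamma * P_pi) V = R, where P_pi[s, s'] = P[s, pi(s), s'].
            P_pi = P[np.arange(S), pi]
            V = np.linalg.solve(np.eye(S) - gamma * P_pi, R)
            # Policy improvement: pi(s) <- argmax_a sum_{s'} P(s'|s,a) V(s')
            pi_new = (P @ V).argmax(axis=1)
            if np.array_equal(pi_new, pi):
                return pi, V
            pi = pi_new

    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])
    R = np.array([1.0, 0.0])
    print(policy_iteration(P, R, gamma=0.9))
    ```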

Lecture 7

  • Reinforcement Learning: the transition function and reward function of the MDP are unknown.

  • Model-based approach to RL: learn the MDP model, or an approximation of it, and use it for policy evaluation or to find the optimal policy.

    Model-free approach to RL: derive the optimal policy without explicitly learning the model.

  • Passive learning: The agent has a fixed policy and tries to learn the utilities of states by observing the world go by.

    • Approach 1 (model-free): Monte-Carlo direct estimation: estimate \(V^{\pi}(s)\) as the average total reward observed after visiting \(s\), over episodes that contain \(s\).
    • Approach 2 (model-based): Adaptive Dynamic Programming (ADP): Estimate \(P(s'|s,a)\) by sampling \(s'\), then use estimated model to compute utility of policy.
    • Approach 3 (model-free): Temporal Difference Learning (TD). After we take an action from \(s\) to \(s'\), \(V^\pi(s)\gets V^\pi(s)+\alpha(R(s)+\gamma V^{\pi}(s')-V^\pi(s))\). Intuition: \(\frac{\sum_{i=1}^{n+1}x_i}{n+1}=\frac{\sum_{i=1}^nx_i}{n}+\frac{1}{n+1}\left(x_{n+1}-\frac{\sum_{i=1}^nx_i}{n}\right)\). If we use a decreasing learning rate (e.g. \(\dfrac{1}{n}\)), the estimate will converge to the true value.
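
    A short sketch of this TD(0) update for passive policy evaluation, where an episode is assumed to be a list of (state, reward) pairs observed while following the fixed policy:

    ```python
    def td0_evaluate(episodes, n_states, alpha=0.1, gamma=0.9):
        """TD update V(s) <- V(s) + alpha * (R(s) + gamma * V(s') - V(s)) along observed transitions."""
        V = [0.0] * n_states
        for episode in episodes:                       # each episode: [(state, reward), ...]
            for (s, r), (s_next, _) in zip(episode, episode[1:]):
                V[s] += alpha * (r + gamma * V[s_next] - V[s])
        return V
    ```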
  • Active learning: The agent does not have a fixed policy and must also choose how to act.

    • ADP-based RL: Start with an initial model, solve for the optimal policy given the current model (using value or policy iteration), take an action according to an exploration/exploitation policy, and update the estimated model based on an observed transition.
    • TD-based RL: Start with initial value function, take action from an exploration/exploitation policy giving new state \(s'\) (should converge to greedy policy), update estimated model, perform TD update \(V(s)\gets V(s)+\alpha(R(s)+\gamma V(s')-V(s))\).
    • Q-learning (\(Q(s, a)\) is the expected value of taking action a in state s and then following the optimal policy thereafter): Start with initial Q-function, take action from an exploration/exploitation policy giving new state \(s'\), perform TD update \(Q(s,a)\gets Q(s,a)+\alpha(R(s)+\gamma\max_{a'}Q(s',a')-Q(s,a))\).
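
    A minimal tabular Q-learning sketch implementing the update above; the environment interface (`reset()` and `step(a)` returning `(next_state, reward, done)`) and the ε-greedy exploration policy are assumptions for illustration:

    ```python
    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        """Tabular Q-learning with an epsilon-greedy exploration/exploitation policy (assumed env API)."""
        rng = np.random.default_rng(0)
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # Exploration/exploitation: random action with probability epsilon, else greedy.
                a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
                s_next, r, done = env.step(a)
                # TD update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
                target = r + gamma * (0.0 if done else Q[s_next].max())
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q
    ```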
  • Exploration policy: we would like an exploration policy that is greedy in the limit of infinite exploration (GLIE).

    • GLIE Policy 1: act randomly with a probability that depends on \(t\) (decreasing over time), and act greedily otherwise.

    • GLIE Policy 2: Boltzmann Exploration. \(\text{Pr}(a|s)=\frac{\exp(Q(s,a)/T)}{\sum_{a'\in A}\exp(Q(s,a')/T)}\). \(T\) is the temperature. Large \(T\) means that each action has about the same probability. Small \(T\) leads to more greedy behavior. Typically start with large \(T\) and decrease with time.
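
    A small sketch of Boltzmann exploration over the Q-values of a single state; the Q-values and the temperature schedule are made up for illustration:

    ```python
    import numpy as np

    def boltzmann_action(q_values, T, rng):
        """Sample an action with probability proportional to exp(Q(s,a) / T)."""
        logits = q_values / T
        logits = logits - logits.max()            # shift for numerical stability (cancels in the ratio)
        probs = np.exp(logits)
        probs /= probs.sum()
        return rng.choice(len(q_values), p=probs)

    rng = np.random.default_rng(0)
    q = np.array([1.0, 2.0, 0.5])
    for T in (10.0, 1.0, 0.1):                    # decreasing temperature: near-uniform -> greedy
        counts = np.bincount([boltzmann_action(q, T, rng) for _ in range(1000)], minlength=3)
        print(T, counts / 1000)
    ```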

From: https://www.cnblogs.com/tzcwk/p/18548779
