6.1 Motivating examples

Mean Estimation

Revisit the mean estimation problem:

Consider a random variable $X$.
Our aim is to estimate $\mathbb{E}[X]$.
Suppose that we collected a sequence of iid samples $\left\{x_i\right\}_{i=1}^N$.
The expectation of $X$ can be approximated by

\[\mathbb{E}[X] \approx \bar{x}:=\frac{1}{N} \sum_{i=1}^N x_i . \]

That is, draw $N$ samples, collect all the data, and then take the average.
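As a minimal sketch of this estimate, the snippet below averages iid samples and prints how the error changes with $N$. The Gaussian distribution, its mean of 5, the seed, and the helper names `monte_carlo_mean` and `draw` are assumptions chosen only for illustration; they are not part of the lecture.

```python
import random


def monte_carlo_mean(sampler, num_samples):
    """Approximate E[X] by averaging num_samples iid draws from sampler()."""
    total = 0.0
    for _ in range(num_samples):
        total += sampler()
    return total / num_samples


rng = random.Random(0)  # fixed seed so that runs are repeatable
true_mean = 5.0         # assumed here for illustration; unknown in practice


def draw():
    # Hypothetical sampler: X ~ Normal(true_mean, standard deviation 2)
    return rng.gauss(true_mean, 2.0)


for n in (10, 1_000, 100_000):
    estimate = monte_carlo_mean(draw, n)
    print(n, estimate, abs(estimate - true_mean))
```

The error typically shrinks as $N$ grows, but every sample must be collected before the average can be computed.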

We already know from the last lecture:

This approximation is the basic idea of Monte Carlo estimation.
We know that $\bar{x} \rightarrow \mathbb{E}[X]$ as $N \rightarrow \infty$. That is, $\bar{x}$ gradually approaches the true value.
Why do we care about mean estimation so much?
Many values in RL such as state/action values are defined as means. These means must be estimated from data.

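Computing the mean iteratively

How can we compute the mean in an incremental and iterative manner? Updating the estimate each time a new sample arrives, rather than waiting for all samples, is more efficient.

Suppose: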

\[w_{k+1}=\frac{1}{k} \sum_{i=1}^k x_i, \quad k=1,2, \ldots \]

Then we obtain:

\[w_{k+1}=w_k-\frac{1}{k}\left(w_k-x_k\right) \]
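The step from the definition to this recursion can be written out as follows. By the definition, $w_k=\frac{1}{k-1} \sum_{i=1}^{k-1} x_i$ for $k \geq 2$, so $\sum_{i=1}^{k-1} x_i=(k-1) w_k$ and

\[w_{k+1}=\frac{1}{k} \sum_{i=1}^k x_i=\frac{1}{k}\left(\sum_{i=1}^{k-1} x_i+x_k\right)=\frac{1}{k}\left((k-1) w_k+x_k\right)=w_k-\frac{1}{k}\left(w_k-x_k\right) . \]

For $k=1$ the recursion gives $w_2=w_1-\left(w_1-x_1\right)=x_1$, which matches the definition for any initial value $w_1$.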

\[w_k \rightarrow \mathbb{E}[X] \text { as } k \rightarrow \infty \]
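The recursion above can be sketched in a few lines. The function name `incremental_mean` and the sample values are assumptions for illustration; the update itself is the formula $w_{k+1}=w_k-\frac{1}{k}(w_k-x_k)$.

```python
def incremental_mean(samples, w1=0.0):
    """Running mean via w_{k+1} = w_k - (1/k) * (w_k - x_k)."""
    w = w1
    for k, x in enumerate(samples, start=1):
        # Only the current estimate and the new sample are needed;
        # earlier samples do not have to be stored.
        w = w - (w - x) / k
    return w


samples = [2.0, 4.0, 6.0, 8.0]
print(incremental_mean(samples))            # running-mean estimate
print(sum(samples) / len(samples))          # batch average for comparison
print(incremental_mean(samples, w1=100.0))  # the initial guess is overwritten at k = 1
```

Because the first step uses the coefficient $1/1$, the initial guess `w1` has no effect on the result.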

6.2 Robbins-Monro algorithm

Stochastic approximation (SA)

SA refers to a broad class of stochastic iterative algorithms solving root finding or optimization problems.
Compared to many other root-finding algorithms such as gradient-based methods, SA is powerful in the sense that it requires neither the expression of the objective function nor its derivative.

Problem statement

Suppose we would like to find the root of the equation

\[g(w)=0, \]

where $w \in \mathbb{R}$ is the variable to be solved and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a function.

Many problems can eventually be converted to this root-finding problem. For example, suppose $J(w)$ is an objective function to be minimized. Then, the optimization problem can be converted to

\[g(w)=\nabla_w J(w)=0 \]

That is, the gradient is zero.

Note that an equation like $g(w)=c$ with $c$ as a constant can also be converted to the above equation by rewriting $g(w)-c$ as a new function.

The RM algorithm

Solving the problem $g(w)=0$

The Robbins-Monro (RM) algorithm can solve this problem:

\[w_{k+1}=w_k-a_k \tilde{g}\left(w_k, \eta_k\right), \quad k=1,2,3, \ldots \]

where

$w_k$ is the $k$th estimate of the root
$\tilde{g}\left(w_k, \eta_k\right)=g\left(w_k\right)+\eta_k$ is the $k$th noisy observation
$a_k$ is a positive coefficient.
The function $g(w)$ is a black box! This algorithm relies on data:
Input sequence: $\left\{w_k\right\}$
Noisy output sequence: $\left\{\tilde{g}\left(w_k, \eta_k\right)\right\}$
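The following is a minimal sketch of the RM iteration, not a definitive implementation. The test function $g(w)=(w-1)+0.5 \sin (w-1)$ (root at $w=1$), the Gaussian noise, the step size $a_k=1/k$, the initial guess, and the names `robbins_monro` and `noisy_g` are all assumptions chosen for illustration. The solver only ever calls the noisy observation, so $g$ stays a black box to it.

```python
import math
import random


def g(w):
    # Hypothetical test function with root w* = 1; hidden from the solver.
    return (w - 1.0) + 0.5 * math.sin(w - 1.0)


def robbins_monro(noisy_g, w1, num_iters):
    """Iterate w_{k+1} = w_k - a_k * g~(w_k, eta_k) with a_k = 1/k."""
    w = w1
    for k in range(1, num_iters + 1):
        a_k = 1.0 / k              # positive coefficient (assumed choice)
        w = w - a_k * noisy_g(w)   # uses only the noisy observation
    return w


rng = random.Random(0)  # fixed seed so that runs are repeatable


def noisy_g(w):
    # Noisy observation g(w) + eta with zero-mean Gaussian noise eta.
    return g(w) + rng.gauss(0.0, 1.0)


w_est = robbins_monro(noisy_g, w1=5.0, num_iters=5000)
print(w_est)  # expected to be close to the root w* = 1
```

For this particular $g$, the derivative $1+0.5 \cos (w-1)$ lies in $[0.5, 1.5]$, so the example is chosen to have a positive, bounded gradient.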

In the Robbins-Monro algorithm, if

$0<c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$;
$\sum_{k=1}^{\infty} a_k=\infty$ and $\sum_{k=1}^{\infty} a_k^2<\infty$;
$\mathbb{E}\left[\eta_k \mid \mathcal{H}_k\right]=0$ and $\mathbb{E}\left[\eta_k^2 \mid \mathcal{H}_k\right]<\infty$;
where $\mathcal{H}_k=\left\{w_k, w_{k-1}, \ldots\right\}$, then $w_k$ converges with probability 1 (w.p. 1) to the root $w^*$ satisfying $g\left(w^*\right)=0$.
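To make the step-size condition concrete, the sketch below checks it numerically for the choice $a_k=1/k$, which is an assumed example rather than the only valid sequence. The partial sums of $a_k$ keep growing (the harmonic series diverges), while the partial sums of $a_k^2$ approach $\pi^2/6$. The function name `partial_sums` is hypothetical.

```python
import math


def partial_sums(num_terms):
    """Return (sum of a_k, sum of a_k^2) for a_k = 1/k, k = 1..num_terms."""
    sum_a = 0.0
    sum_a_sq = 0.0
    for k in range(1, num_terms + 1):
        a_k = 1.0 / k
        sum_a += a_k
        sum_a_sq += a_k * a_k
    return sum_a, sum_a_sq


for n in (10, 1_000, 100_000):
    sum_a, sum_a_sq = partial_sums(n)
    # sum_a grows roughly like ln(n); sum_a_sq stays below pi^2 / 6.
    print(n, sum_a, sum_a_sq, math.pi ** 2 / 6)
```

Intuitively, the divergent sum keeps the steps large enough in total to reach the root from any starting point, while the finite sum of squares makes the steps shrink fast enough for the noise to average out.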

From： https://www.cnblogs.com/tuyuge/p/17624998.html


[RL] Lecture 6: Stochastic Approximation and Stochastic Gradient Descent