1.7 Markov decision processes

This section restates the concepts introduced earlier in this chapter more formally, within the framework of Markov decision processes (MDPs).

An MDP is a general framework for describing stochastic dynamical systems. The key ingredients of an MDP are listed below.

Sets:

State space: the set of all states, denoted as $\mathcal{S}$.
Action space: a set of actions, denoted as $\mathcal{A}(s)$, associated with each state $s \in \mathcal{S}$.
Reward set: a set of rewards, denoted as $\mathcal{R}(s, a)$, associated with each state-action pair $(s, a)$.

Model:

State transition probability: At state $s$, when taking action $a$, the probability of transitioning to state $s^{\prime}$ is $p\left(s^{\prime} \mid s, a\right)$. It holds that $\sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right)=1$ for any $(s, a)$.
Reward probability: At state $s$, when taking action $a$, the probability of obtaining reward $r$ is $p(r \mid s, a)$. It holds that $\sum_{r \in \mathcal{R}(s, a)} p(r \mid s, a)=1$ for any $(s, a)$.

Policy: At state $s$, the probability of choosing action $a$ is $\pi(a \mid s)$. It holds that $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s)=1$ for any $s \in \mathcal{S}$.

Markov property: The Markov property refers to the memoryless property of a stochastic process. Mathematically, it means that

\[\begin{aligned} & p\left(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0\right)=p\left(s_{t+1} \mid s_t, a_t\right), \\ & p\left(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0\right)=p\left(r_{t+1} \mid s_t, a_t\right), \end{aligned} \tag{1.4}\]

where $t$ represents the current time step and $t+1$ represents the next time step. Equation (1.4) indicates that the next state and the next reward depend only on the current state and action; given these, they are independent of all earlier states and actions. The Markov property is important for deriving the fundamental Bellman equation of MDPs, as shown in the next chapter.
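The ingredients above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the book: the two-state MDP, its action names, and all probabilities are made-up assumptions, and the helper names (`is_distribution`, `validate`, `sample`, `rollout`) are hypothetical. The sketch stores $p(s' \mid s, a)$, $p(r \mid s, a)$, and $\pi(a \mid s)$ as tables, checks the three normalization conditions, and samples a trajectory in which each step reads only the current state and action, which is how the Markov property shows up in code.

```python
import random

# Hypothetical two-state MDP; every name and number here is an illustrative assumption.
STATES = ["s1", "s2"]
ACTIONS = {"s1": ["stay", "move"], "s2": ["stay", "move"]}  # A(s)

# State transition probability p(s' | s, a), keyed by (s, a).
P_STATE = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s1": 0.1, "s2": 0.9},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 0.9, "s2": 0.1},
}

# Reward probability p(r | s, a), keyed by (s, a).
P_REWARD = {
    ("s1", "stay"): {0.0: 1.0},
    ("s1", "move"): {0.0: 0.5, 1.0: 0.5},
    ("s2", "stay"): {1.0: 1.0},
    ("s2", "move"): {-1.0: 1.0},
}

# Policy pi(a | s), keyed by s.
POLICY = {
    "s1": {"stay": 0.2, "move": 0.8},
    "s2": {"stay": 0.7, "move": 0.3},
}


def is_distribution(dist, tol=1e-9):
    """Return True if all probabilities are nonnegative and sum to 1."""
    return all(p >= 0 for p in dist.values()) and abs(sum(dist.values()) - 1.0) <= tol


def validate(p_state, p_reward, policy):
    """Check the normalization conditions for the model and the policy."""
    return (
        all(is_distribution(d) for d in p_state.values())
        and all(is_distribution(d) for d in p_reward.values())
        and all(is_distribution(d) for d in policy.values())
    )


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} table."""
    outcomes = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


def rollout(start, steps, rng):
    """Sample (s_t, a_t, r_{t+1}, s_{t+1}) tuples.

    Each step looks up only the current state and action, never the history.
    """
    s = start
    trajectory = []
    for _ in range(steps):
        a = sample(POLICY[s], rng)
        r = sample(P_REWARD[(s, a)], rng)
        s_next = sample(P_STATE[(s, a)], rng)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory


if __name__ == "__main__":
    print("valid MDP tables:", validate(P_STATE, P_REWARD, POLICY))
    for step in rollout("s1", 5, random.Random(0)):
        print(step)
```

Keying the tables by $(s, a)$ rather than by the full history is a design choice that only makes sense because of the Markov property; a process with memory would need the whole trajectory as the lookup key.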

Here, $p\left(s^{\prime} \mid s, a\right)$ and $p(r \mid s, a)$ for all $(s, a)$ are called the model or dynamics. The model can be either stationary or nonstationary (in other words, time-invariant or time-variant). A stationary model does not change over time; a nonstationary model may vary over time. For instance, in the grid world example, if a forbidden area can appear or disappear over time, the model is nonstationary. In this book, we only consider stationary models.
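The distinction between stationary and nonstationary models can be illustrated with a small sketch. Everything in it is an assumption made for illustration: a three-cell corridor stands in for the grid world, entering a forbidden cell yields a reward of $-1$, and the names `next_cell`, `make_reward_model`, and `varies_over_time` are hypothetical. The reward model takes the time step $t$ as an extra argument; a stationary model ignores it, whereas a nonstationary one does not.

```python
CELLS = (0, 1, 2)            # hypothetical three-cell corridor
ACTIONS = ("left", "right")


def next_cell(s, a):
    """Deterministic move along the corridor, clipped at both ends."""
    return min(s + 1, 2) if a == "right" else max(s - 1, 0)


def make_reward_model(forbidden_at):
    """Build p(r | s, a, t) from forbidden_at(t), the set of forbidden cells at step t."""
    def p_reward(s, a, t):
        # Assumed rewards: -1 for entering a forbidden cell, 0 otherwise.
        r = -1.0 if next_cell(s, a) in forbidden_at(t) else 0.0
        return {r: 1.0}
    return p_reward


# Stationary: cell 1 is always forbidden, so t has no effect.
stationary = make_reward_model(lambda t: {1})

# Nonstationary: cell 1 is forbidden only during steps 3 to 5.
nonstationary = make_reward_model(lambda t: {1} if 3 <= t <= 5 else set())


def varies_over_time(p_reward, horizon):
    """Return True if any (s, a) has a reward distribution that changes within the horizon."""
    for s in CELLS:
        for a in ACTIONS:
            reference = p_reward(s, a, 0)
            if any(p_reward(s, a, t) != reference for t in range(1, horizon)):
                return True
    return False


if __name__ == "__main__":
    print("stationary varies:", varies_over_time(stationary, 10))
    print("nonstationary varies:", varies_over_time(nonstationary, 10))
```

The check only inspects a finite horizon, so it can show that a model is nonstationary but cannot prove that a model is stationary for all time.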

From： https://www.cnblogs.com/tuyuge/p/17626632.html
