
【RL】CH1-Basic Concepts

Date: 2023-08-13 15:33:27

1.7 Markov decision processes

This section presents these concepts in a more formal way under the framework of Markov decision processes (MDPs).

An MDP is a general framework for describing stochastic dynamical systems. The key ingredients of an MDP are listed below.


  • State space: the set of all states, denoted as \(\mathcal{S}\).
  • Action space: a set of actions, denoted as \(\mathcal{A}(s)\), associated with each state \(s \in \mathcal{S}\).
  • Reward set: a set of rewards, denoted as \(\mathcal{R}(s, a)\), associated with each state-action pair \((s, a)\).


  • State transition probability: At state \(s\), when taking action \(a\), the probability of transitioning to state \(s^{\prime}\) is \(p\left(s^{\prime} \mid s, a\right)\). It holds that \(\sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right)=1\) for any \((s, a)\).
  • Reward probability: At state \(s\), when taking action \(a\), the probability of obtaining reward \(r\) is \(p(r \mid s, a)\). It holds that \(\sum_{r \in \mathcal{R}(s, a)} p(r \mid s, a)=1\) for any \((s, a)\).

Policy: At state \(s\), the probability of choosing action \(a\) is \(\pi(a \mid s)\). It holds that \(\sum_{a \in \mathcal{A}(s)} \pi(a \mid s)=1\) for any \(s \in \mathcal{S}\).
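The ingredients above can be written down concretely for a very small example. The following is a minimal sketch, assuming a hypothetical two-state MDP: the state names, action names, and all probabilities are made up for illustration and do not come from the grid world example. It stores \(p(s' \mid s, a)\), \(p(r \mid s, a)\), and \(\pi(a \mid s)\) as dictionaries and checks the three normalization conditions.

```python
# Hypothetical two-state MDP; every name and number here is illustrative.
S = ["s1", "s2"]                                        # state space
A = {"s1": ["stay", "move"], "s2": ["stay", "move"]}    # action space A(s)

# State transition probability p(s' | s, a)
P = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 0.8, "s2": 0.2},
}

# Reward probability p(r | s, a); the keys of each inner dict form R(s, a)
R = {
    ("s1", "stay"): {0.0: 1.0},
    ("s1", "move"): {1.0: 0.8, -1.0: 0.2},
    ("s2", "stay"): {1.0: 1.0},
    ("s2", "move"): {0.0: 1.0},
}

# Policy pi(a | s)
pi = {
    "s1": {"stay": 0.5, "move": 0.5},
    "s2": {"stay": 0.9, "move": 0.1},
}


def is_distribution(probs, tol=1e-9):
    """True if all values are nonnegative and sum to 1 (within tol)."""
    values = list(probs.values())
    return all(p >= 0 for p in values) and abs(sum(values) - 1.0) <= tol


def check_mdp(S, A, P, R, pi):
    """Check the normalization conditions for the model and the policy."""
    for s in S:
        # The policy must cover exactly the actions in A(s) and sum to 1.
        if set(pi[s]) != set(A[s]) or not is_distribution(pi[s]):
            return False
        for a in A[s]:
            # Next states must belong to S.
            if not set(P[(s, a)]) <= set(S):
                return False
            if not is_distribution(P[(s, a)]):
                return False
            if not is_distribution(R[(s, a)]):
                return False
    return True
```

Calling `check_mdp(S, A, P, R, pi)` returns `True` for the tables above. The dictionary layout is only one possible design choice; for larger problems, arrays indexed by state and action would be more common.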

Markov property: The Markov property refers to the memoryless property of a stochastic process. Mathematically, it means that

\[\begin{aligned} & p\left(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0\right)=p\left(s_{t+1} \mid s_t, a_t\right), \\ & p\left(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0\right)=p\left(r_{t+1} \mid s_t, a_t\right), \end{aligned} \tag{1.4} \]

where \(t\) represents the current time step and \(t+1\) represents the next time step. Equation (1.4) indicates that the next state and the next reward depend only on the current state and action and, given these, are independent of all earlier states and actions. The Markov property is important for deriving the fundamental Bellman equation of MDPs, as shown in the next chapter.
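One way to see the Markov property in practice is to sample a trajectory. The following is a minimal sketch, assuming a hypothetical two-state model and policy (all names and numbers are illustrative): at every step, the action, the reward, and the next state are drawn using only the current state and action, and the history is never consulted.

```python
import random

# Hypothetical model and policy; all names and numbers are illustrative.
P = {  # p(s' | s, a)
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 0.8, "s2": 0.2},
}
R = {  # p(r | s, a)
    ("s1", "stay"): {0.0: 1.0},
    ("s1", "move"): {1.0: 0.8, -1.0: 0.2},
    ("s2", "stay"): {1.0: 1.0},
    ("s2", "move"): {0.0: 1.0},
}
pi = {  # pi(a | s)
    "s1": {"stay": 0.5, "move": 0.5},
    "s2": {"stay": 0.9, "move": 0.1},
}


def sample(dist, rng):
    """Draw one outcome from a dict that maps outcomes to probabilities."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def rollout(P, R, pi, s0, steps, seed=0):
    """Sample (s_t, a_t, r_{t+1}, s_{t+1}) tuples for a number of steps."""
    rng = random.Random(seed)
    s = s0
    trajectory = []
    for _ in range(steps):
        a = sample(pi[s], rng)           # depends on the current state only
        r = sample(R[(s, a)], rng)       # depends on (s_t, a_t) only
        s_next = sample(P[(s, a)], rng)  # depends on (s_t, a_t) only
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory
```

For example, `rollout(P, R, pi, "s1", steps=5)` returns a list of five tuples. Note that `rollout` keeps the trajectory only for reporting; the sampling itself reads nothing but the current `s` and `a`.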

Here, \(p\left(s^{\prime} \mid s, a\right)\) and \(p(r \mid s, a)\) for all \((s, a)\) are called the model or dynamics. The model can be either stationary or nonstationary (in other words, time-invariant or time-variant). A stationary model does not change over time; a nonstationary model may vary over time. For instance, in the grid world example, if a forbidden area can appear or disappear over time, the model is nonstationary. In this book, we only consider stationary models.
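The distinction can be illustrated by letting the transition distribution depend on the time step. The following is a minimal sketch, assuming a single hypothetical state-action pair and a made-up rule for when a forbidden area is present; none of these names or numbers come from the grid world example.

```python
def p_stationary(t):
    """p(s' | s1, move) that is the same at every time step t."""
    return {"s1": 0.2, "s2": 0.8}


def p_nonstationary(t):
    """p(s' | s1, move) when a forbidden area blocks the move at even t.

    The even/odd rule is an assumption made purely for illustration.
    """
    if t % 2 == 0:
        return {"s1": 1.0}           # blocked: the agent stays in s1
    return {"s1": 0.2, "s2": 0.8}    # unblocked: the usual distribution


def is_stationary(model, horizon):
    """Check whether the model is identical at steps 0, ..., horizon - 1."""
    first = model(0)
    return all(model(t) == first for t in range(horizon))
```

Here `is_stationary(p_stationary, 10)` is `True` and `is_stationary(p_nonstationary, 10)` is `False`. The check only inspects a finite horizon, so it can show that a model is nonstationary but cannot prove stationarity in general.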

From: https://www.cnblogs.com/tuyuge/p/17626632.html

