
Tabular Value-Based Reinforcement Learning


Reading Notes about the book Deep Reinforcement Learning written by Aske Plaat

Recently, I have been reading the book Deep Reinforcement Learning by Aske Plaat. It is a good introduction to the theory of deep reinforcement learning, and I have found it very inspiring while studying the subject.

Tabular Value-Based Reinforcement Learning

Tabular Value-Based Agents

The reinforcement learning paradigm consists of an agent and an environment. The origin of this concept can be traced back to the way humans interact with the objective world. I plotted a graphical depiction to demonstrate this relationship, shown below.

Fig.1 Interaction between human and the world

  • Agent and Environment
    Formally, we can represent the relationship in the figure above: in state \(S_t\) the agent takes action \(a_t\), after which the environment returns the next state \(S_{t+1}\) and reward \(r_{t+1}\), on which the agent bases its next action \(a_{t+1}\) (a minimal interaction-loop sketch is given after this list).

    • \(S_t\stackrel{a_t}{\longrightarrow} S_{t+1}\)
    • \(S_{t+1}\stackrel{r_{t+1}}{\longrightarrow} a_{t+1}\)
  • Features

    • The environment gives us only a number as an indication of the quality of an action;
    • We can generate as many action-reward pairs as we need, without a large hand-labeled dataset (action and reward, action and reward...)
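To make this interaction concrete, here is a minimal sketch of the agent-environment loop in Python. The ToyEnvironment class, its step method, and random_policy are hypothetical placeholders (not from the book or any particular library); they only illustrate the state-action-reward cycle described above, in which every step yields another action-reward pair without any hand-labeled data.

```python
import random

class ToyEnvironment:
    """A tiny hand-made environment, used only to illustrate the loop."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        """Apply action a_t and return (s_{t+1}, r_{t+1}, done)."""
        self.state = (self.state + action) % 5
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

def random_policy(state, actions=(0, 1)):
    """The agent's behavior: here simply a uniform random action choice."""
    return random.choice(actions)

# Agent-environment interaction loop: in S_t the agent picks a_t,
# the environment returns S_{t+1} and r_{t+1}.
env = ToyEnvironment()
state, done = env.state, False
for t in range(100):                         # cap the episode length
    action = random_policy(state)            # agent selects a_t from S_t
    state, reward, done = env.step(action)   # environment yields S_{t+1}, r_{t+1}
    print(f"t={t}: action={action}, next_state={state}, reward={reward}")
    if done:
        break
```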

Markov Decision Process

Sequential decision problems can be modelled as Markov decision processes (MDPs).
The Markov property

  • the next state depends only on the current state and the actions available in it (the no-memory property)

A Markov decision process for reinforcement learning is defined as a 5-tuple (\(S, A, T_a, R_a, \gamma\)):

  • \(S\) is a finite set of legal states of the environment

  • \(A\) is a finite set of actions (\(A_s\) is the finite set of actions available in state \(s\))

  • \(T_a(s, s')=Pr(s_{t+1}=s'|s_t=s, a_t=a)\) is the probability that action \(a\) in state \(s\) at time \(t\) will transition to state \(s'\) at time \(t+1\)

  • \(R_a(s,s')\) is the reward received after action \(a\) transitions state \(s\) to state \(s'\)

  • \(\gamma\in[0, 1]\) is the discount factor, representing the difference in importance between present and future rewards.

  • State \(S\)

    • State representation: the state \(s\) contains the information needed to uniquely represent the configuration of the environment.
    • Deterministic Environment
      In discrete deterministic environments the transition function defines a one-step transition. That is, each action deterministically leads to a single new state.
    • Stochastic Environment
      The outcome of an action is not known by the agent beforehand; it depends on elements in the environment that the agent does not control or observe. A stochastic environment therefore yields a probability distribution over successor states rather than a single next state.
  • Action \(A\)

    • An action changes the state of the environment irreversibly.
    • The actions that the agent performs are also known as its behavior, analogous to human behavior.
    • Discrete or Continuous Action Space
      • The action space depends on the specific application: for example, the actions in board games are discrete, while the actions in robotics are continuous.
      • Value-Based methods work well for discrete action spaces, while Policy-Based methods work well for both discrete and continuous action spaces.
  • Transition \(T_a\)

    • Model-Free Reinforcement Learning
      Only the environment has access to the transition function; the agent does not. In this setting, the transition \(T_a(s, s')\) corresponds to the laws of nature internal to the environment, which the agent does not know.
    • Model-Based Reinforcement Learning
      Here the agent builds its own transition function, an approximation of the environment's transition function, learned from environment feedback. In my opinion, this is much like experience summarized from past feedback (a count-based sketch follows below).
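As a small illustration of the model-based idea above (the agent learning an approximation of the environment's transition function from feedback), the sketch below estimates \(T_a(s, s')\) by counting observed transitions. The helper names and the sample data are hypothetical, not taken from the book.

```python
from collections import defaultdict

# counts[(s, a)][s'] = number of times action a taken in state s led to s'
counts = defaultdict(lambda: defaultdict(int))

def record_transition(s, a, s_next):
    """Store one piece of environment feedback."""
    counts[(s, a)][s_next] += 1

def estimated_T(s, a, s_next):
    """Empirical estimate of T_a(s, s') = Pr(s_{t+1}=s' | s_t=s, a_t=a)."""
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total > 0 else 0.0

# Hypothetical feedback: action "right" in state "s1" usually reaches "s2".
for s_next in ["s2", "s2", "s1", "s2"]:
    record_transition("s1", "right", s_next)

print(estimated_T("s1", "right", "s2"))  # 0.75
```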

The dynamics of the MDP are modelled by the transition function \(T_a(\cdot)\) and the reward function \(R_a(\cdot)\).
In reinforcement learning, reward learning works by backpropagating reward values: in the decision tree, action selection moves down while reward learning flows up. In detail, the downward selection policy chooses which actions to explore, and the upward propagation of the error signal performs the learning of the policy.
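To show how \(T_a(\cdot)\), \(R_a(\cdot)\), and \(\gamma\) work together when reward information flows upward, here is a sketch of a single tabular value backup. The states, probabilities, and rewards are invented, and this particular backup rule (averaging successor values weighted by the transition probabilities) is a standard textbook form written with the symbols defined above, not a formula quoted from this chapter.

```python
# Tiny tabular dynamics (hypothetical numbers; structure follows T_a, R_a, gamma above).
gamma = 0.9
# T[a][s][s'] = Pr(s' | s, a);  R[a][(s, s')] = reward for that transition
T = {"right": {"s1": {"s2": 0.9, "s1": 0.1}}}
R = {"right": {("s1", "s2"): 1.0, ("s1", "s1"): 0.0}}
V = {"s1": 0.0, "s2": 0.0}  # current value estimates

# Upward propagation of reward for the chosen action a in state s:
# V(s) <- sum over s' of T_a(s, s') * [ R_a(s, s') + gamma * V(s') ]
s, a = "s1", "right"
V[s] = sum(p * (R[a][(s, s_next)] + gamma * V[s_next])
           for s_next, p in T[a][s].items())
print(V[s])  # 0.9 * (1.0 + 0.9*0.0) + 0.1 * (0.0 + 0.9*0.0) = 0.9
```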

  • Reward \(R_a\)
    Rewards are associated with single states, indicating their quality.

  • Value Function \(V^\pi(s)\)
    We are most often interested in the quality of a full decision-making sequence from root to leaves, rather than of a single action. The expected cumulative discounted future reward of a state is therefore called the value function; formally, \(V^\pi(s)=\mathbb{E}_\pi\big[\sum_{k=0}^{\infty}\gamma^k r_{t+k+1}\mid s_t=s\big]\).

  • Discount Factor \(\gamma\)
    In continuous and long-running tasks it makes sense to discount rewards from far in the future, in order to more strongly value current information at the present time. In my view, the discount factor is simply a weighting factor for rewards at each future time step.

  • Policy \(\pi\)
    The policy \(\pi\) is a conditional probability distribution that for each possible state specifies the probability of each possible action. Formally, the function \(\pi\) is a mapping from the state space to a probability distribution over the action space:

\[\pi: S\rightarrow p(A) \]

For a particular probability from this distribution we write \(\pi(a|s)\).
A special case of a policy is a deterministic policy, denoted by \(\pi(s)\), with the mapping:

\[\pi: S\rightarrow A \]
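To tie the policy and value definitions together in the tabular setting, here is a minimal sketch of how a stochastic policy \(\pi(a|s)\) and its deterministic special case \(\pi(s)\) can be stored and used. The states, actions, and probabilities are invented for illustration; only the table structures follow the definitions above.

```python
import random

# Stochastic policy: for each state, a probability distribution over actions.
# pi[s][a] = pi(a|s)
pi = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(state):
    """Draw an action a with probability pi(a|state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

# Deterministic special case: a direct mapping pi: S -> A.
pi_det = {"s0": "right", "s1": "left"}

print(sample_action("s0"))   # "left" or "right", with probabilities 0.3 / 0.7
print(pi_det["s0"])          # always "right"
```

Storing one entry per state (and per action) in tables like these is exactly what makes the methods in this chapter "tabular".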

