
How to Train your Bee

Posted: 2024-09-19 14:34:27


Assignment 2 Help Guide

[Image: "Calculate an optimal policy for..." (© Dreamworks, "Bee Movie")]

Recap: Bellman Equation

  • The Bellman Equation is used to calculate the optimal value of a state
  • The equation looks complicated, but it is just the highest expected reward from the best action:
    V*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ]
  • P(s' | s, a) is the probability of entering the next state s' given we perform action a in state s
  • R(s, a, s') is the reward received for performing action a in state s and entering the next state s'
  • γ is the discount factor (makes immediate rewards worth more than future rewards)
  • V*(s') is the value of the next state s'
  • The optimal policy, π, is the best available action that can be performed in each state
  • The value of a state is given by the highest expected reward when following the optimal policy

Recap: Value Iteration
  1. Arbitrarily assign a value to each state (e.g. set each state to 0)
  2. Until convergence:
  • Calculate the Q(s,a) values for every state for that iteration, and determine the action that maximises Q(s,a) for each state
  • Calculate the value for every state using the optimal action for that iteration
  • Convergence occurs if the difference in state values between two iterations is less than some value ε
  • Tutorial 6 involves implementing Value Iteration in the simple Gridworld environment

    Q(s, a) = R(s, a) + γ Σ_{s'} P(s' | s, a) V(s')

Recap: Policy Iteration
  1. Arbitrarily assign a policy to each state (e.g. the action to be performed in every state is LEFT)
  2. Until convergence:
  • Policy Evaluation: Determine the value of every state based on the current policy
  • Policy Improvement: Determine the best action to be performed in every state based on the values of the current policy, then update the policy based on the new best action
  • Convergence occurs if the policy does not change between two iterations
  • Tutorial 7 involves implementing Policy Iteration in the simple Gridworld environment
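The loop described above can be sketched as follows. This is a minimal illustration, not the assignment's required interface: `states`, `actions`, and `get_transition_outcomes(state, action)` (returning `(next_state, probability, reward)` tuples) are assumed inputs, and the discount factor and evaluation tolerance are placeholder values.

```python
def policy_iteration(states, actions, get_transition_outcomes,
                     gamma=0.9, epsilon=1e-4):
    # 1. Arbitrary initial policy: first action in every state.
    policy = {s: actions[0] for s in states}
    values = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate values under the fixed policy.
        while True:
            delta = 0.0
            for s in states:
                new_v = sum(p * (r + gamma * values[s2])
                            for s2, p, r in get_transition_outcomes(s, policy[s]))
                delta = max(delta, abs(new_v - values[s]))
                values[s] = new_v
            if delta < epsilon:
                break
        # Policy improvement: greedy action under the current values.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: sum(
                p * (r + gamma * values[s2])
                for s2, p, r in get_transition_outcomes(s, a)))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:  # convergence: policy unchanged between iterations
            return policy, values
```

On a toy two-state problem where action R moves from state 0 to a zero-reward terminal state 1 with reward 1, this converges to the policy that picks R in state 0.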

Computing State Space

  • Both Value Iteration and Policy Iteration require us to loop through every state
  • Value Iteration: to determine the current best policy and the value of a state
  • Policy Iteration: to determine the new best policy based on the current value of a state
  • We need a way to compute every state so we can loop through them all
  • One way is to get every combination of states possible
  • In BeeBot, for a given level, each state is given by the position and orientation of the bee, and the position and orientation of the widgets
  • We can determine the list of all states by computing every combination of bee position and orientation, widget position, and widget orientation
  • However, this might include some invalid combinations (e.g., widget or bee are inside a wall)
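The brute-force enumeration above can be sketched with `itertools.product`. The level data here is hypothetical (a list of non-wall `cells`, orientation labels, and a single widget); the real BeeBot representation will differ, and the validity check shown is only one example of filtering invalid combinations.

```python
from itertools import product

def enumerate_states(cells, bee_orients, widget_orients):
    # Every combination of bee position/orientation and widget
    # position/orientation, filtered for obviously invalid states.
    states = []
    for bee_pos, bee_o, w_pos, w_o in product(cells, bee_orients,
                                              cells, widget_orients):
        if bee_pos == w_pos:  # invalid: bee and widget overlap
            continue
        states.append((bee_pos, bee_o, w_pos, w_o))
    return states
```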
  • Is there a better way we can search for possible states?

Transition Outcomes
  • The probabilistic aspect of this assignment means that by performing certain actions in certain states, there might be multiple next states that can be reached, each with a different probability and reward
  • The assignment involves creating a get_transition_outcomes(state, action) function
  • Takes a (state, action) pair and, for every possible next state, returns the probability of ending up in that state and the reward
  • Should return a list or other data structure with all the next states, probabilities, and rewards
  • This will be useful when utilising the Bellman Equation to calculate the value of a state
  • When creating the transition function, there are a few things to consider:
  • What are the random aspects of the BeeBot environment?
  • What are the possible next states to consider from a given state?
  • What are edge cases that need to be considered (e.g. moving near a wall or thorn)?
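To make the shape of such a function concrete, here is a sketch for a simple gridworld (in the spirit of Tutorial 6), not the actual BeeBot dynamics. The 0.8/0.2 slip probability, the -1 step cost, and the wall handling are all invented for illustration.

```python
MOVES = {'UP': (-1, 0), 'DOWN': (1, 0), 'LEFT': (0, -1), 'RIGHT': (0, 1)}

def get_transition_outcomes(state, action, walls, rows, cols):
    """Return [(next_state, probability, reward), ...] for a toy gridworld."""
    r, c = state
    dr, dc = MOVES[action]
    target = (r + dr, c + dc)
    # Edge case: moving into a wall or off the grid leaves us in place.
    if (target in walls or not (0 <= target[0] < rows)
            or not (0 <= target[1] < cols)):
        target = state
    # Assumed randomness: 0.8 chance the move succeeds, 0.2 chance of
    # staying put; every outcome costs -1.
    outcomes = [(target, 0.8, -1.0), (state, 0.2, -1.0)]
    # Merge duplicates so each next state appears exactly once.
    merged = {}
    for s2, p, rew in outcomes:
        prob, _ = merged.get(s2, (0.0, rew))
        merged[s2] = (prob + p, rew)
    return [(s2, p, rew) for s2, (p, rew) in merged.items()]
```

Note how the wall edge case collapses the two outcomes into one entry whose probabilities are summed, so the returned probabilities always total 1.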

Transition Outcomes

  • The transition function will usually assume a given action is valid, so we need to feed it only actions that are valid for given states to avoid any odd behaviour

  • We can cache whether actions are valid to improve runtime
  • perform_action(state, action)
  • Might help you understand how to determine the possible next states for certain states and actions
  • However, note that it only returns one possible state for a given action
  • We can cache the results of the transition function to improve runtime
  • Tutorial 6 involves creating a transition function for the simple Gridworld environment

Terminal States

  • We need to create a way to handle terminal states when calculating the values and optimal policies of states, otherwise the agent might think it can leave the terminal states
  • There are two ways we can model the terminal states to do this
  • Terminal states
  • Set the value of a terminal state to 0
  • Skip over it without updating its value if it is encountered in a loop
  • Absorbing states
  • Create a new state outside of the regular state space to send the agent to once it reaches a terminal state
  • If the player performs any action in the absorbing state, it remains in the absorbing state
  • The reward is always 0 for the absorbing state, no matter the action performed
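The absorbing-state idea can be expressed as a thin wrapper around an existing transition function. The sentinel `EXIT` value and the `is_terminal` predicate are hypothetical names, not part of the assignment's API.

```python
EXIT = 'ABSORBING'  # sentinel state outside the regular state space

def with_absorbing(get_transition_outcomes, is_terminal):
    def wrapped(state, action):
        # Any action in the absorbing state keeps us there, reward 0.
        if state == EXIT:
            return [(EXIT, 1.0, 0.0)]
        # A terminal state transitions straight into the absorbing state.
        if is_terminal(state):
            return [(EXIT, 1.0, 0.0)]
        return get_transition_outcomes(state, action)
    return wrapped
```

Because the absorbing state always yields reward 0, its value is 0 under any discount factor, so the agent cannot gain anything by "leaving" a terminal state.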

Reward Function

  • Rewards are not dependent on the resulting state, but on the state the agent is in and the action it performs
  • Reward functions considered up until this point have been R(s), based solely on the state the agent is in
  • For BeeBot, the expected reward function is R(s, a) – actions also give rewards that need to be considered, as well as any possible penalties
  • We can use get_transition_outcomes(state, action) to get the rewards:
  • We can start by initialising a matrix of all zeroes of size |S| x |A|
  • Then, loop over each (state, action) pair and initialise the total expected reward to 0
  • Loop over the outcomes from get_transition_outcomes(state, action) and add the (probability × reward) to compute the expected reward over all outcomes
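That loop might look like the following sketch, where `states`, `actions`, and the transition function are assumed inputs in the same hypothetical form as earlier examples.

```python
import numpy as np

def build_reward_matrix(states, actions, get_transition_outcomes):
    # |S| x |A| matrix of expected rewards, initialised to zero.
    R = np.zeros((len(states), len(actions)))
    for i, s in enumerate(states):
        for j, a in enumerate(actions):
            # Expected reward: sum of probability * reward over outcomes.
            R[i, j] = sum(p * r for _s2, p, r in
                          get_transition_outcomes(s, a))
    return R
```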
  • i.e. R(s, a) = expected reward = Σ_{s'} P(s' | s, a) ⋅ R(s, a, s')

Value Iteration: Updating State Values
  • How we choose to update states can affect the performance of our value iteration
  • Batch updates use the value of the next state from the previous iteration to update the value of the current state in the current iteration
  • In-place updates use the value of the next state from the current iteration, if it has already been calculated, to update the value of the current state in the current iteration
  • If the next state has not yet been calculated in the current iteration, it uses the value from the previous iteration
  • In-place updates typically converge in fewer iterations
  • The order in which states are calculated also has an effect for in-place updates (i.e. starting near the goal and working backwards may enable faster convergence)
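An in-place update is simply a sweep that overwrites each state's value as soon as it is computed, so later states in the same sweep already see the fresh values. A minimal sketch, with the same hypothetical `states`, `actions`, and `get_transition_outcomes` inputs as before:

```python
def value_iteration_in_place(states, actions, get_transition_outcomes,
                             gamma=0.9, epsilon=1e-4):
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:  # sweep order matters for in-place updates
            best = max(sum(p * (r + gamma * values[s2])
                           for s2, p, r in get_transition_outcomes(s, a))
                       for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # overwrite immediately: later states see it
        if delta < epsilon:  # converged: largest change below tolerance
            return values
```

A batch version would instead compute a whole new dictionary from the old one each sweep; only the `values[s] = best` timing differs.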

Policy Iteration: Linear Algebra

  • The numpy library is allowed for this assignment, as well as built-in Python libraries
  • We can use linear algebra to compute the value of states for the policy evaluation step of Policy Iteration: v_π = (I − γ P_π)⁻¹ r
  • I is the identity matrix of size |S| x |S|
  • P_π is a matrix containing the transition probabilities based on the current policy
  • r is a vector containing rewards for every state based on the current policy
  • numpy can be used to perform linear algebra
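A minimal sketch of this exact policy-evaluation solve, assuming `P_pi` and `r` have already been built from the transition and reward functions:

```python
import numpy as np

def evaluate_policy(P_pi, r, gamma=0.9):
    # P_pi[i, j]: probability of moving from state i to state j under
    # the current policy; r[i]: expected reward in state i.
    n = P_pi.shape[0]
    I = np.eye(n)
    # Solve (I - gamma * P_pi) v = r directly rather than inverting.
    return np.linalg.solve(I - gamma * P_pi, r)
```

Using `np.linalg.solve` is both faster and numerically safer than forming the matrix inverse explicitly.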
  • v_π = np.linalg.solve(I − γ P_π, r)

Improving Runtime
  • We need to compute the state space to calculate the value of every state
  • Calculating the value of every state can be time consuming, especially for large levels
  • We can remove unreachable states from the state space and only consider states the agent can reach
  • This can improve runtime as we are reducing the number of states to calculate the value of
  • Remember to use caching where possible
  • If you are repeatedly computing something, caching can drastically improve runtime
  • Remember to limit use of inefficient data structures
  • If you're checking whether an element is in a list (e.g. if the next state is in the states list), either modify your code to remove the need for this, or use a data structure with more efficient lookup (e.g. a set or dictionary)

"Flying is exhausting. Why don't you humans just run everywhere, isn't that faster?" - Barry B. Benson

From: https://www.cnblogs.com/WX-codinghelp/p/18420381
