How to Train your Bee
Assignment 2 Help Guide© Dreamworks, ”Bee Movie”CALCULATE AN OPTIMAL POLICY FORRecap: Bellman Equation
- Bellman Equation is used to calculate the optimal value of a state
- Equation looks complicated, but it’s just the highest expected reward from the best action:
- ? ?! ?, ?) is the probability of entering the next state ?! given we perform action ? in state ?
- ? ?, ?, ?! is the reward received for performing action ? in state ? and entering the next state ?!
- ? is the discount factor (makes immediate rewards worth more than future rewards)
- ?∗ ?! is the value of the next state ?!
- The optimal policy, ?, is the best available action that can be performed in each state
- The value of a state is given by the highest expected reward when following the optimalpolicyAssignment 2 HelpGuideRecap: Value Iteration
- Arbitrarily assign a value to each state (e.g. seteach state to 0)
- Until convergence
- Calculate the Q(s,a) values for every state for thatiteration, and determine the action that maximises Q(s,a)
for each state
- Calculate the value for every state using the optimalaction for that iteration
- Convergence occurs if the difference in statevalues between two iterations is less than somevalue ?
- Tutorial 6 involves implementing Value Iteration ithe simple Gridworld environmentAssignment 2 HelpGuid? ?, ? = ? ?, ? + ?)! ?" ?, ? ?#$%(?" )Recap: Policy Iteration
- Arbitrarily assign a policy to each state (e.g.action to be performed in every state is LEFT)
- Until convergence
- Policy Evaluation: Determine the value of every state basedon the current policy
- Policy Improvement: Determine the best action to beperformed in every state based on the values of the currentpolicy, then update the policy based on the new bestaction
- Convergence occurs if the policy between two
iterations does not changeTutorial 7 involves implementing Policy Iteration inthe simple Gridworld environment
Assignment 2 HelpGuideComputing State Space
- Both Value Iteration and Policy Iteration require us to loop through every state
- Value Iteration: to determine the current best policy and the value of a state
- Policy Iteration: to determine the new best policy based on the current value of a state
- We need a way to compute every state so we can loop through them all
- One way is to get every combination of states possible
- In BeeBot, for a given level, each state is given by the position and orientation of the bee, the position andorientation of the widgets
- We can determine the list of all states by computing every combination of bee position and orientation, widgetposition, and widget orientation
- However, this might include some invalid combinations (e.g., widget or bee are inside a wall)
- Is there a better way we can search for possible states?Assignment 2 HelpGuideTransition Outcomes
- The probabilistic aspect of this assignment means that by performing certain actions incertain states, there might be multiple next states that can be reached each with a differentprobability and reward
- The assignment involves creating a get_transition_outcomes(state, action)function
- Takes a (state, action) pair and for every possible next state returns the probability of ending up in that stateand the reward
- Should return a list or other data structure with all the next states, probabilities, and rewards
- This will be useful when utilising the Bellman Equation to calculate the value of a state
- When creating the transition function, there are a few things to consider:
- What are the random aspects of the BeeBot environment?
- What are the possible next states to consider from a given state?
- What are edge cases that need to be considered (e.g. moving near a wall or thorn)?
Assignment 2 HelpGuideTransition Outcomes
- The transition function will usually assume a given action is valid, so we need to only feed it
actions that are valid for given states to avoid any odd behaviour
- We can cache if actions are valid to improve runtime
- perform_action(state, action)
- Might help you understand how to determine the possible next states for certain states and actions
- However, note that it only returns one possible state for a given action
- We can cache the results of the transition function to improve runtime
- Tutorial 6 involves creating a transition function for the simple Gridworld environment
Assignment 2 HelpGuideTerminal States
- We need to create a way to handle terminal states when calculating the values and optimalpolicies of states, otherwise the agent might think it can leave the terminal states
- There are two ways we can model the terminal states to do this
- Terminal states
- Set the value of a terminal state to 0
- Skip over it without updating its value if it is encountered in a loop
- Absorbing states
- Create a new state outside of the regular state space to send the agent to once it reaches a terminal state
- If the player performs any action代 写How to Train your Bee in the absorbing state, it remains in the absorbing state
- The reward is always 0 for the absorbing state, no matter the action performed
Assignment 2 HelpGuideReward Function
- Rewards are not dependent on the resulting state, but on the state the agent is in and theaction it performs
- Reward functions considered up until this point have been R(s), solely based on the statethe agent is in
- For BeeBot, the expected reward function is R(s, a) – actions also give rewards that need tobe considered, as well as any possible penalties
- We can use get_transition_outcomes(state, action) to get the rewards:
- We can start by initialising a matrix of all zeroes size |S| x |A|
- Then, loop over each (state, action) pair and initialise the total expected reward to 0
- Loop over the outcomes from get_transition_outcomes(state, action) and add the (probability xeward) to compute the expected reward over all outcomes
- i.e. ? ?, ? = ???????? ?????? = ∑#! ? ?! ?, ?) ⋅ ?(?, ?, ?! )Assignment 2 HelpGuideValue Iteration: Updating State Values
- How we choose to update states can affect the performance of our value iteration
- Batch updates uses the value of the next state from the previous iteration to update thevalue of the current state in the current iteration
- In-place updates uses the value of the next state from the current iteration, if it has already
been calculated, to update the value of the current state in the current iteration
- If the next state has not yet been calculated in the current iteration, it uses the value from the previous iteration
- In-place updates typically converges in fewer iterations
- The order in which states are calculated also has an effect for in-place updates (i.e. startingnear the goal and working backwards may enable faster convergence)
Assignment 2 HelpGuidePolicy Iteration: Linear Algebra
- The numpy library is allowed for this assignment, as well as built-in python libraries
- We can use linear algebra to compute the value of states for the policy evaluation step ofPolicy Iteration
- ? is the identity matrix of size |S| x |S|
- ?$ is a matrix containing the transition probabilities based on the current policy
- ? is a vector containing rewards for every state based on the current policy
- numpy can be used to perform linear algebra
- vπ = np.linalg.solve(I − γPπ, r)Assignment 2 HelpGuideImproving Runtime
- We need to compute the state space to calculate the value of every state
- Calculating the value of every state can be time consuming, especially for large levels
- We can remove unreachable states from the state space and only consider states the agent can reach
- This can improve runtime as we are reducing the number of states to calculate the value of
- Remember to use caching where possible
- If you are repeatedly computing something, caching can drastically improve runtime
- Remember to limit use of inefficient data structures
- If you're checking whether an element is in a list (e.g. if next state is in the states list), either modify your code
to remove the need for this, or use a data structure with more efficient lookup (e.g. a set or dictionary)Assignment 2 HelpGuide“Flying is exhausting. Why don't you humans just run everywhere, isn't that faster?” - Barry B. Benson Assignment 2 HelpGuide
标签:value,next,states,How,state,Train,every,action,your From: https://www.cnblogs.com/WX-codinghelp/p/18420381