
How to Train your Bee

Posted: 2024-09-19 14:34:27


Assignment 2 Help Guide

[Image: "Calculate an optimal policy for..." (© Dreamworks, "Bee Movie")]

Recap: Bellman Equation

  • The Bellman Equation is used to calculate the optimal value of a state
  • The equation looks complicated, but it is just the highest expected reward from the best action:
    V*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ]
  • P(s' | s, a) is the probability of entering the next state s' given we perform action a in state s
  • R(s, a, s') is the reward received for performing action a in state s and entering the next state s'
  • γ is the discount factor (makes immediate rewards worth more than future rewards)
  • V*(s') is the value of the next state s'
  • The optimal policy, π, is the best available action that can be performed in each state
  • The value of a state is given by the highest expected reward when following the optimal policy

Recap: Value Iteration
  1. Arbitrarily assign a value to each state (e.g. set each state to 0)
  2. Until convergence:
  • Calculate the Q(s,a) values for every state for that iteration, and determine the action that maximises Q(s,a) for each state
  • Calculate the value for every state using the optimal action for that iteration
  • Convergence occurs if the difference in state values between two iterations is less than some value ε
  • Tutorial 6 involves implementing Value Iteration in the simple Gridworld environment

    Q(s, a) = R(s, a) + γ Σ_{s'} P(s' | s, a) V(s')

Recap: Policy Iteration
  1. Arbitrarily assign a policy to each state (e.g. the action to be performed in every state is LEFT)
  2. Until convergence:
  • Policy Evaluation: Determine the value of every state based on the current policy
  • Policy Improvement: Determine the best action to be performed in every state based on the values of the current policy, then update the policy based on the new best action
  • Convergence occurs if the policy does not change between two iterations
  • Tutorial 7 involves implementing Policy Iteration in the simple Gridworld environment
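The loop described above can be sketched as follows. This is a minimal illustration, not the assignment's required interface: `states`, `actions`, and `get_transition_outcomes(state, action)` (returning `(next_state, probability, reward)` tuples) are assumed inputs, and the discount factor and evaluation tolerance are placeholder values.

```python
def policy_iteration(states, actions, get_transition_outcomes,
                     gamma=0.9, epsilon=1e-4):
    # 1. Arbitrary initial policy: first action in every state.
    policy = {s: actions[0] for s in states}
    values = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate values under the fixed policy.
        while True:
            delta = 0.0
            for s in states:
                new_v = sum(p * (r + gamma * values[s2])
                            for s2, p, r in get_transition_outcomes(s, policy[s]))
                delta = max(delta, abs(new_v - values[s]))
                values[s] = new_v
            if delta < epsilon:
                break
        # Policy improvement: greedy action under the current values.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: sum(
                p * (r + gamma * values[s2])
                for s2, p, r in get_transition_outcomes(s, a)))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:  # convergence: policy unchanged between iterations
            return policy, values
```

On a toy two-state problem where action R moves from state 0 to a zero-reward terminal state 1 with reward 1, this converges to the policy that picks R in state 0.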

Computing State Space

  • Both Value Iteration and Policy Iteration require us to loop through every state
  • Value Iteration: to determine the current best policy and the value of a state
  • Policy Iteration: to determine the new best policy based on the current value of a state
  • We need a way to compute every state so we can loop through them all
  • One way is to get every combination of states possible
  • In BeeBot, for a given level, each state is given by the position and orientation of the bee, and the position and orientation of the widgets
  • We can determine the list of all states by computing every combination of bee position and orientation, widget position, and widget orientation
  • However, this might include some invalid combinations (e.g., widget or bee are inside a wall)
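The brute-force enumeration above can be sketched with `itertools.product`. The level data here is hypothetical (a list of non-wall `cells`, orientation labels, and a single widget); the real BeeBot representation will differ, and the validity check shown is only one example of filtering invalid combinations.

```python
from itertools import product

def enumerate_states(cells, bee_orients, widget_orients):
    # Every combination of bee position/orientation and widget
    # position/orientation, filtered for obviously invalid states.
    states = []
    for bee_pos, bee_o, w_pos, w_o in product(cells, bee_orients,
                                              cells, widget_orients):
        if bee_pos == w_pos:  # invalid: bee and widget overlap
            continue
        states.append((bee_pos, bee_o, w_pos, w_o))
    return states
```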
  • Is there a better way we can search for possible states?

Transition Outcomes
  • The probabilistic aspect of this assignment means that by performing certain actions in certain states, there might be multiple next states that can be reached, each with a different probability and reward
  • The assignment involves creating a get_transition_outcomes(state, action) function
  • Takes a (state, action) pair and, for every possible next state, returns the probability of ending up in that state and the reward
  • Should return a list or other data structure with all the next states, probabilities, and rewards
  • This will be useful when utilising the Bellman Equation to calculate the value of a state
  • When creating the transition function, there are a few things to consider:
  • What are the random aspects of the BeeBot environment?
  • What are the possible next states to consider from a given state?
  • What are edge cases that need to be considered (e.g. moving near a wall or thorn)?
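To make the shape of such a function concrete, here is a sketch for a simple gridworld (in the spirit of Tutorial 6), not the actual BeeBot dynamics. The 0.8/0.2 slip probability, the -1 step cost, and the wall handling are all invented for illustration.

```python
MOVES = {'UP': (-1, 0), 'DOWN': (1, 0), 'LEFT': (0, -1), 'RIGHT': (0, 1)}

def get_transition_outcomes(state, action, walls, rows, cols):
    """Return [(next_state, probability, reward), ...] for a toy gridworld."""
    r, c = state
    dr, dc = MOVES[action]
    target = (r + dr, c + dc)
    # Edge case: moving into a wall or off the grid leaves us in place.
    if (target in walls or not (0 <= target[0] < rows)
            or not (0 <= target[1] < cols)):
        target = state
    # Assumed randomness: 0.8 chance the move succeeds, 0.2 chance of
    # staying put; every outcome costs -1.
    outcomes = [(target, 0.8, -1.0), (state, 0.2, -1.0)]
    # Merge duplicates so each next state appears exactly once.
    merged = {}
    for s2, p, rew in outcomes:
        prob, _ = merged.get(s2, (0.0, rew))
        merged[s2] = (prob + p, rew)
    return [(s2, p, rew) for s2, (p, rew) in merged.items()]
```

Note how the wall edge case collapses the two outcomes into one entry whose probabilities are summed, so the returned probabilities always total 1.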

Transition Outcomes

  • The transition function will usually assume a given action is valid, so we need to feed it only actions that are valid for given states to avoid any odd behaviour

  • We can cache whether actions are valid to improve runtime
  • perform_action(state, action)
  • Might help you understand how to determine the possible next states for certain states and actions
  • However, note that it only returns one possible state for a given action
  • We can cache the results of the transition function to improve runtime
  • Tutorial 6 involves creating a transition function for the simple Gridworld environment

Terminal States

  • We need to create a way to handle terminal states when calculating the values and optimal policies of states, otherwise the agent might think it can leave the terminal states
  • There are two ways we can model the terminal states to do this
  • Terminal states
  • Set the value of a terminal state to 0
  • Skip over it without updating its value if it is encountered in a loop
  • Absorbing states
  • Create a new state outside of the regular state space to send the agent to once it reaches a terminal state
  • If the player performs any action in the absorbing state, it remains in the absorbing state
  • The reward is always 0 for the absorbing state, no matter the action performed
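The absorbing-state idea can be expressed as a thin wrapper around an existing transition function. The sentinel `EXIT` value and the `is_terminal` predicate are hypothetical names, not part of the assignment's API.

```python
EXIT = 'ABSORBING'  # sentinel state outside the regular state space

def with_absorbing(get_transition_outcomes, is_terminal):
    def wrapped(state, action):
        # Any action in the absorbing state keeps us there, reward 0.
        if state == EXIT:
            return [(EXIT, 1.0, 0.0)]
        # A terminal state transitions straight into the absorbing state.
        if is_terminal(state):
            return [(EXIT, 1.0, 0.0)]
        return get_transition_outcomes(state, action)
    return wrapped
```

Because the absorbing state always yields reward 0, its value is 0 under any discount factor, so the agent cannot gain anything by "leaving" a terminal state.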

Reward Function

  • Rewards are not dependent on the resulting state, but on the state the agent is in and the action it performs
  • Reward functions considered up until this point have been R(s), based solely on the state the agent is in
  • For BeeBot, the expected reward function is R(s, a) – actions also give rewards that need to be considered, as well as any possible penalties
  • We can use get_transition_outcomes(state, action) to get the rewards:
  • We can start by initialising a matrix of all zeroes of size |S| x |A|
  • Then, loop over each (state, action) pair and initialise the total expected reward to 0
  • Loop over the outcomes from get_transition_outcomes(state, action) and add the (probability × reward) to compute the expected reward over all outcomes
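That loop might look like the following sketch, where `states`, `actions`, and the transition function are assumed inputs in the same hypothetical form as earlier examples.

```python
import numpy as np

def build_reward_matrix(states, actions, get_transition_outcomes):
    # |S| x |A| matrix of expected rewards, initialised to zero.
    R = np.zeros((len(states), len(actions)))
    for i, s in enumerate(states):
        for j, a in enumerate(actions):
            # Expected reward: sum of probability * reward over outcomes.
            R[i, j] = sum(p * r for _s2, p, r in
                          get_transition_outcomes(s, a))
    return R
```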
  • i.e. R(s, a) = expected reward = Σ_{s'} P(s' | s, a) ⋅ R(s, a, s')

Value Iteration: Updating State Values
  • How we choose to update states can affect the performance of our value iteration
  • Batch updates use the value of the next state from the previous iteration to update the value of the current state in the current iteration
  • In-place updates use the value of the next state from the current iteration, if it has already been calculated, to update the value of the current state in the current iteration
  • If the next state has not yet been calculated in the current iteration, it uses the value from the previous iteration
  • In-place updates typically converge in fewer iterations
  • The order in which states are calculated also has an effect for in-place updates (i.e. starting near the goal and working backwards may enable faster convergence)
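An in-place update is simply a sweep that overwrites each state's value as soon as it is computed, so later states in the same sweep already see the fresh values. A minimal sketch, with the same hypothetical `states`, `actions`, and `get_transition_outcomes` inputs as before:

```python
def value_iteration_in_place(states, actions, get_transition_outcomes,
                             gamma=0.9, epsilon=1e-4):
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:  # sweep order matters for in-place updates
            best = max(sum(p * (r + gamma * values[s2])
                           for s2, p, r in get_transition_outcomes(s, a))
                       for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # overwrite immediately: later states see it
        if delta < epsilon:  # converged: largest change below tolerance
            return values
```

A batch version would instead compute a whole new dictionary from the old one each sweep; only the `values[s] = best` timing differs.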

Policy Iteration: Linear Algebra

  • The numpy library is allowed for this assignment, as well as built-in Python libraries
  • We can use linear algebra to compute the value of states for the policy evaluation step of Policy Iteration: v_π = (I − γ P_π)⁻¹ r
  • I is the identity matrix of size |S| x |S|
  • P_π is a matrix containing the transition probabilities based on the current policy
  • r is a vector containing rewards for every state based on the current policy
  • numpy can be used to perform linear algebra
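A minimal sketch of this exact policy-evaluation solve, assuming `P_pi` and `r` have already been built from the transition and reward functions:

```python
import numpy as np

def evaluate_policy(P_pi, r, gamma=0.9):
    # P_pi[i, j]: probability of moving from state i to state j under
    # the current policy; r[i]: expected reward in state i.
    n = P_pi.shape[0]
    I = np.eye(n)
    # Solve (I - gamma * P_pi) v = r directly rather than inverting.
    return np.linalg.solve(I - gamma * P_pi, r)
```

Using `np.linalg.solve` is both faster and numerically safer than forming the matrix inverse explicitly.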
  • v_π = np.linalg.solve(I − γ P_π, r)

Improving Runtime
  • We need to compute the state space to calculate the value of every state
  • Calculating the value of every state can be time consuming, especially for large levels
  • We can remove unreachable states from the state space and only consider states the agent can reach
  • This can improve runtime as we are reducing the number of states to calculate the value of
  • Remember to use caching where possible
  • If you are repeatedly computing something, caching can drastically improve runtime
  • Remember to limit use of inefficient data structures
  • If you're checking whether an element is in a list (e.g. if the next state is in the states list), either modify your code to remove the need for this, or use a data structure with more efficient lookup (e.g. a set or dictionary)

"Flying is exhausting. Why don't you humans just run everywhere, isn't that faster?" - Barry B. Benson

From: https://www.cnblogs.com/WX-codinghelp/p/18420381
