TD learning of state values
The data/experience required by the algorithm:
- \(\left(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots\right)\) or \(\left\{\left(s_t, r_{t+1}, s_{t+1}\right)\right\}_t\) generated following the given policy \(\pi\).
The TD learning algorithm is
\[\begin{aligned} & v_{t+1}\left(s_t\right)=v_t\left(s_t\right)-\alpha_t\left(s_t\right)\left[v_t\left(s_t\right)-\left[r_{t+1}+\gamma v_t\left(s_{t+1}\right)\right]\right], \\ & v_{t+1}(s)=v_t(s), \quad \forall s \neq s_t \end{aligned} \]where \(t=0,1,2,\ldots\). Here, \(v_t\left(s_t\right)\) is the estimate of \(v_\pi\left(s_t\right)\), the state value of \(s_t\) under policy \(\pi\); \(\alpha_t\left(s_t\right)\) is the learning rate for \(s_t\) at time \(t\).
\(\mathcal{S}\): the state space. At each time step, only the value of the visited state \(s_t\) is updated; the values of all other states \(s \neq s_t\) are left unchanged.
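To make the update rule concrete, below is a minimal sketch of tabular TD(0) in Python. The environment interface (`env.reset()`, `env.step(a)`, `env.num_states`), the `policy` function, and the hyperparameter values are illustrative assumptions, not part of the original text.

```python
import numpy as np

def td0_state_values(env, policy, num_episodes=500, alpha=0.1, gamma=0.9):
    """Tabular TD(0): estimate v_pi from experience generated by `policy`.

    Assumed interface (hypothetical): env.reset() -> s, env.step(a) -> (s_next, r, done),
    with states indexed 0..env.num_states-1, and policy(s) sampling an action.
    """
    v = np.zeros(env.num_states)              # initial estimate v_0(s) = 0 for all s
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                      # follow the given policy pi
            s_next, r, done = env.step(a)
            # TD target: r_{t+1} + gamma * v_t(s_{t+1});
            # a terminal state contributes no bootstrapped value.
            target = r + (0.0 if done else gamma * v[s_next])
            # v_{t+1}(s_t) = v_t(s_t) - alpha_t(s_t) * [v_t(s_t) - target];
            # all other entries of v are left unchanged.
            v[s] -= alpha * (v[s] - target)
            s = s_next
    return v
```

Here a fixed learning rate `alpha` stands in for \(\alpha_t(s_t)\); in practice it may be decayed per state over time.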
From: https://www.cnblogs.com/tuyuge/p/17626795.html