When writing a custom deep reinforcement learning environment, you sometimes need an agent with a multi-dimensional action space.
For example, suppose the environment is a brick-breaking (Breakout-style) game. The agent has to output a probability distribution over [left, right, stay], so its action space has a single dimension, e.g. [0.2, 0.4, 0.4].
Now suppose a single agent has to control two paddles to break the bricks. Its action output becomes multi-dimensional, with one probability distribution per paddle: [[0.2, 0.4, 0.4], [0.2, 0.4, 0.4]].
How does the agent's action probability distribution differ in this case, and what problems come up during the update?
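To make the shapes concrete, here is a minimal sketch using PyTorch's Categorical distribution (the probability values are just the illustrative numbers above):

import torch
from torch.distributions import Categorical

# Single paddle: one distribution over [left, right, stay]
probs_1d = torch.tensor([0.2, 0.4, 0.4])
dist_1d = Categorical(probs=probs_1d)
print(dist_1d.sample())       # a single action index, e.g. tensor(1)

# One agent controlling two paddles: one distribution per paddle
probs_2d = torch.tensor([[0.2, 0.4, 0.4],
                         [0.2, 0.4, 0.4]])
dist_2d = Categorical(probs=probs_2d)
print(dist_2d.batch_shape)    # torch.Size([2]): two independent categoricals
print(dist_2d.sample())       # shape [2]: one action index per paddle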
When updating with the PPO algorithm, the agent first interacts with the environment to accumulate experience, and then learns from it.
During the update, the advantage function is computed from the stored experience first; the agent is then trained on randomly sampled mini-batches. The code looks like this:
# Assumed module-level imports:
#   import torch
#   import torch.nn.functional as F
#   from torch.distributions import Categorical
#   from torch.utils.data.sampler import BatchSampler, SubsetRandomSampler
def update(self, replay_buffer, total_steps):
    s, a, a_logprob, r, s_, dw, done = replay_buffer.numpy_to_tensor()  # load the stored experience
    adv = []
    gae = 0
    with torch.no_grad():  # adv and v_target have no gradient
        vs = self.critic(s)
        vs_ = self.critic(s_)
        deltas = r + self.gamma * (1.0 - dw) * vs_ - vs
        for delta, d in zip(reversed(deltas.flatten().numpy()), reversed(done.flatten().numpy())):
            gae = delta + self.gamma * self.lamda * gae * (1.0 - d)
            adv.insert(0, gae)
        adv = torch.tensor(adv, dtype=torch.float).view(-1, 1)
        v_target = adv + vs
        if self.use_adv_norm:  # Trick 1: advantage normalization
            adv = (adv - adv.mean()) / (adv.std() + 1e-5)

    # Optimize policy for K epochs:
    for _ in range(self.K_epochs):
        # Random sampling without repetition. 'False' means the last mini-batch is still used
        # even if it contains fewer than mini_batch_size samples.
        for index in BatchSampler(SubsetRandomSampler(range(self.batch_size)), self.mini_batch_size, False):
            dist_now = Categorical(probs=self.actor(s[index]))
            dist_entropy = dist_now.entropy().view(-1, 1)  # shape: (mini_batch_size, 1)
            a_logprob_now = dist_now.log_prob(a[index].squeeze()).view(-1, 1)  # shape: (mini_batch_size, 1)
            # With a multi-dimensional action space, a_logprob_now here becomes a [128, 1] tensor,
            # while the stored a_logprob[index] is a [64, 1] tensor.
            ratios = torch.exp(a_logprob_now - a_logprob[index])
            surr1 = ratios * adv[index]  # only the gradient of 'a_logprob_now' flows through ratios
            surr2 = torch.clamp(ratios, 1 - self.epsilon, 1 + self.epsilon) * adv[index]
            actor_loss = -torch.min(surr1, surr2) - self.entropy_coef * dist_entropy  # shape: (mini_batch_size, 1)
            # Update actor
            self.optimizer_actor.zero_grad()
            actor_loss.mean().backward()
            if self.use_grad_clip:  # Trick 7: gradient clip
                torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
            self.optimizer_actor.step()

            v_s = self.critic(s[index])
            critic_loss = F.mse_loss(v_target[index], v_s)
            # Update critic
            self.optimizer_critic.zero_grad()
            critic_loss.backward()
            if self.use_grad_clip:  # Trick 7: gradient clip
                torch.nn.utils.clip_grad_norm_(self.critic.parameters(), 0.5)
            self.optimizer_critic.step()

    if self.use_lr_decay:  # Trick 6: learning rate decay
        self.lr_decay(total_steps)
With a multi-dimensional action space, the problem appears in the update step where the probability ratio is computed: the sampled states are fed into the policy network to produce the new policy, while the stored log-probabilities serve as the old policy.
At this point the old-policy log-probabilities have shape [batch_size, 1] (they were saved when the experience was collected, so the sampled slice has exactly this size), whereas the newly computed log-probabilities have shape [batch_size*2, 1], because .view(-1, 1) flattens the two action dimensions into the batch dimension. This produces the following error:
RuntimeError: The size of tensor a (batch_size*2) must match the size of tensor b (batch_size) at non-singleton dimension 0
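A minimal repro of the mismatch, assuming mini_batch_size = 64 and two action dimensions (the sizes are only illustrative):

import torch
from torch.distributions import Categorical

mini_batch_size, action_dims, num_actions = 64, 2, 3
probs = torch.softmax(torch.randn(mini_batch_size, action_dims, num_actions), dim=-1)  # actor output
actions = torch.randint(num_actions, (mini_batch_size, action_dims))                   # sampled actions

dist_now = Categorical(probs=probs)
a_logprob_now = dist_now.log_prob(actions).view(-1, 1)  # shape [128, 1]: action_dims is flattened into the batch
a_logprob_old = torch.zeros(mini_batch_size, 1)         # stored old-policy log-probs, shape [64, 1]
ratios = torch.exp(a_logprob_now - a_logprob_old)       # RuntimeError: size 128 vs 64 at dimension 0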
In short, with a multi-dimensional action space the old and new policy log-probabilities end up with mismatched sizes, and simply forcing the new ones down to [batch_size, 1] would scramble the data.
To keep the multi-dimensional structure of the action space while keeping the old and new policies the same size, we need a joint distribution.
By representing the multi-dimensional action space as a joint distribution, i.e. treating the per-dimension actions as a single point of one joint distribution, the problem above can be solved.
PyTorch already ships such a wrapper: torch.distributions.Independent.
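A minimal sketch of how Independent joins the per-dimension distributions (again with illustrative sizes); its log_prob returns one value per sample, equal to the sum of the per-dimension log-probabilities:

import torch
from torch.distributions import Categorical, Independent

probs = torch.softmax(torch.randn(64, 2, 3), dim=-1)    # [mini_batch, action_dims, num_actions]
actions = torch.randint(3, (64, 2))

base = Categorical(probs=probs)                         # batch_shape [64, 2]
joint = Independent(base, reinterpreted_batch_ndims=1)  # batch_shape [64], event_shape [2]

logp_joint = joint.log_prob(actions)                    # shape [64]: one joint log-prob per sample
logp_sum = base.log_prob(actions).sum(dim=1)            # identical: per-dimension log-probs summed
print(torch.allclose(logp_joint, logp_sum))             # True
print(joint.entropy().shape)                            # torch.Size([64])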
Concretely, the update procedure above is modified as follows:
# Assumed additional import: from torch.distributions import Independent
def update(self, replay_buffer, total_steps):
    s, a, a_logprob, r, s_, dw, done = replay_buffer.numpy_to_tensor()  # Get training data
    # The buffer stores one log-probability per action dimension; sum them to get the
    # joint (old-policy) log-probability, shape: (batch_size, 1)
    a_logprob = torch.sum(a_logprob, dim=1, keepdim=True)
    # Calculate advantages using GAE
    adv = []
    gae = 0
    with torch.no_grad():
        vs = self.critic(s)
        vs_ = self.critic(s_)
        deltas = r + self.gamma * (1.0 - dw) * vs_ - vs
        for delta, d in zip(reversed(deltas.flatten().numpy()), reversed(done.flatten().numpy())):
            gae = delta + self.gamma * self.lamda * gae * (1.0 - d)
            adv.insert(0, gae)
        adv = torch.tensor(adv, dtype=torch.float).view(-1, 1)
        v_target = adv + vs
        if self.use_adv_norm:  # Trick 1: advantage normalization
            adv = (adv - adv.mean()) / (adv.std() + 1e-5)

    # Optimize policy for K epochs:
    for _ in range(self.K_epochs):
        for index in BatchSampler(SubsetRandomSampler(range(self.batch_size)), self.mini_batch_size, False):
            action_mean_now = self.actor(s[index])  # action probabilities, shape: (mini_batch_size, action_dims, num_actions)
            dist_now = Categorical(probs=action_mean_now)
            independent_dist = Independent(dist_now, reinterpreted_batch_ndims=1)  # join the per-dimension action distributions
            entropy = independent_dist.entropy().mean()  # entropy of the joint distribution, averaged over the mini-batch
            a_logprob_now = independent_dist.log_prob(a[index].squeeze()).view(-1, 1)  # shape: (mini_batch_size, 1)
            # Old-policy joint log-probs for this mini-batch, shape: (mini_batch_size, 1).
            # Index into the stored tensor without overwriting it, so later mini-batches still see the full buffer.
            a_logprob_old = a_logprob[index].view(-1, 1)
            ratios = torch.exp(a_logprob_now - a_logprob_old)
            surr1 = ratios * adv[index]
            surr2 = torch.clamp(ratios, 1 - self.epsilon, 1 + self.epsilon) * adv[index]
            actor_loss = -torch.min(surr1, surr2) - self.entropy_coef * entropy
            self.optimizer_actor.zero_grad()
            actor_loss.mean().backward()
            if self.use_grad_clip:  # Trick 7: gradient clip
                torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
            self.optimizer_actor.step()

            v_s = self.critic(s[index])
            critic_loss = F.mse_loss(v_target[index], v_s)
            self.optimizer_critic.zero_grad()
            critic_loss.backward()
            if self.use_grad_clip:  # Trick 7: gradient clip
                torch.nn.utils.clip_grad_norm_(self.critic.parameters(), 0.5)
            self.optimizer_critic.step()

    if self.use_lr_decay:  # Trick 6: learning rate decay
        self.lr_decay(total_steps)
These changes resolve the size mismatch between the old and new action log-probabilities when updating an agent with a multi-dimensional action space.
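Note that this assumes the replay buffer stores one log-probability per action dimension, so the torch.sum over a_logprob in update() recovers the joint log-probability. A minimal sketch of what the rollout side might look like (choose_action is a hypothetical helper, not code from this post):

import torch
from torch.distributions import Categorical

def choose_action(actor, s):
    # Hypothetical rollout helper: sample one action per dimension and return the
    # per-dimension log-probs, which update() later sums into a joint log-prob.
    with torch.no_grad():
        probs = actor(s.unsqueeze(0))      # [1, action_dims, num_actions]
        dist = Categorical(probs=probs)
        a = dist.sample()                  # [1, action_dims]
        a_logprob = dist.log_prob(a)       # [1, action_dims]
    return a.squeeze(0).numpy(), a_logprob.squeeze(0).numpy()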