DQN (Deep Q-Network) is a reinforcement learning method that extends Q-Learning.
By introducing a deep neural network, experience replay, and a target network, it lets the Q-Learning idea work in high-dimensional or continuous state spaces, where the tabular Q-Learning approach breaks down.
For Q-Learning itself, see the earlier article.
Key points of the algorithm:
1. Estimating the state-action value function with a deep network: DQN follows the Q-Learning idea of learning a function Q(s, a), the expected return obtained by taking action a in state s, but approximates that function with a neural network instead of a table.
2. Experience replay: to break the correlation between consecutive samples and improve sample efficiency, DQN maintains a replay buffer. At every time step the agent stores its transition (s, a, r, s') in the buffer, and each network update trains on a small mini-batch drawn uniformly at random from it (a minimal replay-buffer sketch follows this list).
3. Target network: DQN uses two networks: an online network that selects actions and is updated every step, and a target network that computes the target Q-value. The target network's parameters are only copied from the online network every fixed interval, which stabilizes training. The target value is y = r + γ · max_a' Q_target(s', a'), where r is the reward, γ the discount factor, and Q_target the target network (a sketch of this computation also follows the list).
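As a minimal illustration of point 2 (a sketch only, using a hypothetical ReplayBuffer class that is not part of the full code below), storing transitions and sampling a random mini-batch can look like this:

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores (s, a, r, s') transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old transitions are dropped automatically

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation of consecutive steps
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)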
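And a minimal sketch of point 3, the target computation y = r + γ · max_a' Q_target(s', a'), assuming a PyTorch target_q_net and batch tensors r (rewards) and s1 (next states) shaped as in the full code below:

import torch

def td_targets(target_q_net, r, s1, gamma=0.9):
    # no gradient flows through the target network
    with torch.no_grad():
        max_next_q = target_q_net(s1).max(dim=1, keepdim=True)[0]
    return r + gamma * max_next_q

# periodically sync the target network with the online network:
# target_q_net.load_state_dict(q_net.state_dict())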
The code is as follows:
import gym
import random
import warnings
import torch
import torch.nn as nn
import torch.optim as optim

warnings.filterwarnings("ignore")


class Net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        self.linear3 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.linear1(x))
        x = torch.relu(self.linear2(x))
        x = self.linear3(x)
        return x


if __name__ == '__main__':
    negative_reward = -10.0
    positive_reward = 10.0
    x_bound = 1.0
    gamma = 0.9
    batch_size = 64
    capacity = 1000
    buffer = []

    # note: this targets the classic Gym API (env.reset() returns the state,
    # env.step() returns 4 values)
    env = gym.make('CartPole-v1')
    state_space_num = env.observation_space.shape[0]
    action_space_dim = env.action_space.n

    q_net = Net(state_space_num, 256, action_space_dim)         # online network
    target_q_net = Net(state_space_num, 256, action_space_dim)  # target network
    optimizer = optim.Adam(q_net.parameters(), lr=5e-4)
    loss_fn = nn.MSELoss()

    for i in range(3000):
        state = env.reset()
        step = 0
        while True:
            # env.render()
            step += 1

            # epsilon-greedy action selection; epsilon decays with the episode index
            epsi = 1.0 / (i + 1)
            if random.random() < epsi:
                action = random.randrange(action_space_dim)
            else:
                state_tensor = torch.tensor(state, dtype=torch.float).view(1, -1)
                action = torch.argmax(q_net(state_tensor)).item()

            next_state, reward, done, _ = env.step(action)

            # reward shaping: penalize cart-position and pole-angle deviations
            x, x_dot, theta, theta_dot = state
            if abs(x) > x_bound:
                r_pos = 0.5 * negative_reward
            else:
                r_pos = negative_reward * abs(x) / x_bound + 0.5 * (-negative_reward)
            if abs(theta) > env.theta_threshold_radians:
                r_ang = 0.5 * negative_reward
            else:
                r_ang = negative_reward * abs(theta) / env.theta_threshold_radians + 0.5 * (-negative_reward)
            reward = r_pos + r_ang
            if done and (step < 499):
                reward += negative_reward  # extra penalty for failing early

            # store the transition in the replay buffer (FIFO when full)
            if len(buffer) == capacity:
                buffer.pop(0)
            buffer.append((state, action, reward, next_state))
            state = next_state

            if len(buffer) < batch_size:
                continue

            # sample a random mini-batch and build tensors
            samples = random.sample(buffer, batch_size)
            s0, a0, r0, s1 = zip(*samples)
            s0 = torch.tensor(s0, dtype=torch.float)
            a0 = torch.tensor(a0, dtype=torch.long).view(batch_size, 1)
            r0 = torch.tensor(r0, dtype=torch.float).view(batch_size, 1)
            s1 = torch.tensor(s1, dtype=torch.float)

            # Q(s, a) from the online network; target y = r + gamma * max_a' Q_target(s', a')
            q_value = q_net(s0).gather(1, a0)
            q_target = r0 + gamma * torch.max(target_q_net(s1).detach(), dim=1)[0].view(batch_size, -1)

            loss = loss_fn(q_value, q_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # copy the online weights into the target network (on every step of every 10th episode)
            if i % 10 == 0:
                target_q_net.load_state_dict(q_net.state_dict())

            if done:
                print(i, step)
                break

    env.close()
After roughly 100-odd episodes, the agent usually holds the pole for the full 500 steps.
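As a quick sanity check (a sketch only, assuming the trained q_net from above and the same classic Gym API), you can roll out the greedy policy once and count the steps:

# greedy rollout of the trained online network (no exploration)
eval_env = gym.make('CartPole-v1')
state = eval_env.reset()
steps, done = 0, False
while not done:
    with torch.no_grad():
        q_values = q_net(torch.tensor(state, dtype=torch.float).view(1, -1))
    action = torch.argmax(q_values).item()
    state, _, done, _ = eval_env.step(action)
    steps += 1
eval_env.close()
print('episode length:', steps)  # close to 500 once training has converged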