Reinforcement Learning (RL) is a machine learning approach in which an agent learns a policy by interacting with an environment. Its key components are the state, the action, the reward, the policy, and the value function. The mainstream algorithm families include **value-based methods (Q-Learning, SARSA), policy gradient methods (REINFORCE), and Actor-Critic methods (A2C, PPO)**. This article walks through these frameworks and their principles, with corresponding Python example code.
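To make the roles of state, action, and reward concrete, here is a minimal sketch of the generic agent-environment interaction loop, using a random policy on gym's `CartPole-v1`. It is only an illustration of the interface (not of any particular algorithm) and assumes the legacy gym (< 0.26) API that the examples in this article use:

```python
import gym

env = gym.make('CartPole-v1')
state = env.reset()                               # initial state
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()            # a (random) policy maps state -> action
    state, reward, done, _ = env.step(action)     # environment returns next state and reward
    total_reward += reward                        # accumulate the episode return
print("Episode return:", total_reward)
```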
I. Value-Based Methods
Value-based methods select actions by estimating the value of each state or of each state-action pair.
1. Q-Learning
Q-Learning is a value-based reinforcement learning algorithm for finding an optimal policy. It learns the value of state-action pairs by iteratively updating a table of Q-values.
Core update rule:
$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)
$$
where:
- $Q(s, a)$ is the current Q-value of the state-action pair $(s, a)$
- $\alpha$ is the learning rate
- $r$ is the immediate reward
- $\gamma$ is the discount factor
- $s'$ is the next state, and $\max_{a'} Q(s', a')$ is the value of the best action available there
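As a quick numeric illustration (the numbers are made up for this example), suppose $\alpha = 0.1$, $\gamma = 0.99$, the agent receives $r = 1$, the current estimate is $Q(s, a) = 0.5$, and the best next-state value is $\max_{a'} Q(s', a') = 0.8$:

$$
Q(s, a) \leftarrow 0.5 + 0.1 \left( 1 + 0.99 \times 0.8 - 0.5 \right) = 0.6292
$$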
Example code:
```python
import numpy as np
import gym

# Create the environment (legacy gym < 0.26 API: reset() returns the state,
# step() returns (next_state, reward, done, info))
env = gym.make('FrozenLake-v1', is_slippery=False)
num_states = env.observation_space.n
num_actions = env.action_space.n

# Initialize the Q-table
Q = np.zeros((num_states, num_actions))

# Hyperparameters
alpha = 0.1    # learning rate
gamma = 0.99   # discount factor
epsilon = 0.1  # exploration rate
num_episodes = 1000

# Q-Learning algorithm
for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()   # explore
        else:
            action = np.argmax(Q[state, :])      # exploit
        # Take the action
        next_state, reward, done, _ = env.step(action)
        # Update the Q-value
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state, :]) - Q[state, action])
        state = next_state

print("Q-table:")
print(Q)
```
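Once training finishes, the learned Q-table can be turned into a deterministic greedy policy. The following is a minimal evaluation sketch (not part of the original example); it assumes the `env` and `Q` objects defined above and the same legacy gym API:

```python
# Greedy policy: in each state, pick the action with the highest Q-value
policy = np.argmax(Q, axis=1)
print("Greedy policy (one action index per state):")
print(policy)

# Roll out one episode following the greedy policy
state = env.reset()
done = False
total_reward = 0
while not done:
    state, reward, done, _ = env.step(policy[state])
    total_reward += reward
print("Greedy rollout return:", total_reward)
```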
II. Policy Gradient Methods
Policy gradient methods optimize the policy directly, without maintaining Q-values: the policy parameters are updated so as to maximize the expected return.
2. REINFORCE
REINFORCE is a Monte Carlo policy gradient algorithm.
Core update rule:
$$
\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a|s) \, G_t
$$
where:
- $\pi_\theta(a|s)$ is the parameterized policy
- $G_t = \sum_{k=0}^{T-t-1} \gamma^{k} r_{t+k+1}$ is the discounted return accumulated from time step $t$ onward
Example code:
```python
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Define the policy network
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PolicyNetwork, self).__init__()
        self.fc = nn.Linear(state_dim, action_dim)

    def forward(self, x):
        return torch.softmax(self.fc(x), dim=-1)

# Create the environment (legacy gym < 0.26 API)
env = gym.make('CartPole-v1')
policy_net = PolicyNetwork(state_dim=env.observation_space.shape[0], action_dim=env.action_space.n)
optimizer = optim.Adam(policy_net.parameters(), lr=0.01)

# REINFORCE algorithm
def reinforce(num_episodes=1000):
    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        rewards = []
        done = False
        while not done:
            state = torch.tensor(state, dtype=torch.float32)
            action_probs = policy_net(state)
            # Sample an action from the current policy
            action = np.random.choice(len(action_probs.detach().numpy()), p=action_probs.detach().numpy())
            log_prob = torch.log(action_probs[action])
            log_probs.append(log_prob)
            state, reward, done, _ = env.step(action)
            rewards.append(reward)
        # Compute the discounted returns G_t
        G = 0
        returns = []
        for r in reversed(rewards):
            G = r + 0.99 * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        # Normalize the returns to reduce gradient variance
        returns = (returns - returns.mean()) / (returns.std() + 1e-5)
        # Policy gradient update
        loss = 0
        for log_prob, G in zip(log_probs, returns):
            loss -= log_prob * G
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("Training finished")

reinforce()
```
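After training, the policy can be evaluated by acting greedily (always taking the most probable action) instead of sampling. The following is a minimal evaluation sketch (not part of the original code); it reuses `env` and `policy_net` from above and the same legacy gym API:

```python
def evaluate(policy_net, env, num_episodes=10):
    total = 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            with torch.no_grad():
                probs = policy_net(torch.tensor(state, dtype=torch.float32))
            action = int(torch.argmax(probs))            # greedy action
            state, reward, done, _ = env.step(action)
            total += reward
    print("Average return over", num_episodes, "episodes:", total / num_episodes)

evaluate(policy_net, env)
```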
III. Actor-Critic Methods
Actor-Critic methods combine the strengths of value-based and policy gradient methods and typically converge faster.
3. Advantage Actor-Critic (A2C)
A2C is a synchronous Actor-Critic algorithm with an advantage estimate: it learns a state-value function (the critic) and a policy (the actor) at the same time.
Core update rules. The TD error $\delta_t$ serves as a one-step estimate of the advantage of the chosen action:
$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$
Actor update:
$$
\theta \leftarrow \theta + \alpha \delta_t \nabla_\theta \log \pi_\theta(a_t|s_t)
$$
Critic update:
$$
w \leftarrow w + \alpha \delta_t \nabla_w V(s_t)
$$
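As a quick numeric illustration (the numbers are made up for this example), suppose $r_t = 1$, $\gamma = 0.99$, $V(s_t) = 0.5$, and $V(s_{t+1}) = 0.6$:

$$
\delta_t = 1 + 0.99 \times 0.6 - 0.5 = 1.094
$$

A positive $\delta_t$ means the action did better than the critic expected, so the actor update raises its probability and the critic update pushes $V(s_t)$ upward.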
Example code:
```python
import torch.nn.functional as F

# Shared network with an actor head (action probabilities) and a critic head (state value)
class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(ActorCritic, self).__init__()
        self.fc = nn.Linear(state_dim, 128)
        self.action_head = nn.Linear(128, action_dim)
        self.value_head = nn.Linear(128, 1)

    def forward(self, x):
        x = F.relu(self.fc(x))
        action_probs = F.softmax(self.action_head(x), dim=-1)
        state_values = self.value_head(x)
        return action_probs, state_values

env = gym.make('CartPole-v1')
ac_net = ActorCritic(state_dim=env.observation_space.shape[0], action_dim=env.action_space.n)
optimizer = optim.Adam(ac_net.parameters(), lr=0.01)

def a2c(num_episodes=1000):
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            state = torch.tensor(state, dtype=torch.float32)
            action_probs, state_value = ac_net(state)
            # Sample an action from the actor
            action = np.random.choice(len(action_probs.detach().numpy()), p=action_probs.detach().numpy())
            next_state, reward, done, _ = env.step(action)
            next_state = torch.tensor(next_state, dtype=torch.float32)
            _, next_state_value = ac_net(next_state)
            # TD error; the bootstrapped target is detached so it is treated as a constant
            td_error = reward + (1 - done) * 0.99 * next_state_value.detach() - state_value
            # Actor loss: policy gradient weighted by the (detached) TD error
            actor_loss = -torch.log(action_probs[action]) * td_error.detach()
            # Critic loss: squared TD error
            critic_loss = td_error ** 2
            optimizer.zero_grad()
            (actor_loss + critic_loss).backward()
            optimizer.step()
            state = next_state.numpy()
    print("A2C training finished")

a2c()
```
This article has introduced the main reinforcement learning algorithm families together with Python example code: a value-based method (Q-Learning), a policy gradient method (REINFORCE), and an Actor-Critic method (A2C). These examples should help clarify the basic principles of reinforcement learning and how to implement them in practice.