Paper Notes [RL - Exp Replay]: [LFIW] Experience Replay with Likelihood-free Importance Weights


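Abstract: Experience replay, which uses past experience to accelerate temporal-difference (TD) learning of value functions, is a key component of deep reinforcement learning. Prioritizing or reweighting important experiences has been shown to improve the performance of TD learning algorithms. In this work, we propose to reweight experiences by their likelihood under the stationary distribution of the current policy, which implicitly encourages small approximation error of the value function on frequently encountered states. In practice, we use a likelihood-free density ratio estimator over the replay buffer to assign the prioritization weights. We apply the proposed approach to two competitive methods, SAC and TD3, and evaluate it on a suite of OpenAI gym tasks. Compared with other baselines, our approach achieves higher performance and better sample efficiency (superior sample complexity).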

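Table of contents

1. Method

1.1 Idea

1.1.1 Building intuition
1.1.2 Formal description

1.2 Theoretical analysis
1.3 Estimating the current-policy-induced $d^\pi$ with fast and slow buffers
1.4 Pseudocode

2. Experiments
3. Analysis & discussion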

1. Method

1.1 Idea

1.1.1 Building intuition

This paper considers how to design replay priorities for non-uniform experience replay. For background, see Section 1 of Paper Notes [RL - Exp Replay]: An Equivalence between Loss Functions and Non-Uniform Sampling in Exp Replay.
Earlier priority designs usually target learning the optimal value $Q^*$ (e.g. PER), where the TD target is given by the Bellman optimality equation. Learning $Q^*$ usually means we use it to adjust the policy (most likely in a value-based method such as DQN), so the policy is unstable and so is the $(s,a)$ distribution it induces. In that setting priorities are usually based on the TD error, so that the current value estimate moves toward the latest TD target as quickly as possible and policy optimization speeds up, which is intuitively reasonable. This paper instead designs replay priorities for the critic in an actor-critic framework, where the goal is to learn the value function $Q^\pi$ of a stationary policy. Simply prioritizing by TD error may then no longer be reasonable. Consider an $(s,a)$ that is rarely visited under the distribution induced by the current policy but has a large TD error.

Adjusting the estimate of $Q^\pi$ at that point does not affect the policy (it actually does once the policy is updated, but the authors deliberately consider value estimation in isolation). Replaying this transition more often improves the value estimate there, but this $(s,a)$ is rarely visited by the current policy, so the benefit is limited.
By contrast, replaying frequently visited $(s,a)$ pairs more often makes the value estimates more accurate on the $(s,a)$ distribution induced by the current policy.

Based on this analysis, the authors argue that replay priorities should follow the $(s,a)$ distribution induced by the policy, so that frequently visited $(s,a)$ pairs are optimized first. I find this starting point somewhat forced, since the authors artificially treat the actor and the critic separately.

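1.1.2 Formal description

To state the authors' idea formally: although their goal is a prioritized, non-uniform experience replay, the method from Paper Notes [RL - Exp Replay]: An Equivalence between Loss Functions and Non-Uniform Sampling in Exp Replay converts it into an equivalent uniform replay (equivalent in the sense that the expected gradients of the losses are equal), with only a small change to the loss function. See the figure below.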

[Figure: constructing a loss so that the expected gradients under two distributions are equal]

The figure shows how to construct a loss so that the expected gradients under two distributions are equal. Here $\mathcal{D}_1$ can be regarded as a genuinely non-uniform replay distribution and $\mathcal{D}_2$ as the uniform distribution over the replay buffer. $\mathcal{L}_1$ is the original loss used with non-uniform replay. As long as the new loss $\mathcal{L}_2$ is built with the importance sampling ratio as in the figure, it is guaranteed that $\nabla_Q \mathbb{E}_{\mathcal{D}_1}[\mathcal{L}_1] = \nabla_Q \mathbb{E}_{\mathcal{D}_2}[\mathcal{L}_2]$. In other words, for any uniform replay scheme with loss $\mathcal{L}_2$, one can work backwards through an importance sampling ratio to an equivalent non-uniform replay scheme that uses another loss $\mathcal{L}_1$ and distribution $\mathcal{D}_1$. The authors' core goal is therefore to design a new loss that emphasizes frequently visited $(s,a)$ pairs.

Note: $\ell_2$ [...]

The formulas are as follows.

Bellman operator: $\mathcal{B}^\pi Q(s,a) := r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot|s,a),\, a' \sim \pi(\cdot|s')}[Q(s',a')]$
Bellman equation: $Q^\pi = \mathcal{B}^\pi Q^\pi$
The $\ell_2$ loss under the replay buffer distribution $d^{\mathcal{D}}$: $L_Q(\theta; d^{\mathcal{D}}) = \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\big[(Q_\theta(s,a) - \hat{\mathcal{B}}^\pi Q_\theta(s,a))^2\big]$, where $\hat{\mathcal{B}}^\pi$ indicates that sampling error is taken into account. As the number of samples tends to infinity, $\hat{\mathcal{B}}^\pi = \mathcal{B}^\pi$.
Assume $d$ is the distribution the replay buffer is sampled from and that the sample size is infinite ($\hat{\mathcal{B}}^\pi = \mathcal{B}^\pi$). Introducing a priority weight $w(s,a)$, the loss becomes $L_Q(\theta; d, w) = \mathbb{E}_{(s,a) \sim d}\big[w(s,a)\,(Q_\theta(s,a) - \mathcal{B}^\pi Q_\theta(s,a))^2\big]$. Since $d$ and $w$ both act as coefficients, we can set $d^w(s,a) \propto d(s,a)\,w(s,a)$, which gives, up to a constant factor,
$L_Q(\theta; d^w) = \mathbb{E}_{(s,a) \sim d^w}\big[(Q_\theta(s,a) - \mathcal{B}^\pi Q_\theta(s,a))^2\big]$
According to the authors, the weight should be chosen as $w(s,a) = d^\pi(s,a) / d^{\mathcal{D}}(s,a)$.
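The step of folding $w$ into the distribution can be checked numerically. The snippet below is only a sketch on a made-up discrete problem: the distribution `d`, the weights `w` and the squared TD errors `td_sq` are random illustrative values, not quantities from the paper. It shows that the weighted loss under $d$ and the unweighted loss under $d^w \propto d \cdot w$ differ only by the normalizing constant.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 6                                   # number of (s, a) pairs in a toy problem
d = rng.dirichlet(np.ones(n))           # sampling distribution of the buffer (illustrative)
w = rng.uniform(0.1, 3.0, size=n)       # arbitrary non-negative priority weights
td_sq = rng.uniform(0.0, 2.0, size=n)   # squared TD errors, one per (s, a) pair

# Weighted loss under d:  E_d[ w * delta^2 ]
weighted_loss = np.sum(d * w * td_sq)

# Fold w into the distribution:  d_w is proportional to d * w
z = np.sum(d * w)                       # normalizing constant
d_w = d * w / z

# Unweighted loss under d_w:  E_{d_w}[ delta^2 ]
reweighted_loss = np.sum(d_w * td_sq)

# The two losses differ only by the constant z, so their gradients
# point in the same direction.
print(weighted_loss, z * reweighted_loss)
```

Because `z` does not depend on the Q-function parameters, the two losses have the same minimizers, which is why reweighting the loss and changing the sampling distribution can be treated as interchangeable here.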

1.2 Theoretical analysis

The authors choose $d^\pi / d^{\mathcal{D}}$ as the weight not only because of the intuition in Section 1.1.1, but also for another important reason: when the distance between $Q$ values is measured by the $d^\pi$-weighted $\ell_2$ norm, the Bellman operator is a contraction. This touches on the theory of value convergence, where satisfying the contraction mapping principle more tightly means better convergence behavior. See RL Miscellany: Convergence analysis of Bellman iteration in tabular and function-approximation methods.
Bellman iteration converges because the space of state-action value functions $\mathcal{Q}$ is itself a complete metric space and the Bellman operator is a contraction mapping on it. That is, for all $Q, Q' \in \mathcal{Q}$,
$\|\mathcal{B}^\pi Q - \mathcal{B}^\pi Q'\|_\infty \le \gamma \|Q - Q'\|_\infty$
Although this suffices to show convergence, the infinity norm $\|\cdot\|_\infty$ only reflects the gap between $Q$ and $Q'$ at the worst-case $(s,a)$, and it ignores any dependence on the policy. For example, if $Q$ and $Q'$ differ greatly at a single $(s,a)$ and are equal everywhere else, they are far apart under $\|\cdot\|_\infty$, yet in practice $Q$ and $Q'$ are almost the same, because when the state-action space is large enough the policy samples this particular $(s,a)$ with very low probability. Since what we want to learn is $Q^\pi$, a measure related to $\pi$ may be more appropriate, as it reflects the emphasis on frequently visited $(s,a)$ pairs discussed in Section 1.1.1.
The authors propose the $\ell_2$ norm weighted by the $(s,a)$ distribution $d$ induced by a stationary policy $\pi$, i.e.
$\|Q - Q'\|_d^2 := \mathbb{E}_{(s,a) \sim d}\big[(Q(s,a) - Q'(s,a))^2\big]$
This measure has the same form as the $\ell_2$ loss weighted by the distribution $d^w$ introduced earlier:
$L_Q(\theta; d^w) = \|Q_\theta - \mathcal{B}^\pi Q_\theta\|_{d^w}^2$
The authors then prove Theorem 1, which states that if and only if $d = d^\pi$, the Bellman operator is a $\gamma$-contraction under $\|\cdot\|_d$, i.e.
$\|\mathcal{B}^\pi Q - \mathcal{B}^\pi Q'\|_{d^\pi} \le \gamma \|Q - Q'\|_{d^\pi}$
where $d^\pi$ is the $(s,a)$ distribution induced by the current policy $\pi$.
In short, the authors found theoretical support for their initial intuition. To summarize:

When considering the contraction property of the Bellman operator, we should use a measure that depends on the policy, in order to get faster convergence
This measure can be designed as the $d$-weighted $\ell_2$ distance $\|\cdot\|_d$; if and only if $d = d^\pi$, the Bellman operator is a policy-dependent $\gamma$-contraction under it
Therefore $\|\cdot\|_{d^\pi}$ is a better distance measure for the Q-function
Applying this better distance measure to the loss, the loss should be designed as $L_Q(\theta; d^\pi) = \mathbb{E}_{(s,a) \sim d^\pi}\big[(Q_\theta(s,a) - \mathcal{B}^\pi Q_\theta(s,a))^2\big]$
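The contraction claim is easy to probe numerically. The following is a rough sketch on a randomly generated Markov chain over $(s,a)$ pairs; the transition matrix `P`, reward `r` and the sizes are hypothetical, and $d^\pi$ is taken to be the stationary distribution of `P`, which is an assumption of this sketch. It measures the ratio $\|\mathcal{B}^\pi Q - \mathcal{B}^\pi Q'\|_{d^\pi} / \|Q - Q'\|_{d^\pi}$ for many random pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 8, 0.9  # number of (s, a) pairs and discount factor (toy values)

# Row-stochastic transition matrix over (s, a) pairs under a fixed policy.
# All entries are positive, so the stationary distribution is unique.
P = rng.uniform(0.05, 1.0, size=(n, n))
P /= P.sum(axis=1, keepdims=True)
r = rng.uniform(size=n)


def bellman(q):
    """Policy-evaluation Bellman operator on a tabular Q."""
    return r + gamma * P @ q


def stationary(P):
    """Left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()


def weighted_norm(x, d):
    """d-weighted l2 norm: sqrt(E_d[x^2])."""
    return np.sqrt(np.sum(d * x ** 2))


d_pi = stationary(P)

ratios = []
for _ in range(1000):
    q1, q2 = rng.normal(size=n), rng.normal(size=n)
    lhs = weighted_norm(bellman(q1) - bellman(q2), d_pi)
    rhs = weighted_norm(q1 - q2, d_pi)
    ratios.append(lhs / rhs)

max_ratio = max(ratios)
print(max_ratio, "<=", gamma)
```

In this toy setting the largest observed ratio stays at or below `gamma`, consistent with the statement of Theorem 1. Replacing `d_pi` with an arbitrary distribution is a quick way to see that the bound is no longer guaranteed.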

Next the authors run a small experiment to illustrate the idea.

[Figure: left, a three-state MDP; right, results of the toy experiment]

As the left panel shows, this is a three-state MDP in which the agent receives a reward of 1 only upon reaching one particular state. The policy to be evaluated takes the correct action (the one that moves toward the rewarding state) with a fixed probability in every state. The $Q$ value of each $(s,a)$ is initialized by sampling from a uniform distribution, and two settings are considered [the symbols and values for the state, the probability, the initialization range and the two settings were lost in extraction]. In each epoch all transitions are evaluated, and the following TD update simulates the effect of weighting through the learning rate:
[TD update formula lost in extraction]
The right panel shows the results, including weighting by $d^\pi$, which the authors present as evidence for their idea.

1.3 Estimating the current-policy-induced $d^\pi$ with fast and slow buffers

All that remains is to estimate, at every iteration, the $(s,a)$ distribution $d^\pi$ induced by the current policy $\pi$. Two direct options are:

Use an on-policy method: before every iteration, interact extensively with the environment using $\pi$ and estimate $d^\pi$ from the interaction data. The sample complexity of this is clearly too high
Use an off-policy method: adjust the distribution of historical experience in the replay buffer with an importance sampling ratio to obtain $d^\pi$. The problem is that the ratio $d^\pi / d^{\mathcal{D}}$ is hard to estimate (here $d^{\mathcal{D}}$ is the replay buffer distribution)

So likelihood-based methods (as I understand it, methods that explicitly estimate the probability $d^\pi$ under $\pi$) do not work well here. The authors therefore use likelihood-free density ratio estimation methods, which need only samples from the replay buffer to account for the current $d^\pi$.

The authors use the following lemma: assume $f$ has a first-order derivative $f'$ on $[0, \infty)$. Then for all $P, Q \in \mathcal{P}(\mathcal{X})$ with $P \ll Q$ and all $w: \mathcal{X} \to \mathbb{R}^+$,
$D_f(P \| Q) \ge \mathbb{E}_P[f'(w(x))] - \mathbb{E}_Q[f^*(f'(w(x)))]$
where $f^*$ is the convex conjugate of $f$ and $D_f$ is the $f$-divergence between two probability densities. Equality holds when $w = \mathrm{d}P / \mathrm{d}Q$.

Note: $f$-divergences. For any convex, lower-semicontinuous function $f: [0, \infty) \to \mathbb{R}$ satisfying $f(1) = 0$, and two probability densities $P, Q$ (with $P \ll Q$, i.e. $P$ is absolutely continuous with respect to $Q$), the $f$-divergence is defined as
$D_f(P \| Q) = \int_{\mathcal{X}} f\left(\frac{\mathrm{d}P}{\mathrm{d}Q}(x)\right) \mathrm{d}Q(x)$
Different choices of $f$ give the KL divergence and many other divergences.
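To make the lemma concrete, here is a small numeric check for one specific choice of $f$, sketched on random discrete distributions. The choice $f(u) = u \log u$ (which gives the KL divergence) is made purely for illustration; for it, $f'(u) = \log u + 1$ and $f^*(t) = e^{t-1}$, so $f^*(f'(w)) = w$ and the lower bound simplifies to $\mathbb{E}_P[\log w + 1] - \mathbb{E}_Q[w]$. The distributions `p` and `q` are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
p = rng.dirichlet(np.ones(k))  # plays the role of P
q = rng.dirichlet(np.ones(k))  # plays the role of Q (strictly positive, so P << Q)


def kl(p, q):
    """KL(P || Q), the f-divergence for f(u) = u log u."""
    return np.sum(p * np.log(p / q))


def lower_bound(w, p, q):
    """E_P[f'(w)] - E_Q[f*(f'(w))] for f(u) = u log u."""
    return np.sum(p * (np.log(w) + 1.0)) - np.sum(q * w)


exact = kl(p, q)
at_ratio = lower_bound(p / q, p, q)  # plug in the true density ratio

# Any other positive w should give a value no larger than the divergence.
best_random = max(
    lower_bound(rng.uniform(0.1, 5.0, size=k), p, q) for _ in range(1000)
)

print(exact, at_ratio, best_random)
```

The bound is tight exactly at `w = p / q`, which is the property the method relies on: maximizing the right-hand side over $w$ recovers the density ratio without ever evaluating `p` or `q` as likelihoods, since both expectations can be estimated from samples.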

Note the $w$ at which equality holds. For our purposes, replace $P$ with the current policy's distribution $d^\pi$ and $Q$ with the replay buffer distribution $d^{\mathcal{D}}$. The right-hand side is then upper bounded by $D_f(d^\pi \| d^{\mathcal{D}})$, with equality at $w = d^\pi / d^{\mathcal{D}}$.
To estimate $d^\pi / d^{\mathcal{D}}$, three steps are used:

Maintain two replay buffers, one large and one small. The large one is called the regular (slow) replay buffer and the small one the smaller (fast) replay buffer. After every interaction with the environment, both buffers are updated with the latest experience. Because of their different sizes:

Samples in the slow buffer change slowly. It contains more mixed transitions from past policies and is more off-policy, so it can be regarded as sampled from $d^{\mathcal{D}}$
Samples in the fast buffer change quickly. It contains only a small number of transitions from recent policies and is more on-policy, so when the fast buffer is small it can be regarded as approximately sampled from $d^\pi$

Denote the slow and fast buffers by $\mathcal{D}_{\mathrm{s}}$ and $\mathcal{D}_{\mathrm{f}}$, and use a neural network $w_\psi(s,a)$ parameterized by $\psi$ to fit the importance sampling ratio $d^\pi / d^{\mathcal{D}}$ (a probability ratio cannot be negative, so an activation function constrains the output to be non-negative). The objective is to maximize
$\mathbb{E}_{\mathcal{D}_{\mathrm{f}}}[f'(w_\psi(s,a))] - \mathbb{E}_{\mathcal{D}_{\mathrm{s}}}[f^*(f'(w_\psi(s,a)))]$
This objective pushes up the side of the inequality that is upper bounded by the divergence, so that equality approximately holds and a reasonable $w(s,a)$ is obtained. Turning it into a loss function only requires negating it:
$L_w(\psi) := \mathbb{E}_{\mathcal{D}_{\mathrm{s}}}[f^*(f'(w_\psi(s,a)))] - \mathbb{E}_{\mathcal{D}_{\mathrm{f}}}[f'(w_\psi(s,a))]$
Finally, a normalization step with temperature coefficient $T$ addresses the finite sample issue and yields properly normalized weights:
$\tilde{w}_\psi(s,a) := \dfrac{w_\psi(s,a)^{1/T}}{\mathbb{E}_{\mathcal{D}_{\mathrm{s}}}\big[w_\psi(s,a)^{1/T}\big]}$
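The buffer bookkeeping and the temperature normalization in the steps above can be sketched as follows. This is a minimal illustration rather than the authors' code: the class name `FastSlowBuffers`, the helper `normalize_weights`, the buffer sizes and the raw weights are all hypothetical, and the normalizer is taken as the batch mean of $w^{1/T}$ as an approximation of the expectation over the slow buffer.

```python
import random
from collections import deque

import numpy as np


class FastSlowBuffers:
    """Two FIFO buffers that receive the same transitions but differ in size."""

    def __init__(self, slow_size, fast_size):
        self.slow = deque(maxlen=slow_size)  # large buffer, mixes many past policies
        self.fast = deque(maxlen=fast_size)  # small buffer, mostly recent policies

    def add(self, transition):
        # Every new transition goes into both buffers; old ones drop out
        # of the small buffer much sooner than out of the large one.
        self.slow.append(transition)
        self.fast.append(transition)

    def sample(self, batch_size, rng=random):
        slow_batch = rng.sample(list(self.slow), min(batch_size, len(self.slow)))
        fast_batch = rng.sample(list(self.fast), min(batch_size, len(self.fast)))
        return slow_batch, fast_batch


def normalize_weights(w, temperature):
    """Temperature-smoothed weights, normalized to mean 1 over the batch."""
    w_t = np.power(np.asarray(w, dtype=float), 1.0 / temperature)
    return w_t / w_t.mean()


buffers = FastSlowBuffers(slow_size=8, fast_size=3)
for t in range(10):
    buffers.add(t)  # integers stand in for (s, a, r, s') tuples

raw_w = np.array([0.2, 1.0, 5.0])         # hypothetical raw ratio estimates
w_sharp = normalize_weights(raw_w, 1.0)   # T = 1: plain self-normalization
w_smooth = normalize_weights(raw_w, 5.0)  # larger T pulls the weights toward 1
```

A larger temperature flattens the weights, which trades some of the reweighting effect for lower variance when the ratio estimate is noisy.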

After these steps we have the importance sampling ratio, and the TD learning loss can be written as
$L_Q(\theta; d^{\mathcal{D}}, \tilde{w}_\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}_{\mathrm{s}}}\big[\tilde{w}_\psi(s,a)\,(Q_\theta(s,a) - \hat{\mathcal{B}}^\pi Q_\theta(s,a))^2\big]$
where the expectation is estimated by Monte Carlo sampling. This can be plugged into various off-policy actor-critic methods.
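The ratio-fitting step itself can be illustrated without a neural network. The sketch below makes two simplifications that are not from the paper: it uses $f(u) = u \log u$ (KL) instead of the JS divergence, and it replaces the network $w_\psi$ with a table `theta` holding one log-weight per $(s,a)$. The distributions `p_fast` and `p_slow` are invented stand-ins for $d^\pi$ and $d^{\mathcal{D}}$; the estimator only ever sees samples from them.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4  # number of distinct (s, a) pairs in this toy problem

p_fast = np.array([0.4, 0.3, 0.2, 0.1])      # stands in for d^pi
p_slow = np.array([0.25, 0.25, 0.25, 0.25])  # stands in for d^D

# Only samples are visible to the estimator, never p_fast or p_slow.
fast_samples = rng.choice(k, size=20000, p=p_fast)
slow_samples = rng.choice(k, size=20000, p=p_slow)
freq_fast = np.bincount(fast_samples, minlength=k) / fast_samples.size
freq_slow = np.bincount(slow_samples, minlength=k) / slow_samples.size

# Tabular stand-in for the network: w = exp(theta) keeps every ratio positive.
theta = np.zeros(k)
lr = 0.5
for _ in range(500):
    w = np.exp(theta)
    # Loss with f(u) = u log u:  E_slow[w] - E_fast[log w + 1].
    # Its gradient with respect to theta is  freq_slow * w - freq_fast.
    theta -= lr * (freq_slow * w - freq_fast)

w_hat = np.exp(theta)
print(w_hat)            # learned ratio per (s, a)
print(p_fast / p_slow)  # true ratio, shown only for comparison
```

The learned `w_hat` would then be passed through the temperature normalization and used to weight the squared TD errors of a batch drawn from the slow buffer. With a neural network in place of the table, the same loss is minimized by stochastic gradient descent on minibatches from the two buffers.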

1.4 Pseudocode

As shown in the figure [pseudocode image not preserved]

2. Experiments

The authors apply their method to SAC and TD3 and compare it with uniform sampling and PER on gym environments. The hyperparameters are set to [values lost in extraction]. $w_\psi$ is a two-layer fully connected network with 256 units per layer and ReLU activations, and the JS divergence is used as the $f$-divergence [formula for $f$ lost in extraction].
Results when combined with SAC:

[Figure: results with SAC]

Results when combined with TD3:

[Figure: results with TD3]

Summary table:

[Table image: summary of results]

The authors' method achieves higher performance on most tasks, with better sample efficiency (faster convergence).

3. Analysis & discussion

The method is fairly sensitive to hyperparameters, and the two buffer sizes need to be chosen per task. If a task converges quickly and stays at a good level, the slow buffer should be made smaller.
The authors examine the accuracy of the learned $w_\psi$. Experience collected after 5M steps of SAC training is labeled positive, mixed experience from steps 1M to 4M is labeled negative, and $w_\psi$ is used to tell them apart. The result is "precision of 87.3% and an accuracy of 73.1%", which shows the judgment is usually correct [...]
The authors also examine the learned $Q$ (it is closer to the true $Q^\pi$).
The experiments are not done very well. The main baseline, PER, was designed for learning $Q^*$, whereas this method targets learning $Q^\pi$ in an actor-critic framework. In addition, the related work section mentions other methods motivated by improving on-policy-ness (such as ReF-ER), and these should have been compared.
Estimating $d^\pi$ with fast and slow buffers is an interesting idea, although according to the paper it is borrowed from related work such as IRL. Estimating $d^\pi$ better [...]
In this paper the measure used when optimizing the loss is the same as the one used in the Bellman iteration. Could that make it possible to establish a convergence proof for this DRL method?

From： https://blog.51cto.com/u_15887260/5889770