QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. We structurally enforce that the joint-action value is monotonic in the per-agent values, which allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralised and decentralised policies. We evaluate QMIX on a challenging set of StarCraft II micromanagement tasks, and show that QMIX significantly outperforms existing value-based multi-agent reinforcement learning methods.
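
The consistency claim above can be checked on a toy case. The following is a minimal sketch, not the paper's network. It assumes that monotonicity is obtained by taking absolute values of the mixing weights and using a strictly increasing activation (ELU). The function names, weights, and per-agent values are hypothetical numbers chosen for illustration.

```python
import itertools
import math


def elu(x):
    """Strictly increasing activation, so it preserves monotonicity."""
    return x if x > 0 else math.exp(x) - 1.0


def mix(agent_qs, w1, b1, w2, b2):
    """Combine per-agent values into a joint value Q_tot.

    Taking abs() of every weight keeps Q_tot non-decreasing in each
    per-agent value (an assumption of this sketch).
    """
    hidden = [
        elu(sum(abs(w) * q for w, q in zip(row, agent_qs)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2


# Hypothetical per-agent action-values: 2 agents, 3 actions each.
agent_q_tables = [[0.2, 1.5, -0.3], [0.7, -1.0, 0.4]]

# Hypothetical mixing parameters; the signs are irrelevant because of abs().
w1 = [[0.5, -1.2], [2.0, 0.3]]
b1 = [0.1, -0.4]
w2 = [1.0, -0.8]
b2 = 0.05


def q_tot(joint_action):
    qs = [table[u] for table, u in zip(agent_q_tables, joint_action)]
    return mix(qs, w1, b1, w2, b2)


# Decentralised choice: each agent maximises its own values.
decentralised = tuple(
    max(range(len(table)), key=table.__getitem__) for table in agent_q_tables
)

# Centralised choice: brute-force search over all joint actions.
centralised = max(itertools.product(range(3), repeat=2), key=q_tot)

print(decentralised, centralised)
```

Because the mixing is monotonic, the brute-force maximiser of the joint value matches the tuple of per-agent greedy actions, so the search over joint actions is never needed at execution time.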

Machine Learning, ICML

1 Introduction

Reinforcement learning (RL) holds considerable promise to help address a variety of cooperative multi-agent problems, such as coordination of robot swarms (Hüttenrauch et al., 2017) and autonomous cars (Cao et al., 2012).

In many such settings, partial observability and/or communication constraints necessitate the learning of decentralised policies, which condition only on the local action-observation history of each agent. Decentralised policies also naturally attenuate the problem that joint action spaces grow exponentially with the number of agents, often rendering the application of traditional single-agent RL methods impractical.
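
To make the exponential growth concrete, here is a small counting sketch. It assumes every agent has the same number of actions, which the text does not require, and the function names are hypothetical.

```python
def joint_action_count(n_agents, n_actions):
    """Size of the joint action space: one action per agent, chosen jointly."""
    return n_actions ** n_agents


def decentralised_output_count(n_agents, n_actions):
    """Total number of values needed when each agent scores only its own actions."""
    return n_agents * n_actions


for n in (2, 5, 10):
    print(n, joint_action_count(n, 10), decentralised_output_count(n, 10))
```

With 10 actions per agent, 10 agents yield ten billion joint actions, while the decentralised representation needs only 100 per-agent values.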

Fortunately, decentralised policies can often be learned in a centralised fashion in a simulated or laboratory setting. This often grants access to additional state information, otherwise hidden from agents, and removes inter-agent communication constraints. The paradigm of centralised training with decentralised execution (Oliehoek et al., 2008; Kraemer & Banerjee, 2016) has recently attracted attention in the RL community (Jorge et al., 2016; Foerster et al., 2018). However, many challenges surrounding how to best exploit centralised training remain open.

One of these challenges is how to represent and use the action-value function that most RL methods learn. On the one hand, properly capturing the effects of the agents’ actions requires a centralised action-value function Q_tot that conditions on the global state and the joint action. On the other hand, such a function is difficult to learn when there are many agents and, even if it can be learned, offers no obvious way to extract decentralised policies that allow each agent to select only an individual action based on an individual observation.

Figure 1: Decentralised unit micromanagement in StarCraft II, where each learning agent controls an individual unit. The goal is to coordinate behaviour across agents to defeat all enemy units.

The simplest option is to forgo a centralised action-value function and let each agent a learn an individual action-value function Q_a independently, as in independent Q-learning (IQL) (Tan, 1993). However, this approach cannot explicitly represent interactions between the agents and may not converge, as each agent’s learning is confounded by the learning and exploration of others.
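
The independent-learner idea can be sketched in tabular form. This is a simplified illustration rather than the deep variant evaluated in the paper. It assumes hashable local observations, a shared team reward, and hypothetical names (`IndependentQLearner`, `alpha`, `gamma`). Each agent runs ordinary Q-learning on its own observations and treats the other agents as part of the environment.

```python
from collections import defaultdict


class IndependentQLearner:
    """One tabular Q-learner per agent; no information about other agents."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99):
        self.n_actions = n_actions
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def greedy_action(self, obs):
        values = self.q[obs]
        return max(range(self.n_actions), key=values.__getitem__)

    def update(self, obs, action, reward, next_obs, done):
        # Standard Q-learning target computed from this agent's view only.
        bootstrap = 0.0 if done else self.gamma * max(self.q[next_obs])
        target = reward + bootstrap
        self.q[obs][action] += self.alpha * (target - self.q[obs][action])


# Two agents receive the same team reward but update separate tables.
agents = [IndependentQLearner(n_actions=2) for _ in range(2)]
transitions = [("o0", 1, "o0_next"), ("o1", 0, "o1_next")]
for agent, (obs, action, next_obs) in zip(agents, transitions):
    agent.update(obs, action, reward=1.0, next_obs=next_obs, done=False)

print(agents[0].q["o0"], agents[1].q["o1"])
```

Nothing in an agent's update refers to what the other agent did. The reward an agent sees for the same observation and action can therefore change as its teammates explore or learn, which is the source of the non-stationarity described above.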

At the other extreme, we can learn a fully centralised state-action value function Q_tot and then use it to guide the optimisation of decentral

From: https://blog.csdn.net/wq6qeg88/article/details/137138935
