Contents
Value-Decomposition Networks For Cooperative Multi-Agent Learning
2.3 Multi-Agent Reinforcement Learning
3 A Deep-RL Architecture for Coop-MARL
4.4 The Learned Q-Decomposition
Value-Decomposition Networks For Cooperative Multi-Agent Learning
https://arxiv.org/pdf/1706.05296.pdf
Submitted on 16 June 2017
Abstract
We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In the fully centralized and decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the “lazy agent” problem, which arises due to partial observability. We address these problems by training individual agents with a novel value decomposition network architecture, which learns to decompose the team value function into agent-wise value functions. We perform an experimental evaluation across a range of partially-observable multi-agent domains and show that learning such value-decompositions leads to superior results, in particular when combined with weight sharing, role information and information channels.
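In the paper the learned decomposition is additive: the team value is represented as a sum of per-agent value functions, each depending only on that agent's local observations and actions. Below is a minimal PyTorch-style sketch of that idea; the names (`AgentQNet`, `vdn_team_q`), the plain feed-forward layers (the paper's agents are recurrent to cope with partial observability) and all sizes are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-agent value network: local observation -> one Q-value per local action."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def vdn_team_q(agent_nets, observations, actions):
    """Team Q-value as the sum of per-agent Q-values of the chosen local actions.

    agent_nets:   one AgentQNet per agent
    observations: per-agent local observation tensors, each (batch, obs_dim)
    actions:      per-agent chosen-action index tensors, each (batch,), dtype long
    """
    per_agent_q = [
        net(obs).gather(1, act.unsqueeze(1)).squeeze(1)  # Q_i(o_i, a_i)
        for net, obs, act in zip(agent_nets, observations, actions)
    ]
    return torch.stack(per_agent_q, dim=0).sum(dim=0)    # Q_team = sum_i Q_i
```

Only this summed value is trained against the single team reward (with an ordinary Q-learning target), so the joint gradient flows back into every individual network, while at execution time each agent can act greedily on its own Q_i using only its local observation.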
1 Introduction
We consider the cooperative multi-agent reinforcement learning (MARL) problem (Panait and Luke, 2005, Busoniu et al., 2008, Tuyls and Weiss, 2012), in which a system of several learning agents must jointly optimize a single reward signal – the team reward – accumulated over time. Each agent has access to its own (“local”) observations and is responsible for choosing actions from its own action set. Coordinated MARL problems emerge in applications such as coordinating self-driving vehicles and/or traffic signals in a transportation system, or optimizing the productivity of a factory comprised of many interacting components. More generally, with AI agents becoming more pervasive, they will have to learn to coordinate to achieve common goals.
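To make the setting concrete, the sketch below mirrors the interaction loop just described: each agent sees only its own local observation, picks an action from its own action set, and the environment returns one scalar team reward shared by all agents. The environment class, its methods and the dummy reward are hypothetical placeholders, not part of the paper.

```python
import random

class CooperativeEnv:
    """Hypothetical cooperative environment with a single shared team reward."""
    N_AGENTS, N_ACTIONS, OBS_DIM = 2, 4, 3

    def reset(self):
        # One local (partial) observation per agent.
        return [tuple(random.random() for _ in range(self.OBS_DIM))
                for _ in range(self.N_AGENTS)]

    def step(self, joint_action):
        next_obs = self.reset()                 # dummy transition
        team_reward = float(sum(joint_action))  # one scalar for the whole team
        done = random.random() < 0.05
        return next_obs, team_reward, done

env = CooperativeEnv()
obs, done = env.reset(), False
while not done:
    # Each agent chooses from its own action set using only its local view.
    joint_action = [random.randrange(env.N_ACTIONS) for _ in range(env.N_AGENTS)]
    obs, team_reward, done = env.step(joint_action)
```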
Although in practice some applications may require local autonomy, in principle the cooperative MARL problem could be treated using a centralized approach, reducing the problem to single-agent reinforcement learning (RL) over the concatenated observations and combinatorial action space. We show that the centralized approach consistently fails on relatively simple cooperative MARL problems in practice. We present a simple experiment in which the centralised approach fails by learning inefficient policies with only one agent active and the other being “lazy”. This happens when one agent learns a useful policy, but a second agent is discouraged from learning because its exploration would hinder the first agent and lead to worse team reward.¹

¹ For example, imagine training a 2-player soccer team using RL with the number of goals serving as the team reward signal. Suppose one player has become a better scorer than the other. When the worse player takes a shot the outcome is on average much worse, and the weaker player learns to avoid taking shots (Hausknecht, 2016).
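For contrast with the decomposition above, the centralised baseline discussed in this paragraph treats the whole team as one RL agent: observations are concatenated and a single network outputs one Q-value per joint action, i.e. per element of the Cartesian product of the individual action sets. A minimal sketch of that construction (the numbers and names are illustrative assumptions):

```python
import itertools
import torch
import torch.nn as nn

n_agents, n_actions, obs_dim = 3, 5, 10

# One index per element of the Cartesian product of the agents' action sets.
joint_actions = list(itertools.product(range(n_actions), repeat=n_agents))
assert len(joint_actions) == n_actions ** n_agents  # 125 here; grows exponentially

centralised_q = nn.Sequential(
    nn.Linear(n_agents * obs_dim, 128), nn.ReLU(),  # concatenated observations in
    nn.Linear(128, len(joint_actions)),             # one Q-value per joint action out
)

concat_obs = torch.randn(1, n_agents * obs_dim)
best_joint_action = joint_actions[centralised_q(concat_obs).argmax(dim=1).item()]
```

Because a single head scores whole joint actions against a single team reward, credit assignment between teammates is implicit, which is one way the “lazy agent” behaviour described above can persist once part of the joint policy already earns reward.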
An alternative approach is to train independent learners to optimize for the team reward. In general each agent is then faced with a non-stationary learning problem because the dynamics of its environment effectively changes as teammates change their behaviours through learning (Laurent et al., 2011).
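Independent learners keep a separate value function per agent, each updated from that agent's own observations but against the same team reward; the other agents' changing policies are folded into the environment dynamics, which is where the non-stationarity comes from. A tabular independent Q-learning sketch under the same illustrative assumptions as the snippets above (constants and names are hypothetical):

```python
import random
from collections import defaultdict

N_AGENTS, N_ACTIONS = 2, 4
ALPHA, GAMMA, EPS = 0.1, 0.99, 0.1

# One independent Q-table per agent, keyed by that agent's own (hashable) observation.
q_tables = [defaultdict(lambda: [0.0] * N_ACTIONS) for _ in range(N_AGENTS)]

def act(agent, obs_i):
    """Epsilon-greedy action from the agent's own table and local observation."""
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: q_tables[agent][obs_i][a])

def update(agent, obs_i, a_i, team_reward, next_obs_i):
    # Each learner bootstraps on the shared team reward as if it were acting alone;
    # teammates' evolving policies are hidden in the transition to next_obs_i,
    # so this target drifts (is non-stationary) from the agent's point of view.
    target = team_reward + GAMMA * max(q_tables[agent][next_obs_i])
    q_tables[agent][obs_i][a_i] += ALPHA * (target - q_tables[agent][obs_i][a_i])
```

The value-decomposition network of the paper sits between these two extremes: like independent learners it keeps one network per agent, but the per-agent values are trained jointly through their sum against the team reward.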
From: https://blog.csdn.net/wq6qeg88/article/details/137136253