
Value-Decomposition Networks For Cooperative Multi-Agent Learning (VDN)

Posted: 2024-03-29 12:30:54
Tags: Multi, Learning, MARL, agent, player, Value, learning, team, RL

Contents

Value-Decomposition Networks For Cooperative Multi-Agent Learning

Abstract

1 Introduction

1.1 Other Related Work

2 Background

2.1 Reinforcement Learning

2.2 Deep Q-Learning

2.3 Multi-Agent Reinforcement Learning

3 A Deep-RL Architecture for Coop-MARL

4 Experiments

4.1 Agents

4.2 Environments

4.3 Results

4.4 The Learned Q-Decomposition

5 Conclusions

Appendix A: Plots

Appendix B: Diagrams

Value-Decomposition Networks For Cooperative Multi-Agent Learning


https://arxiv.org/pdf/1706.05296.pdf

Submitted on 16 June 2017

Abstract

We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In the fully centralized and decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the “lazy agent” problem, which arises due to partial observability. We address these problems by training individual agents with a novel value decomposition network architecture, which learns to decompose the team value function into agent-wise value functions. We perform an experimental evaluation across a range of partially-observable multi-agent domains and show that learning such value-decompositions leads to superior results, in particular when combined with weight sharing, role information and information channels.
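To make "decomposing the team value function into agent-wise value functions" concrete, here is a minimal sketch that assumes the additive form used in the VDN paper, where the team Q-value is the sum of per-agent Q-values. This excerpt does not spell that form out, so treat it as an assumption. The numbers, the two-agent setup, and the function names `team_q`, `greedy_joint_action`, and `brute_force_joint_action` are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical per-agent Q-values for a single time step: per_agent_q[i][a]
# is agent i's value for its own action a, based on its local observation.
per_agent_q = [
    [0.2, 1.5, -0.3],  # agent 0
    [0.9, 0.1, 0.4],   # agent 1
]


def team_q(per_agent_q, joint_action):
    """Team value as the sum of each agent's value for its own action."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))


def greedy_joint_action(per_agent_q):
    """Each agent independently picks the argmax of its own Q-values."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)


def brute_force_joint_action(per_agent_q):
    """Search every joint action; only feasible for tiny problems."""
    joint_actions = product(*(range(len(q)) for q in per_agent_q))
    return max(joint_actions, key=lambda ja: team_q(per_agent_q, ja))


decentralized = greedy_joint_action(per_agent_q)
centralized = brute_force_joint_action(per_agent_q)
```

Under the additive assumption, the joint action found by letting each agent maximize its own term matches the one found by searching the full joint action space, so action selection can be decentralized even though training uses a single team reward.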

1 Introduction

We consider the cooperative multi-agent reinforcement learning (MARL) problem (Panait and Luke, 2005, Busoniu et al., 2008, Tuyls and Weiss, 2012), in which a system of several learning agents must jointly optimize a single reward signal, the team reward, accumulated over time. Each agent has access to its own (“local”) observations and is responsible for choosing actions from its own action set. Coordinated MARL problems emerge in applications such as coordinating self-driving vehicles and/or traffic signals in a transportation system, or optimizing the productivity of a factory comprised of many interacting components. More generally, with AI agents becoming more pervasive, they will have to learn to coordinate to achieve common goals.

Although in practice some applications may require local autonomy, in principle the cooperative MARL problem could be treated using a centralized approach, reducing the problem to single-agent reinforcement learning (RL) over the concatenated observations and combinatorial action space. We show that the centralized approach consistently fails on relatively simple cooperative MARL problems in practice. We present a simple experiment in which the centralised approach fails by learning inefficient policies with only one agent active and the other being “lazy”. This happens when one agent learns a useful policy, but a second agent is discouraged from learning because its exploration would hinder the first agent and lead to worse team reward. [1]

[1] For example, imagine training a 2-player soccer team using RL with the number of goals serving as the team reward signal. Suppose one player has become a better scorer than the other. When the worse player takes a shot the outcome is on average much worse, and the weaker player learns to avoid taking shots (Hausknecht, 2016).
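The "combinatorial action space" of the centralized approach can be made concrete with a short sketch. It assumes every agent has the same number of discrete actions; the figure of 5 actions per agent and the helper names `joint_action_count` and `enumerate_joint_actions` are hypothetical, not taken from the paper.

```python
from itertools import product


def joint_action_count(actions_per_agent, num_agents):
    """Number of joint actions a single centralized learner must rank."""
    return actions_per_agent ** num_agents


def enumerate_joint_actions(actions_per_agent, num_agents):
    """Explicitly list every joint action (only practical for small teams)."""
    return list(product(range(actions_per_agent), repeat=num_agents))


# Joint action space size for teams of 1, 2, 4 and 8 agents,
# assuming 5 actions per agent.
sizes = {n: joint_action_count(5, n) for n in (1, 2, 4, 8)}
```

With 5 actions per agent, the joint action space grows from 5 for one agent to 390625 for eight, which is one reason a single centralized learner becomes hard to train as the team grows.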

An alternative approach is to train independent learners to optimize for the team reward. In general each agent is then faced with a non-stationary learning problem because the dynamics of its environment effectively change as teammates change their behaviours through learning (Laurent et al., 2011).
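The non-stationarity faced by an independent learner can be illustrated with a toy calculation. This is a sketch under stated assumptions, not an experiment from the paper: the 2x2 team-reward table, the two teammate policies, and the helper `expected_reward` are all hypothetical.

```python
# Hypothetical team-reward table for a one-step, two-agent game:
# TEAM_REWARD[a0][a1] is the shared reward when agent 0 plays a0
# and its teammate plays a1.
TEAM_REWARD = [
    [1.0, 0.0],
    [0.0, 2.0],
]


def expected_reward(my_action, teammate_policy):
    """Expected team reward for agent 0's action, given the teammate's
    current probability distribution over its own actions."""
    return sum(
        prob * TEAM_REWARD[my_action][teammate_action]
        for teammate_action, prob in enumerate(teammate_policy)
    )


# The teammate's policy drifts as it learns (assumed values).
early_teammate = [0.9, 0.1]
late_teammate = [0.1, 0.9]

early_values = [expected_reward(a, early_teammate) for a in (0, 1)]
late_values = [expected_reward(a, late_teammate) for a in (0, 1)]
```

From agent 0's point of view nothing about its own observations or actions has changed, yet its best action flips from 0 to 1 once the teammate's policy shifts. That moving target is what makes the independent learner's problem non-stationary.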

From: https://blog.csdn.net/wq6qeg88/article/details/137136253
