Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks
The purpose of reading this paper is to understand hypernetworks; the accompanying code is at https://github.com/rabeehk/hyperformer
Parameter-efficient fine-tuning methods rely on inserting adapter modules. This paper instead uses a shared hypernetwork to generate the adapter for every task and every layer, conditioned on the task, the adapter position, and the layer id in the transformer model.
Introduction
Pretrained large-scale language models already perform strongly: Transfer learning from pretrained large-scale language models yields state-of-the-art results in a variety of tasks (Devlin et al., 2019; Radford et al., 2018; Liu et al., 2019b).
Why does the method in this paper work?
The hypernetwork is jointly learned between all tasks and is thus able to share information across them, while negative interference is minimized by generating separate adapter layers for each task. For each new task, our model only requires learning an additional task embedding, reducing the number of trained parameters.
What are the main contributions of this paper?
- It proposes this framework (in my view, essentially an application of existing hypernetworks to adapters)
- The proposed method outperforms previous methods
- The method is validated on GLUE
- The method is further analyzed on unseen in-domain tasks
Method
Task Conditional Adapter Layers
Section 2.1 essentially adopts a hypernetwork-like structure. In the original hypernetworks paper, an embedding vector describes the full weights of a given layer; here, the authors instead use an embedding to describe the input task.
In this work, we propose conditional adapter modules, in which we generate the adapter weights based on input task embeddings using shared hypernetworks (Ha et al., 2017), which capture information across tasks that can be used to positively transfer to other relevant tasks.
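A minimal sketch (not the hyperformer repo's actual code, class and argument names are illustrative) of this idea: a shared hypernetwork maps a task embedding to the down/up projection weights of an adapter, and the adapter applies those generated weights with a residual connection.

```python
import torch
import torch.nn as nn


class AdapterHyperNet(nn.Module):
    """Generates an adapter's down/up projection weights from a task embedding."""

    def __init__(self, task_emb_dim, d_model, bottleneck):
        super().__init__()
        self.d_model, self.bottleneck = d_model, bottleneck
        # One linear head per generated weight matrix.
        self.down_w = nn.Linear(task_emb_dim, bottleneck * d_model)
        self.up_w = nn.Linear(task_emb_dim, d_model * bottleneck)

    def forward(self, task_emb):
        W_down = self.down_w(task_emb).view(self.bottleneck, self.d_model)
        W_up = self.up_w(task_emb).view(self.d_model, self.bottleneck)
        return W_down, W_up


def adapter_forward(x, task_emb, hypernet):
    """Adapter with generated weights: down-project, nonlinearity, up-project, residual."""
    W_down, W_up = hypernet(task_emb)
    h = torch.relu(x @ W_down.t())   # (batch, seq, bottleneck)
    return x + h @ W_up.t()          # back to (batch, seq, d_model)
```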
Task Conditional Layer Normalization
This part is also a function: it maps the task embedding to the two layer-normalization parameters, the scale and the shift.
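A minimal sketch of task conditional layer normalization under that reading (illustrative, not the paper's exact code): the scale and shift are generated from the task embedding rather than being fixed learned parameters.

```python
import torch
import torch.nn as nn


class TaskConditionalLayerNorm(nn.Module):
    def __init__(self, task_emb_dim, d_model, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.to_gamma = nn.Linear(task_emb_dim, d_model)  # generates the scale
        self.to_beta = nn.Linear(task_emb_dim, d_model)   # generates the shift

    def forward(self, x, task_emb):
        # task_emb: 1-D tensor of size task_emb_dim; x: (batch, seq, d_model)
        gamma = self.to_gamma(task_emb)
        beta = self.to_beta(task_emb)
        mu = x.mean(dim=-1, keepdim=True)
        sigma = x.std(dim=-1, keepdim=True)
        return gamma * (x - mu) / (sigma + self.eps) + beta
```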
Task Conditioned Hypernetworks
This part defines the hypernetwork component and explains how hypernetworks are used in this paper.
What is the difference between Hyperformer++ and Hyperformer?
On top of Hyperformer, Hyperformer++ feeds the hypernetwork a combined embedding of each task, each adapter position, and each layer id in the transformer, so that a single hypernetwork can be shared across all layers and adapter positions.
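A sketch of how the task-conditioned setup keeps the per-task cost small (names and the projector architecture below are my assumptions, not the repo's API): each task owns only a small learned embedding row, while the projector and the hypernetwork are shared, so adding a new task only adds one embedding.

```python
import torch
import torch.nn as nn


class TaskEmbeddings(nn.Module):
    def __init__(self, num_tasks, task_emb_dim, proj_dim):
        super().__init__()
        self.z = nn.Embedding(num_tasks, task_emb_dim)  # per-task learned vector
        self.projector = nn.Sequential(                 # shared task projector
            nn.Linear(task_emb_dim, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )

    def forward(self, task_id):
        return self.projector(self.z(task_id))          # task embedding fed to the hypernetwork


# Usage: the shared AdapterHyperNet above consumes this embedding; for a new task,
# only a new row of `self.z` needs to be learned.
task_embs = TaskEmbeddings(num_tasks=8, task_emb_dim=64, proj_dim=64)
task_emb = task_embs(torch.tensor(3))
```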
This way, the hypernetwork is able to produce distinct weights for each task, adapter position, and layer of a transformer. Furthermore, layer id and adapter position embeddings are parameters that are learned via back-propagation, allowing us to train the whole model end-to-end conveniently.
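A sketch of that Hyperformer++ input (the concatenation-plus-projection fusion below is an illustrative choice, not necessarily the repo's implementation): the task, layer-id, and adapter-position embeddings are all learned and fused into a single vector consumed by the one shared hypernetwork.

```python
import torch
import torch.nn as nn


class SourceEmbedding(nn.Module):
    def __init__(self, num_tasks, num_layers, num_positions, dim):
        super().__init__()
        self.task = nn.Embedding(num_tasks, dim)
        self.layer = nn.Embedding(num_layers, dim)        # learned via back-propagation
        self.position = nn.Embedding(num_positions, dim)  # e.g. after attention vs. after FFN
        self.project = nn.Linear(3 * dim, dim)            # fuse into one hypernetwork input

    def forward(self, task_id, layer_id, position_id):
        fused = torch.cat(
            [self.task(task_id), self.layer(layer_id), self.position(position_id)], dim=-1
        )
        return self.project(fused)  # fed to the single shared hypernetwork
```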