Attention Is All You Need

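Contents

Transformer

Transformer architecture
Positional encoding
Encoder

Multi-head self-attention
Feed-forward network

Decoder

Masked multi-head self-attention
Multi-head attention that takes the encoder output as input
Feed-forward network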

This article introduces the Transformer, a model that uses no convolutional or recurrent layers and is built entirely on attention mechanisms. The model was proposed in the paper Attention Is All You Need.

Transformer

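General attention model

Self-attention and multi-head attention

The attention mechanism used by the Transformer is multi-head self-attention, which combines self-attention with multi-head attention:

[Figure: multi-head self-attention. Image adapted from [1]]

Multi-head self-attention is demonstrated below using the Transformer as the example, so the details are omitted here.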

Transformer architecture

The overall architecture of the Transformer is shown below:

[Figure: overall Transformer architecture]

The encoder is on the left and the decoder is on the right; $N\times$ means that the module is repeated $N$ times.

The Transformer encoder encodes the input $X = (x_1, \dots, x_n)$ into a representation $Z = (z_1, \dots, z_n)$.

The Transformer decoder is autoregressive: when generating the next symbol, it takes all previously generated symbols as input. Given $Z$, the decoder generates one element $y_t$ at a time until the whole output sequence $Y = (y_1, \dots, y_m)$ has been produced.

Take machine translation (Chinese to English) as an example:

[Figure: machine translation example]

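(1) Training

Encoder

Input: 生存还是毁灭
Output: the encoding Z

Decoder

Input: BEGIN To be or not to be, together with Z (fed in at the middle of the decoder, as shown in the architecture diagram)
Target: To be or not to be END

BEGIN and END are the start and end markers.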
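
The decoder input and target in the training case are the same target sentence shifted by one position. A minimal sketch of that shift follows; `make_decoder_pair` is a hypothetical helper name, and the tokens are plain strings rather than the integer ids a real model would use.

```python
# Special markers, named as in the example above.
BEGIN, END = "BEGIN", "END"

def make_decoder_pair(target_tokens):
    """Return (decoder_input, decoder_target) for one training example."""
    decoder_input = [BEGIN] + target_tokens    # target shifted right by one
    decoder_target = target_tokens + [END]     # what the decoder should predict
    return decoder_input, decoder_target

target = "To be or not to be".split()
decoder_input, decoder_target = make_decoder_pair(target)

# At every position the decoder sees the left column and must predict the right one.
for seen, predicted in zip(decoder_input, decoder_target):
    print(f"{seen:>5} -> {predicted}")
```

Because the whole shifted input is available at training time, all positions can be trained in parallel instead of one step at a time.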

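(2) Testing

Encoder

Input: 生存还是毁灭
Output: the encoding Z

Autoregressive decoder inputs and outputs:

Input: BEGIN and Z (fed in at the middle)
Output: To

Input: BEGIN To and Z (fed in at the middle)
Output: be

Input: BEGIN To be and Z (fed in at the middle)
Output: or

Input: BEGIN To be or and Z (fed in at the middle)
Output: not

Input: BEGIN To be or not and Z (fed in at the middle)
Output: to

Input: BEGIN To be or not to and Z (fed in at the middle)
Output: be

Input: BEGIN To be or not to be and Z (fed in at the middle)
Output: END

Decoding stops here.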
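
The step-by-step trace above is a greedy autoregressive decoding loop. The sketch below reproduces only the control flow. `toy_decoder_step` is a hypothetical stand-in for a trained decoder: it reads the next token from a fixed reference translation stored in `z` instead of attending over a real encoding.

```python
BEGIN, END = "BEGIN", "END"

def toy_decoder_step(prefix, z):
    """Stand-in for a trained decoder: return the next token for this prefix.

    A real decoder would attend over the encoding and the prefix; this toy
    version only looks up a fixed reference translation kept in `z`.
    """
    reference = z["reference"]
    position = len(prefix) - 1            # tokens generated so far (BEGIN excluded)
    return reference[position] if position < len(reference) else END

def greedy_decode(z, max_len=20):
    prefix = [BEGIN]
    while len(prefix) <= max_len:         # safety limit on the output length
        next_token = toy_decoder_step(prefix, z)
        if next_token == END:             # the decoder signals the end of the sentence
            break
        prefix.append(next_token)         # feed the output back in as input
    return prefix[1:]                     # drop the BEGIN marker

z = {"reference": "To be or not to be".split()}   # placeholder for the encoding Z
print(greedy_decode(z))
```

The encoder runs once per sentence, while the decoder runs once per generated token, each time on a prefix that is one token longer.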

The next three sections cover the Transformer's positional encoding, encoder, and decoder in turn.

Positional encoding

The Transformer contains no convolutional or recurrent modules. So that the model can make use of the order of the input, a positional encoding is added to the input feature vectors to supply position information.

Let $t$ be the position of the input vector $x_t$, $d$ the dimension of the input vector, and $i$ the index of the $i$-th dimension of the input vector. The Transformer uses sine and cosine functions of different frequencies as the positional encoding. The positional encoding $p_t$ of the input vector $x_t$ is:

$p_t^{(i)} = \sin(\omega_k t), \quad i = 2k$

$p_t^{(i)} = \cos(\omega_k t), \quad i = 2k + 1$

where $\omega_k = \dfrac{1}{10000^{2k/d}}$ and $k = 0, 1, \dots, d/2 - 1$.
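
As an illustration, the sinusoidal encoding can be computed for a whole block of positions at once. This is a minimal NumPy sketch; the function name `positional_encoding`, zero-based positions, and an even dimension `d` are assumptions of the sketch.

```python
import numpy as np

def positional_encoding(num_positions, d):
    """Return a (num_positions, d) matrix whose row t is the encoding of position t."""
    assert d % 2 == 0, "this sketch assumes an even dimension d"
    t = np.arange(num_positions)[:, None]       # positions, shape (T, 1)
    k = np.arange(d // 2)[None, :]              # frequency index, shape (1, d/2)
    omega = 1.0 / (10000.0 ** (2 * k / d))      # one frequency per sine/cosine pair
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(omega * t)             # even dimensions use sine
    pe[:, 1::2] = np.cos(omega * t)             # odd dimensions use cosine
    return pe

pe = positional_encoding(num_positions=50, d=128)
print(pe.shape)
```

Low dimensions oscillate quickly with the position and high dimensions slowly, so every position receives a distinct pattern.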

Plotting the positional encoding with the dimension $i$ on the horizontal axis and the position $t$ on the vertical axis gives:

[Figure: plot of the positional encoding, dimension $i$ against position $t$. Image source: [7]]

Each row is one positional encoding $p_t$.

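Besides encoding the absolute position of a vector (different input positions get different encodings), this positional encoding also carries relative position information. For every frequency $\omega_k$ and offset $\Delta t$:

$\begin{bmatrix} \sin(\omega_k (t + \Delta t)) \\ \cos(\omega_k (t + \Delta t)) \end{bmatrix} = M_{\Delta t} \begin{bmatrix} \sin(\omega_k t) \\ \cos(\omega_k t) \end{bmatrix}$

where

$M_{\Delta t} = \begin{bmatrix} \cos(\omega_k \Delta t) & \sin(\omega_k \Delta t) \\ -\sin(\omega_k \Delta t) & \cos(\omega_k \Delta t) \end{bmatrix}$

That is, $\sin(\omega_k (t + \Delta t))$ and $\cos(\omega_k (t + \Delta t))$ can be written as linear combinations of $\sin(\omega_k t)$ and $\cos(\omega_k t)$, with coefficients that depend only on the offset $\Delta t$ and not on the position $t$.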
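
The linear relation can be checked numerically for one sine/cosine pair. The sketch below assumes the frequency $1/10000^{2k/d}$ for pair $k$; the helper name `shift_matrix` and the particular values of `d`, `k`, `t` and `delta` are arbitrary choices for illustration.

```python
import numpy as np

def shift_matrix(omega, delta):
    """2x2 matrix that maps [sin(omega*t), cos(omega*t)] to the same pair at t + delta."""
    return np.array([
        [ np.cos(omega * delta), np.sin(omega * delta)],
        [-np.sin(omega * delta), np.cos(omega * delta)],
    ])

d, k, t, delta = 128, 3, 7, 5
omega = 1.0 / (10000.0 ** (2 * k / d))

pair_t = np.array([np.sin(omega * t), np.cos(omega * t)])
pair_shifted = np.array([np.sin(omega * (t + delta)), np.cos(omega * (t + delta))])

# The matrix depends on delta only, yet it maps the pair at t to the pair at t + delta.
print(np.allclose(shift_matrix(omega, delta) @ pair_t, pair_shifted))
```

The check follows directly from the angle-addition formulas for sine and cosine, so it holds for any `t` and `delta`.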

For the input $X = (x_1, \dots, x_n)$, the positional encoding $p_t$ has the same dimension as the vector $x_t$, so the two can be added elementwise:

$x_t' = x_t + p_t, \quad t = 1, \dots, n$

Encoder

Multi-head self-attention

[Figure: multi-head self-attention in the encoder]

The Transformer has $h$ parallel self-attention layers, or heads. Each self-attention head has its own learnable weight matrices $W_i^Q$, $W_i^K$ and $W_i^V$, $i = 1, \dots, h$. The queries, keys and values of head $i$ are computed from the feature matrix $X$ (one row per input position) as follows:

Key matrix:

$K_i = X W_i^K$

Value matrix:

$V_i = X W_i^V$

Query matrix:

$Q_i = X W_i^Q$

The scaled dot product is used as the score function to compute the attention scores, where $d_k$ is the dimension of the key vectors:

$E_i = \dfrac{Q_i K_i^\top}{\sqrt{d_k}}$

Softmax, applied to each row, is used as the alignment function to compute the attention weights:

$A_i = \operatorname{softmax}(E_i)$

The output of self-attention head $i$ is:

$\text{head}_i = A_i V_i$

The per-head self-attention computation above can be summarized as:

$\text{head}_i = \operatorname{Attention}(X W_i^Q, X W_i^K, X W_i^V)$

$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\dfrac{Q K^\top}{\sqrt{d_k}}\right) V$
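
A single self-attention head can be sketched in a few lines of NumPy. The sketch assumes the feature matrix stores one input position per row; the sizes and random weights are placeholders rather than trained parameters, and the names `softmax` and `attention` are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # score function: scaled dot product
    weights = softmax(scores, axis=-1)          # alignment function: row-wise softmax
    return weights @ V, weights                 # weighted sum of the value vectors

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8                             # 4 positions, feature size 8
X = rng.normal(size=(n, d))                     # feature matrix, one row per position
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

out, weights = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1, so each output row is a convex combination of the value vectors.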

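The goal is still to produce a context vector as the output of the attention model. The context vectors produced by the individual heads are therefore concatenated into a single vector $\operatorname{Concat}(\text{head}_1, \dots, \text{head}_h)$, which is then linearly transformed by a weight matrix $W^O$:

$\operatorname{MultiHead}(X) = \operatorname{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$

For each head we can set $d_k = d_v = d_{\text{model}} / h$, where $d_v$ is the dimension of the value vectors and $d_{\text{model}}$ is the dimension of the model's feature vectors. Because the output of every head has the reduced size $d_{\text{model}} / h$, the total computational cost is about the same as that of single-head attention with the full dimension $d_{\text{model}}$.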
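
The concatenation step and the cost argument can be illustrated with a small NumPy sketch. It assumes one input position per row and a per-head width of $d_{\text{model}} / h$; the function name `multi_head_self_attention`, the sizes, and the random weights are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n, d_model); W_Q, W_K, W_V: (h, d_model, d_k); W_O: (h * d_k, d_model)."""
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):    # one iteration per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)                     # head output, shape (n, d_k)
    return np.concatenate(heads, axis=-1) @ W_O # concatenate, then project

rng = np.random.default_rng(0)
n, d_model, h = 5, 64, 8
d_k = d_model // h                              # reduced per-head dimension
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))

Y = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)
print(Y.shape)

# Projection weights: h heads of width d_model / h versus one head of width d_model.
multi_head_params = 3 * h * d_model * d_k
single_head_params = 3 * d_model * d_model
print(multi_head_params, single_head_params)
```

With these sizes the two parameter counts are equal, which matches the claim that splitting into heads of reduced width keeps the total cost close to that of one full-width head.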