首页 > 其他分享 >CPEN400D 深度学习

CPEN400D 深度学习

时间:2023-04-08 09:02:07浏览次数:39  
标签:positional Transformer CPEN400D encoding 深度 module 学习 token each


CPEN400D: Deep Learning
Programming Assignment 2
Released: 2023 Mar. 24
In this assignment, you will implement a Transformer [1] architecture from scratch. A Colab
notebook is provided with this documentation, which you need to complete. Please pay attention
to the following notes:
The notebook is available at this link.
You are only allowed to modify the code inside the START/END blocks.
For each module, one test case is provided at the end of the notebook. You may use the
test cases to get a sense of how the input/output formats are, and as a sanity check for your
implementation. For each test case, your code must give a response within 10 seconds.
Make sure your code is readable (by leaving comments and writing self-commented code).
Unless otherwise specified, you are not allowed to use any module from torch.nn module.
For each subsection, there is a corresponding START/END block in the provided colab that
you need to fill in.
All the code must be written by yourself, and you are not allowed to copy any code from
other sources such as your classmates or the internet.
If you use ChatGPT or other LLMs to help finish the assignment, please clearly mark which
questions you use them. We may require you to submit your prompts. Please make sure you
store them properly.
1 Transformer
In this problem, we will implement a Transformer from scratch and train it on a toy task. The setup
of the problem is as follows: We are given training/validation/test sets of size Ntrain, Nval, Ntest,
respectively. Each dataset contains a set of strings consisting of four English alphabet characters
“c”,“p”,“e”, and “n”. The task is to verify if the sequence contains “cpen” as a contiguous substring
or not. For instance “ecpenp” contains this substring and has label 1, while “cpeeen” does not
contain the substring and has label 0. Each string si (1 ≤ i ≤ N) in each dataset contains sequences
of length n. For exact details of this task, you may refer to the implementation of SubstringDataset
in the notebook. An illustrative figure of this task is shown in Figure 1. Throughout this assignment,
we omit the batch size in the equations for simplicity but your implementations must take into
account the batch size.
1
UBC CPEN400D 2022 Winter Term 2 with Professor Renjie Liao Assignment 2
Transformer
[CLS] e c p e n
1
p
Figure 1: Substring matching using a Transformer. A string is first tokenized into a sequence of
characters. A “[CLS]” token is added to the beginning of the sequence. The sequence is encoded
and passed through a Transformer. Finally, the output of the “[CLS]” token is decoded as the class
label. The label is 1 if the string is matched, 0 otherwise.
Preprocessing and Tokenization
First, we need to convert our dataset, which is represented as string data, to a vectorized format.
To this end, we first need to break the string into small pieces. This process is called Tokenization.
Next, we have to convert each of the small pieces into a vectorized format, usable by a neural
network.
1.1 [10pts] Implement the tokenization and vectorization functionality in the Tokenizer class.
This function takes in a string, (1) splits the string, and returns a list of characters (or tokens
in Transformer terminology) (2) (optionally) add a “[CLS]” token to the beginning of the list
(later, we will see why we need this). (3) convert each token into a one-hot vector and return the
resulting matrix. For a string of length n, this function must return a row-wise one-hot matrix
X ∈ R(n+1)×dvoc where dvoc = 4 + 1 = 5 is the size of our token vocabulary.
Positional Encoding
The Transformer model is designed to process sequences of tokens, but it lacks any inherent under-
standing of the order or position of those tokens within the sequence. This means that the model
would treat the same sequence of tokens differently depending on their order within the input. To
overcome this limitation, the Transformer architecture introduces positional encoding.
Positional encoding is a way of adding information about the position of each token in the
sequence to the input embeddings before passing them through the Transformer layers. This allows
the model to distinguish between tokens based on their position in the sequence.
There are various forms of imposing positional encoding such as sinusoidal positional encoding
introduced in the original Transformer paper. Here, we will implement another version, which
is learnable positional encoding. The idea behind learnable positional encoding is to dedicate a
learnable vector to each position index i, instead of a fixed sinusoidal vector.
1.2 [5pts] Implement the (absolute) learnable positional encoding module. This module has a
learnable weight matrix Wpos ∈ RdmaxLen×dmodel . The i-th row (1 ≤ i ≤ dmaxLen) of this matrix is
2
UBC CPEN400D 2022 Winter Term 2 with Professor Renjie Liao Assignment 2
a learnable vector corresponding to the i-th position in a sequence. This module applies positional
encoding to a sequence by element-wise adding rows of this matrix to their corresponding position
in the input.
Multi-Head Attention
Attention is the key component of the Transformer, which we will implement next. Attention
allows the model to attend to different parts of the input sequence with different weights, enabling
it to capture complex relationships between the input tokens. This module consists of numHeads
heads, each head h consists of three weight matrices: WK,h,WQ,h,WV,h ∈ Rdmodel×dh , where
dh = dmodel/dh. Multi-head attention takes three matrices XK ,XQ,XV ∈ Rn×dmodel as input,
where n is the number of tokens. The computational mechanism of each head is as follows:
headh = Softmax,
where Sofotmax is a row-wise softmax operator. The result of each head is a n × dh matrix. To
merge the outputs of heads and obtain a final output of size n × dmodel, heads are concatenated
and multiplied by a weight matrix WO ∈ Rdmodel×dmodel :
Attention(XK ,XQ,XV ) = Concat(head1,head2, . . . ,headH)WO.
In this assignment (and in Transformer Encoder in general), we use Self-attention where XK =
XQ = XV = X are all the same matrices.
1.3 [20pts] Implement the multi-head attention module.
1.5 [5pts] An alternative way of imposing positional encoding in the Transformer is to apply it in
the attention Softmax operator. That is, modify the equation of each head as follows:. In each head, this positional
encoding adds a scalar value of mi?j,h to the attention score of each pair of tokens with indices
i, j (1 ≤ i, j ≤ n). Since in this positional encoding, only the relative position of each pair of tokens
is important, this positional encoding is called Relative Positional Encoding (RPE). Implement
relative positional encoding (RPE) in the multi-head attention module.
3
UBC CPEN400D 2022 Winter Term 2 with Professor Renjie Liao Assignment 2
Transformer Backbone
Next, we will proceed with the implementation of Transformer layers and model. Each layer of the
transformer contains four components: self-attention, feed-forward layer, and normalization. The
feed-forward layer is a two-layer fully connected network with ReLU activation. It contains two
weight matrices W1 ∈ Rdmodel×4dmodel ,W2 ∈ R4dmodel×dmodel , and its output is given by:
FC(X) = ReLU(XW1)W2.
1.4 [15pts] Implement the transformer layer module. The prenorm parameter determines whether
the transformer layer is Pre-Norm or Post-Norm.
1.5 [15pts] Implement the transformer model module. This module contains an encoder, nLayers
layers of transformer layer, and a decoder. The encoder weightWenc ∈ Rdinput×dmodel is a linear layer
that takes in tokens of dimension dinput = dvocab and encodes each of them into a high-dimensional
space of dmodel. After applying the required layers of the Transformer layer, the decoder weight
matrix Wdec ∈ Rdmodel×dout decodes each token from a high-dimensional space into the output
space.
Optimization Scheduling
Next, we implement the optimizer and its scheduler. We choose Adam optimizer as the optimizer.
Transformers often require warm-up and cool-down scheduling for stable training and better/faster
generalization. That is, in the early steps of optimization, the learning rate is gradually increased
from zero to the base learning rate, and in the final stages of optimization, the learning rate is
decreased to zero again. Here, we implement a simple scheduler with linear warmup and cooldown.
1.6 [5pts] Implement a scheduler with warmUp warmup steps and maxSteps total number of steps.
In other words, your scheduler must return a zero learning rate at step 0, increase the learning rate
linearly to lr until step warmUp, and decrease it again linearly to zero until step maxSteps.
Train Substring Matching
Finally, we have all the pieces necessary to train a substring matching model using the Transformer
architecture. The Trainer class contains a minimal training and evaluation procedure for this task.
1.7 [15pts] Implement the loss computation function in the Trainer class. The Transformer contains
n output vectors for input with n tokens. However, our task is to have a single output for the entire
sequence (i.e. the label). Hence, we have added a dummy token called “[CLS]” token, and will only
look at the output of this token to compute the loss and accuracy. The loss/accuracy computation
function should compute the predicted label, and binary cross entropy with the ground truth labels,
and return both the loss and accuracy for a given batch. You may use the cross entropy loss function
from PyTorch to compute loss.
1.8 [10pts] Run the training procedure to make sure your code is correct and the model trains
properly. In case you were not successful in implementing some of the previous parts, you may use
a module from PyTorch (e.g., Attention, Transformer, etc) to implement this part and receive the
full points for this part. If you have implemented everything correctly, your code should achieve
close to perfect accuracy on the test set.

WX:codehelp mailto: [email protected]

标签:positional,Transformer,CPEN400D,encoding,深度,module,学习,token,each
From: https://www.cnblogs.com/qujava/p/17297886.html

相关文章

  • 学习C语言第六天
    一.多维数组元素的地址#include<stdio.h>intmain(){intarr[3][4]={{11,22,33,44},{12,13,15,16},{22,66,77,88}};inti;intj;for(i=0;i<3;i++){for(j=0;j<4;j++){printf("add:0x%p,data:%d",&arr[i......
  • #yyds干货盘点#学习笔记(1)Linux和Windows上实现端口映射
    一、Windows下实现端口映射1.查询端口映射情况netshinterfaceportproxyshowv4tov42.查询某一个IP的所有端口映射情况netshinterfaceportproxyshowv4tov4|find"[IP]"例:netshinterfaceportproxyshowv4tov4|find"192.168.1.1"3.增加一个端口映射netshinterfa......
  • docker学习
    Docker是一个开源的应用容器引擎,它可以让开发者将应用程序及其依赖项打包到一个轻量级、可移植的容器中,然后发布到任何支持Docker的环境中,以消除“在我电脑上可以运行,在你电脑上不能运行”的问题。以下是Docker的基本使用方法:安装Docker:首先,您需要在您的系统上安装Doc......
  • 删边最短路学习笔记
    删边最短路前言删边最短路是一种科技,用于解决一类问题:给定非负权图\(G=(V,E)\)。设\(n=|V|\),保证\(1\)可达\(n\)。设\(\Delta(e)\)为图\(G'=(V,E\setminus\{e\})\)上\(1\rightsquigarrown\)的最短路,若\(G'\)上\(1\)不可达\(n\)则为\(+\infty\)......
  • 强化学习笔记
    1.1.简介强化学习(reinforcementlearning)是机器学习的一个重要分支,其具有两个重要的基本元素:状态和动作。类似于编译原理中的自动机,或数据结构中的AOE图,强化学习研究的就是怎样找到一种最好的路径,使得不同状态之间通过执行相应动作后转换,最终到达目标状态。先介绍几个名词:状态......
  • 内存马学习
    内存马介绍webshell的变迁过程大致如下所述:web服务器管理页面——>大马——>小马拉大马——>一句话木马——>加密一句话木马——>加密内存马 内存马是无文件攻击的一种常用手段,传统的文件上传的webshll或以文件形式驻留的后门越来越容易被检测到,内存马使用越来越多。传统......
  • SpringBoot项目学习总结
    1.项目包结构一共有6个包,common包下的主要是常量和返回结果的结构。2.创建实体类将sql语句复制过来,按住ALT+鼠标左键竖直选中删除,按HOME和END到所有行的头和尾同时编辑。3.三层开发规范分别是Controller/Service/Dao,顺序:前端浏览器->Controller->Service(接口、实现类......
  • 4.7软件工程学习总结
    今天只有晚上有时间自习,然后开始实现注册信息的功能,向mysql数据库里添加数据,之前的地铁项目做的是查询,没有做过添加数据的功能,今天写出了部分后台代码。 ......
  • Java学习路径
    一、Java学习路径   1.JavaSE  2.数据库   3.前端  4.JavaWeb  5.SSM框架  6.Linux  7.SpringBoot  8.SpringCloud  9.Hadoop......
  • 基于Python的机器学习算法——sklearn模块
    基于Python的机器学习算法安装包:pipinstallnumpy#安装numpy包pipinstallsklearn#安装sklearn包importnumpyasnp#加载包numpy,并将包记为np(别名)importsklearn#加载sklearn包python中的基础包:numpy:科学计算的基础库,包括多维数组处理、线性代数等pandas:主......