Transformer Study: A Minimal Demo That Transposes a String

Date: 2024-04-02 21:33:00
Tags: Transformer, seq, transpose, DEMO, self, mask, len, hidden, size

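Background: while debugging numerical-accuracy problems on an AI accelerator card during PyTorch training, I built a small Transformer and fixed the random seeds so that the loss is exactly the same on every retraining run. With identical runs, the computation error of each operator can be compared directly.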
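One way to do the per-operator comparison described above is to record every leaf module's output with forward hooks and diff two runs. The snippet below is only a minimal sketch: `capture_outputs`, `max_abs_errors` and the toy model are hypothetical names introduced for illustration, and both runs happen on the CPU here, whereas in a real comparison the second run would use the same weights on the accelerator under test.

```python
import torch
import torch.nn as nn


def capture_outputs(model, x):
    """Run model(x) once and record the output of every leaf module by name."""
    records, handles = {}, []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            def hook(mod, inputs, output, name=name):
                # Store a float64 CPU copy so runs on different devices can be diffed
                records[name] = output.detach().double().cpu()
            handles.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(x)
    for handle in handles:
        handle.remove()
    return records


def max_abs_errors(reference, candidate):
    """Largest absolute difference per module between two recorded runs."""
    return {name: (reference[name] - candidate[name]).abs().max().item()
            for name in reference}


torch.manual_seed(1)
# Hypothetical stand-in for the real model
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4)).eval()
x = torch.randn(2, 8)

ref = capture_outputs(toy, x)   # reference run
cand = capture_outputs(toy, x)  # in practice: the same weights on the device under test
errors = max_abs_errors(ref, cand)
for name, err in errors.items():
    print(f"{name}: max abs error = {err:.3e}")
```

The first module whose error exceeds the expected tolerance is a reasonable place to start looking for the faulty operator.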

1. Code

import copy
import math
import os
import pickle
import random

import numpy as np
import torch
import torch.nn as nn
from torch.nn import Linear, Module
from torch.utils.data import DataLoader, Dataset

# Seed every random number generator so that runs are reproducible
random.seed(1)
np.random.seed(1)
torch.random.manual_seed(1)

'''
Task: rearrange the input string as follows
y=np.array(x).reshape(2,-1).transpose(1,0).reshape(-1)
'''

# Define the token vocabulary
vocabulary = '<PAD>,1,2,3,4,5,6,7,8,9,0,<SOS>,<EOS>'
vocabulary = {word: i for i, word in enumerate(vocabulary.split(','))} # {"<PAD>":0,"1":1,"2":2,...}
vocabulary_values = [k for k, v in vocabulary.items()] # maps predicted token ids back to characters

def tokenlize(x,max_seq_len):
    '''Convert a string to token ids: add <SOS>/<EOS> markers and pad to max_seq_len'''
    x = ['<SOS>'] + list(x) + ['<EOS>'] + ['<PAD>'] * (max_seq_len  - len(x) -2)
    x = [vocabulary[i] for i in x]
    return x

def generate_sample(index,max_seq_len):
    '''Generate one (input, target) sample of the dataset'''
    words = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'] 
 
    count = random.randint(10, 20)
    x1 = np.random.choice(words, size=count, replace=True, p=None).tolist()    
    x2 = np.random.choice(words, size=count, replace=True, p=None).tolist()
    x  = x1+x2
    y = np.array(x).reshape(2,-1).transpose(1,0).reshape(-1)
    y = list(y)
 
    return tokenlize(x,max_seq_len), tokenlize(y,max_seq_len+1)

class UserDataset(Dataset):
    '''The dataset is cached on disk so that every training run sees identical data when checking for computation errors'''
    def __init__(self,max_seq_len):
        super(UserDataset,self).__init__()
        self.max_seq_len=max_seq_len
        self.sample_count=100000
        cache_path="training_dataset.pkl"
        self.record={}
        self.record['x']=[]
        self.record['y']=[]

        if not os.path.exists(cache_path):
            for i in range(self.sample_count):
                x,y=generate_sample(i,self.max_seq_len)
                self.record['x'].append(x)
                self.record['y'].append(y)
            with open(cache_path,'wb') as f:
                pickle.dump(self.record,f)
        else:
            with open(cache_path,'rb') as f:
                self.record=pickle.load(f)

    def __len__(self):
        return self.sample_count

    def __getitem__(self, item):
        x,y=self.record['x'][item],self.record['y'][item]
        return torch.LongTensor(x),torch.LongTensor(y)

def clones(module, N):  
    '''Create N independent deep copies of a module'''
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])  

class MultiheadAttention(Module):
    '''Multi-head attention'''
    def __init__(self,max_seq_len,hidden_size,head_num):
        super().__init__()
        self.max_seq_len=max_seq_len
        self.hidden_size=hidden_size
        self.head_num=head_num
        self.linears = clones(Linear(self.hidden_size, self.hidden_size),4) # projections for q, k, v and the output
        # Normalize over the last (hidden) dimension
        self.norm = nn.LayerNorm(normalized_shape=self.hidden_size, elementwise_affine=True)
        self.dropout = nn.Dropout(0.1)

    def attention(self,q, k, v, mask):
        # Input shape: [batch,head_num,seq,head_dim]
        # q times k^T measures similarity along head_dim and gives token-to-token scores [batch,head_num,seq,seq]
        batch,head_num,seq,head_dim=q.shape
        score = torch.matmul(q, k.permute(0, 1, 3, 2)) 

        # Scale by the square root of the per-head dimension
        score /= head_dim ** 0.5

        # Positions where mask is True are filled with -inf and become 0 after softmax
        # [batch,head_num,query token,weights over the key tokens (used for the weighted sum)]
        score = score.masked_fill_(mask, -float('inf'))
        score = torch.softmax(score, dim=-1)

        # Multiply the attention weights by V to get the attention output
        # [batch,head_num,seq,seq] * [batch,head_num,seq,head_dim] = [batch,head_num,seq,head_dim]
        attn = torch.matmul(score, v)

        # [batch,head_num,seq,head_dim] -> [batch,seq,head_num,head_dim] -> [batch,seq,hidden]
        attn = attn.permute(0, 2, 1, 3).reshape(-1, self.max_seq_len, self.hidden_size)
        return attn

    def forward(self, q, k, v, mask):
        batch = q.shape[0]

        # Keep the original q for the residual connection
        clone_q = q.clone()

        q = self.linears[0](self.norm(q))
        k = self.linears[1](self.norm(k))
        v = self.linears[2](self.norm(v))

        # Split into heads: [batch,seq,hidden] -> [batch,head_num,seq,head_dim]
        q = q.reshape(batch, -1, self.head_num, self.hidden_size//self.head_num).permute(0, 2, 1, 3) 
        k = k.reshape(batch, -1, self.head_num, self.hidden_size//self.head_num).permute(0, 2, 1, 3)
        v = v.reshape(batch, -1, self.head_num, self.hidden_size//self.head_num).permute(0, 2, 1, 3)

        # Compute attention
        attn = self.attention(q, k, v, mask)

        return clone_q + self.dropout(self.linears[3](attn))

class PositionalEmbedding(Module):
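    '''Token embedding plus positional encoding. All positions are processed in parallel, so position information must be injected into the input vectors'''
    def __init__(self,vocab_size,hidden_size,max_seq_len):
        super().__init__()
        # pos: token position, dim: index of the feature dimension, d_model: total number of dimensions
        def get_pre(pos, dim, d_model):
            denom = 1e4 ** (dim / d_model)
            angle = pos / denom
            if dim % 2 == 0:
                return math.sin(angle)
            return math.cos(angle)

        # Give every feature of every token a distinct but related position signal, with bounded values
        pe = torch.empty(max_seq_len, hidden_size)
        for i in range(max_seq_len):
            for j in range(hidden_size):
                pe[i,j] = get_pre(i, j, hidden_size)
        pe = pe.unsqueeze(0) # add the batch dimension
        # Register as a constant buffer that the optimizer does not update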
        self.register_buffer('pe', pe)
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.embed.weight.data.normal_(0, 0.1)

    def forward(self, x):
        # Input: [batch,seqlen] -> [batch,seqlen,hidden]
        embed = self.embed(x)
        # Add the token embedding and the positional encoding (broadcast over the batch)
        # [batch,seqlen,hidden] + [1,seqlen,hidden] -> [batch,seqlen,hidden]
        embed = embed + self.pe
        return embed

class FeedForward(Module):
    '''Two fully connected layers: expand to ffn_size, then project back to hidden_size'''
    def __init__(self,hidden_size,ffn_size):
        super().__init__()
        self.fc = nn.Sequential(
            Linear(in_features=hidden_size, out_features=ffn_size),
            nn.ReLU(),
            Linear(in_features=ffn_size, out_features=hidden_size),
            nn.Dropout(0.1)
        )
        self.norm = nn.LayerNorm(normalized_shape=hidden_size, elementwise_affine=True)

    def forward(self, x):
        return x + self.fc(self.norm(x))

def mask_pad(input_token):
    '''
        1. Build the mask for the input tokens: positions holding <PAD> are set to True
        2. In attention, score.masked_fill_(mask, -float('inf')) sets those positions to -inf so they receive no attention
    '''
    batch,seqlen=input_token.shape
    mask = input_token == vocabulary['<PAD>']
    mask = mask.reshape(batch,1,1,seqlen) # add head and query dimensions for multi-head attention
    # [batch, 1, 1, seqlen] -> [batch, 1, seqlen, seqlen]: mask for the seqlen*seqlen attention matrix
    mask = mask.expand(batch, 1, seqlen, seqlen)
    return mask

def mask_tril(input_token):
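    '''
    Build the mask for the decoder input
    1. Why does the decoder mask differ from the encoder mask? To close the gap between training and inference
    2. At inference time decoding starts from <SOS> and loops until <EOS> is produced; at each step the decoder only sees earlier tokens
    3. So the decoder mask must account for the visible range as well as for <PAD> positions
    4. True in the mask sets the corresponding attention score to -inf; the score matrix is then the left operand,
       and each of its rows is multiply-accumulated with a column of V
    5. So the number of masked positions shrinks from the first row to the last, like this
        False,  True,  True
        False, False,  True
        False, False, False
    '''
    batch,seqlen=input_token.shape

    tril = 1 - torch.tril(torch.ones(1,seqlen,seqlen,dtype=torch.long)) #[1,seqlen,seqlen]
    mask = input_token == vocabulary['<PAD>']   #[batch,seqlen]
    mask = mask.unsqueeze(1).long()             #[batch,1,seqlen]

    # Union (logical OR) of the padding mask and the causal mask
    mask = mask + tril                          #[batch,seqlen,seqlen]
    mask = (mask>0).unsqueeze(dim=1)            #[batch,1,seqlen,seqlen]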
    return mask

class EncoderLayer(nn.Module):
    def __init__(self,max_seq_len,hidden_size,head_num,ffn_size):
        super().__init__()
        self.mh = MultiheadAttention(max_seq_len,hidden_size,head_num)
        self.fc = FeedForward(hidden_size,ffn_size)
    def forward(self,x,mask):
        score = self.mh(x,x,x,mask)
        out = self.fc(score)
        return out

class Encoder(nn.Module):
    def __init__(self,max_seq_len,hidden_size,head_num,ffn_size):
        super().__init__()
        self.layers = clones(EncoderLayer(max_seq_len,hidden_size,head_num,ffn_size), 3) # 3 encoder layers
    def forward(self,x,mask):
        for layer in self.layers:
            x = layer(x,mask)
        return x
    
class DecoderLayer(nn.Module):
    def __init__(self,max_seq_len,hidden_size,head_num,ffn_size):
        super().__init__()
        self.mh1 = MultiheadAttention(max_seq_len,hidden_size,head_num)
        self.mh2 = MultiheadAttention(max_seq_len,hidden_size,head_num)
        self.fc = FeedForward(hidden_size,ffn_size)

    def forward(self,x,y,mask_pad_x,mask_tril_x):
        # Masked self-attention over the decoder input
        y = self.mh1(y,y,y,mask_tril_x)
        # Cross-attention: queries come from the decoder, keys and values from the encoder output
        y = self.mh2(y,x,x,mask_pad_x)
        y = self.fc(y)
        return y

class Decoder(nn.Module):
    def __init__(self,max_seq_len,hidden_size,head_num,ffn_size):
        super().__init__()
        self.layers = clones(DecoderLayer(max_seq_len,hidden_size,head_num,ffn_size), 1)
    def forward(self,x,y,mask_pad_x,mask_tril_x):
        # x is the encoder output (memory); each layer updates only the decoder state y
        for layer in self.layers:
            y = layer(x,y,mask_pad_x,mask_tril_x)
        return y

class Transformer(nn.Module):
    def __init__(self,vocab_size,max_seq_len,hidden_size,head_num,ffn_size):
        super().__init__()
        self.embed_x = PositionalEmbedding(vocab_size,hidden_size,max_seq_len)
        self.embed_y = PositionalEmbedding(vocab_size,hidden_size,max_seq_len)
        self.encoder = Encoder(max_seq_len,hidden_size,head_num,ffn_size)
        self.decoder = Decoder(max_seq_len,hidden_size,head_num,ffn_size)
        self.fc_out = nn.Linear(hidden_size,vocab_size)

    def forward(self,x,y,mask_pad_x,mask_tril_x):
        x,y = self.embed_x(x),self.embed_y(y)
        x = self.encoder(x,mask_pad_x)
        y = self.decoder(x,y,mask_pad_x,mask_tril_x)
        y = self.fc_out(y)
        return y

def predict(model,x,max_seq_len):
    model.eval()
    mask_pad_x = mask_pad(x).to("cuda")

    # Initialize the output: <SOS> followed by <PAD>
    target = [vocabulary['<SOS>']] + [vocabulary['<PAD>']] * (max_seq_len-1)  
    target = torch.LongTensor(target).unsqueeze(0) 

    x = model.embed_x(x.to("cuda"))
    x = model.encoder(x, mask_pad_x)

    # Generate from position 1 onward and stop when <EOS> is produced
    for i in range(max_seq_len-1):
        y = target
        mask_tril_y = mask_tril(y).to("cuda")
        y = model.embed_y(y.to("cuda"))
        y = model.decoder(x, y, mask_pad_x, mask_tril_y)
        out = model.fc_out(y) #[batch,seqlen,vocab_size]

        # Keep only the logits at the current position
        out = out[:, i, :] #[batch,vocab_size]

        # Greedy decoding: take the most likely token
        out = out.argmax(dim=1).detach()
        if out==vocabulary['<EOS>']:
            break
        # Feed the predicted token back in to predict the next one
        target[:, i + 1] = out    
    return target

def val(model,max_seq_len):
    input="12345678900987654321"
    x=torch.LongTensor(tokenlize(input,max_seq_len)).unsqueeze(0)

    gt=np.array(list(input)).reshape(2,-1).transpose(1,0).reshape(-1)
    gt="".join(gt)
    print("Gt:  ",gt)
    pred=np.array([i for i in predict(model,x,max_seq_len)[0].tolist()][1:])
    pred = pred[pred != vocabulary['<PAD>']]
    pred =''.join([vocabulary_values[i] for i in pred])

    print("Pred:",pred)
    return gt==pred

def train(model,vocab_size,max_seq_len):
    loss_func = nn.CrossEntropyLoss()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.StepLR(optim, step_size=3, gamma=0.5)

    loader = DataLoader(UserDataset(max_seq_len),
                        batch_size=32,
                        drop_last=True, # drop the last batch if it has fewer than batch_size samples
                        shuffle=False,   # keep the sample order fixed so the loss curve is identical on every run
                        collate_fn=None) 

    for epoch in range(100):
        model.train()
        for i, (input, gt) in enumerate(loader):
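            # https://zhuanlan.zhihu.com/p/662455502?utm_id=0
            # The decoder input is the ground-truth answer: showing the decoder the correct output during training is called Teacher Forcing.
            # At inference time the model may meet tokens it never saw in the training data; Beam Search is one way to mitigate this.
            # The decoder only sees correct outputs during training but may see its own wrong outputs at test time; this mismatch is called exposure bias.
            # One remedy: feed the decoder some incorrect examples during training.
            mask_pad_x = mask_pad(input).to("cuda")
            # Build the causal mask from the decoder input gt[:, :-1]; in this dataset its padding layout matches that of input
            mask_tril_x = mask_tril(gt[:, :-1]).to("cuda")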

            input=input.to("cuda")
            gt=gt.to("cuda")

            pred = model(input,gt[:, :-1],mask_pad_x,mask_tril_x) 
            pred = pred.reshape(-1, vocab_size)
            gt = gt[:, 1:].reshape(-1)

            # Ignore padding positions
            select = gt != vocabulary['<PAD>']
            pred = pred[select]
            gt = gt[select]

            loss = loss_func(pred, gt)
            optim.zero_grad()
            loss.backward()
            optim.step()

            if i % 200 == 0:
                pred = pred.argmax(1)
                correct = (pred == gt).sum().item()
                accuracy = correct / len(pred)
                lr = optim.param_groups[0]['lr']
                print("epoch:{:02d} iter:{:04d} lr:{:0.5f} loss:{:.6f} accuracy:{:.6f}".format(epoch, i, lr, loss.item(), accuracy))
                if accuracy>0.999:               
                    torch.save(model.state_dict(), 'weights.pth')                    
                    if val(model,max_seq_len):
                        return
                    model.train()
        sched.step()

def main():
    vocab_size=len(vocabulary)
    max_seq_len=50
    hidden_size=32
    head_num=4
    ffn_size=64

    model = Transformer(vocab_size,max_seq_len,hidden_size,head_num,ffn_size).cuda()
    if not os.path.exists('weights.pth'):
        train(model,vocab_size,max_seq_len)

    model.load_state_dict(torch.load('weights.pth'))
    val(model,max_seq_len)

main()
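
The two masks used in the listing are easier to follow on a tiny example. The snippet below is a standalone sketch, not part of the script: it assumes the same token ids as the vocabulary above (`<PAD>`=0, `<SOS>`=11, `<EOS>`=12), uses a single head with random q, k, v, and scales by the square root of the per-head dimension as the comment in `attention` describes.

```python
import torch

PAD = 0  # assumed id of <PAD>, matching the vocabulary above

tokens = torch.tensor([[11, 1, 2, 12, 0, 0]])  # <SOS> 1 2 <EOS> <PAD> <PAD>
seqlen = tokens.shape[1]

# Padding mask: True where the key token is <PAD>
pad = (tokens == PAD).unsqueeze(1)  # [1,1,seqlen]
# Causal mask: True above the diagonal, so a query cannot see later keys
causal = torch.triu(torch.ones(seqlen, seqlen, dtype=torch.bool), diagonal=1)
mask = pad | causal  # [1,seqlen,seqlen], union of both masks

torch.manual_seed(0)
head_dim = 8
q = torch.randn(1, seqlen, head_dim)
k = torch.randn(1, seqlen, head_dim)
v = torch.randn(1, seqlen, head_dim)

# Scaled dot-product attention with masked positions set to -inf before softmax
score = q @ k.transpose(-1, -2) / head_dim ** 0.5
weights = torch.softmax(score.masked_fill(mask, float("-inf")), dim=-1)
out = weights @ v  # [1,seqlen,head_dim]

print(mask[0].int())
print(weights[0, 0])  # the first query can only attend to <SOS>
```

Every row keeps at least the `<SOS>` column unmasked, so the softmax never sees a row made up entirely of -inf.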

2. References

3. Output

epoch:00 iter:0000 lr:0.00100 loss:2.702654 accuracy:0.035052
epoch:00 iter:0200 lr:0.00100 loss:2.235599 accuracy:0.156250
epoch:00 iter:0400 lr:0.00100 loss:1.755795 accuracy:0.491614
epoch:00 iter:0600 lr:0.00100 loss:1.366574 accuracy:0.582178
epoch:00 iter:0800 lr:0.00100 loss:1.126728 accuracy:0.622449
epoch:00 iter:1000 lr:0.00100 loss:0.927277 accuracy:0.692060
epoch:00 iter:1200 lr:0.00100 loss:0.654229 accuracy:0.787000
epoch:00 iter:1400 lr:0.00100 loss:0.464968 accuracy:0.860489
epoch:00 iter:1600 lr:0.00100 loss:0.278878 accuracy:0.912574
epoch:00 iter:1800 lr:0.00100 loss:0.190133 accuracy:0.944201
epoch:00 iter:2000 lr:0.00100 loss:0.114423 accuracy:0.961466
epoch:00 iter:2200 lr:0.00100 loss:0.048222 accuracy:0.986538
epoch:00 iter:2400 lr:0.00100 loss:0.044525 accuracy:0.990079
epoch:00 iter:2600 lr:0.00100 loss:0.042696 accuracy:0.987230
epoch:00 iter:2800 lr:0.00100 loss:0.044729 accuracy:0.985632
epoch:00 iter:3000 lr:0.00100 loss:0.038000 accuracy:0.989279
epoch:01 iter:0000 lr:0.00100 loss:0.036089 accuracy:0.989691
epoch:01 iter:0200 lr:0.00100 loss:0.024531 accuracy:0.991935
epoch:01 iter:0400 lr:0.00100 loss:0.057585 accuracy:0.986373
epoch:01 iter:0600 lr:0.00100 loss:0.010734 accuracy:0.996040
epoch:01 iter:0800 lr:0.00100 loss:0.013841 accuracy:0.996939
epoch:01 iter:1000 lr:0.00100 loss:0.023186 accuracy:0.993562
epoch:01 iter:1200 lr:0.00100 loss:0.024035 accuracy:0.988000
epoch:01 iter:1400 lr:0.00100 loss:0.008321 accuracy:0.995927
epoch:01 iter:1600 lr:0.00100 loss:0.009785 accuracy:0.996071
epoch:01 iter:1800 lr:0.00100 loss:0.019487 accuracy:0.994530
epoch:01 iter:2000 lr:0.00100 loss:0.008566 accuracy:0.997180
epoch:01 iter:2200 lr:0.00100 loss:0.001866 accuracy:1.000000
Gt:   10293847566574839201
Pred: 10293847566574839201
Gt:   10293847566574839201
Pred: 10293847566574839201
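
The `Gt` line can be checked without the model. The sketch below applies the same NumPy expression that the script uses for the ground truth; `interleave_halves` is a hypothetical helper name introduced here. The string is split into two halves, which are then interleaved character by character.

```python
import numpy as np


def interleave_halves(s):
    """Split s into two halves (rows), transpose, and flatten: a1 b1 a2 b2 ..."""
    chars = np.array(list(s))
    return "".join(chars.reshape(2, -1).transpose(1, 0).reshape(-1))


print(interleave_halves("12345678900987654321"))  # 10293847566574839201
```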

From: https://blog.csdn.net/m0_61864577/article/details/137291065
