首页 > 其他分享 >模型参数量计算(以transformer为例)

模型参数量计算(以transformer为例)

时间:2024-12-18 17:25:44浏览次数:10  
标签:transformer nn 为例 text dropout mask 量计算 model self

前言

模型中常见的可训练层包括卷积层和线性层,这里将给出计算公式并在pytorch下进行验证。

计算模型的参数:

import torch.nn as nn

def cal_params(model: nn.Module):
    num_learnable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    num_non_learnable_params = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return num_learnable_params, num_non_learnable_params

卷积层

\[n_\text{param} = (k_h \times k_w \times c_\text{in} + 1) \times c_\text{out} \]

其中,\(k_h,\ k_w\)表示卷积核的高和宽,\(c_\text{in},\ c_\text{out}\)表示输入和输出的通道数,\(+1\)表示偏置项。
举例:对于一个卷积层,输入通道数为3,输出通道数为64,卷积核大小为\(3\times3\),则参数量为:

\[n_\text{param} = (3 \times 3 \times 3 + 1) \times 64 = 1792 \]

测试


conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
print(cal_params(conv))
# (1792, 0)

线性层

线性层:\(y = Wx+b\),参数在权重\(W\)和偏置\(b\)中,参数量为:

\[n_\text{param} = (n_\text{in} + 1) \times n_\text{out} \]

举例:对于一个线性层,输入维度为64,输出维度为10,则参数量为:

\[n_\text{param} = (64 + 1) \times 10 = 650 \]

测试

linear = nn.Linear(in_features=64, out_features=10)
print(cal_params(linear))
# (650, 0)

Transformer的参数量计算

Encoder layer

结构图中,Encoder的每一层参数量由Multi-head attention层,Feed forward层和两层LayerNorm层组成,设特征的维度为\(d_\text{model}=d\)

Multi-head attention

共有四个线性层\(W_Q,\ W_K,\ W_V,\ W_O\),参数量为:$$n_\text{atten}=(d + 1) \times d \times 4=4d^2 + 4d$$

Feed forward

由两个线性层组成,中间隐藏层的维度为\(d_\text{ff}\),参数量为:\((d + 1) \times d_\text{ff} + (d_\text{ff} + 1) \times d\),论文中,\(d_\text{ff}=4d\),参数量为:

\[n_\text{ff}=(d + 1) \times 4d + (4d + 1) \times d=8d^2 + 5d \]

LayerNorm

两个LayerNorm层,每个LayerNorm层有两个参数\(\gamma,\ \beta\),参数量为:\(n_\text{ln}=2d\)

整个Encoder layer的参数量为:

\[n_\text{encoder layer} = n_\text{atten} + n_\text{ff} + 2n_\text{ln} = 4d^2 + 4d + 8d^2 + 5d + 4d = 12d^2 + 13d \]

举例:\(d_\text{model}=512\)时,Encoder layer的参数量为:

\[n_\text{encoder layer} = 12 \times 512^2 + 13 \times 512 = 3,152,384 \]

Encoder layer的实现:

class AddAndNorm(nn.Module):
    def __init__(self, d_model, dropout_prob, eps=1e-6):
        super().__init__()
        self.w = nn.Parameter(torch.ones(d_model))
        self.b = nn.Parameter(torch.zeros(d_model))
        self.eps = eps
        self.dropout = nn.Dropout(p=dropout_prob)

    def norm(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.w * (x - mean) / (std + self.eps) + self.b

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

def attention(q, k, v, mask=None, dropout=None):
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        masking_value = -1e9 if scores.dtype == torch.float32 else -1e4
        scores = scores.masked_fill(mask == 0, masking_value)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)

    return torch.matmul(p_attn, v), p_attn

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout_prob=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout_prob)

    def forward(self, q, k, v, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)  # 相同的mask应用于所有的注意力头h
        batch_size = q.size(0)

        # 1) 执行线性变换,将 d_model 维度的 x 分割成 h 个 d_k 维度
        q = self.w_q(q).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)

        # 2) 计算注意力
        x, self.attn = attention(q, k, v, mask=mask, dropout=self.dropout)

        # 3) 通过线性层连接多头注意力计算完的向量
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.w_o(x)

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout_prob=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(p=dropout_prob)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout_prob):
        super().__init__()
        self.attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.ff = FeedForward(d_model, d_ff, dropout_prob)
        self.sublayer = nn.ModuleList([AddAndNorm(d_model, dropout_prob) for _ in range(2)])

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.attn(x, x, x, mask))
        return self.sublayer[1](x, self.ff)

d_model = 512
d_ff = 4 * d_model
h = 8
dropout_prob = 0.1
encoder_layer = EncoderLayer(d_model, h, d_ff, dropout_prob)
print(cal_params(encoder_layer))
# (3152384, 0)

注意:在《Attention is All You Need》中 Add & Norm层的实现是:$${\rm LayerNorm}(x + {\rm Sublayer}(x))$$但可以看到,我们的代码实现方式为$$x+{\rm Sublayer(LayerNorm(x))}$$这在网上亦有讨论[Discussion],总的来说,后者的实现方式更加稳定,在许多实践中以后者为主。

Decoder layer

对比结构,decoder 每层的可训练参数相较于encoder layer多了一个Multi-head attention层和一个LayerNorm层,参数量为:

\[n_\text{decoder layer} = n_\text{encoder layer} + n_\text{atten} + n_\text{ln} \]

\[= (12d^2 + 13d) + (4d^2 + 4d) + 2d = 16d^2 + 19d \]

举例:\(d_\text{model}=512\)时,Decoder layer的参数量为:

\[n_\text{decoder layer} = 16 \times 512^2 + 19 \times 512 = 4,204,032 \]

测试

class DecoderLayer(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout_prob):
        super().__init__()
        self.self_attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.src_attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.ff = FeedForward(d_model, d_ff, dropout_prob)
        self.sublayer = nn.ModuleList([AddAndNorm(d_model, dropout_prob) for _ in range(3)])

    def forward(self, x, memory, src_mask, tgt_mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, memory, memory, src_mask))
        return self.sublayer[2](x, self.ff)
    d_model = 512
d_ff = 4 * d_model
h = 8
dropout_prob = 0.1
decoder_layer = DecoderLayer(d_model, h, d_ff, dropout_prob)
print(cal_params(decoder_layer))
# (4204032, 0)

Embedding层和生成层

Encoder和decoder分别使用一个embedding层,设encoder处理的词表大小为\(n_\text{src vocab}\),decoder处理的词表大小为\(n_\text{tgt vocab}\),特征维度为\(d_\text{model}=d\)。Encoder的embedding层参数量为:

\[n_\text{src emb}=n_\text{src vocab} \times d \]

Decoder的embedding层参数量为:

\[n_\text{tgt emb}=n_\text{tgt vocab} \times d \]

对于生成层,由于是一个线性层,参数量为:\(n_\text{gen}=(d + 1) \times n_\text{tgt vocab}\)。此外位置编码和残差连接没有可训练的参数。

举例:\(d_\text{model}=512\),\(n_\text{src vocab}=n_\text{tgt vocab}=10000\)时,Embedding层和生成层的参数量为:$$n_\text{src emb}=n_\text{tgt emb}=512 \times 10000=5,120,000$$

测试

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(num_embeddings=vocab, embedding_dim=d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

d_model = 512
d_ff = 4 * d_model
h = 8
dropout_prob = 0.1
vocal_size = 10000
embeddings = Embeddings(d_model, vocal_size)
print(cal_params(embeddings))
# (5120000, 0)

设encoder和decoder的层数为\(l\),则整个transformer的参数量为:

\[n = n_\text{src emb} + n_\text{tgt emb} + n_\text{gen} + l\times (n_\text{encoder layer} + n_\text{decoder layer}) \]

\[=n_\text{src vocab}\times d + n_\text{tgt vocab}\times d + (d + 1)\times n_\text{tgt vocab} + l\times (12d^2 + 13d + 16d^2 + 19d) \]

\[=d\times(n_\text{src vocab}+2n_\text{tgt vocab}) + n_\text{tgt vocab} + l\times(28d^2 + 32d) \]

举例:\(d_\text{model}=512,\ n_\text{src vocab}=n_\text{tgt vocab}=10000,\ l=6\)时,整个transformer的参数量为:

\[512\times30000 + 10000 + 6\times(28\times512^2 + 32\times512) = 59,508,496 \]

测试

class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model, d_ff, h, dropout_prob, layers):
        super().__init__()
        self.src_embed = nn.Sequential(Embeddings(d_model, src_vocab), PositionalEncoding(d_model, dropout_prob))
        self.tgt_embed = nn.Sequential(Embeddings(d_model, tgt_vocab), PositionalEncoding(d_model, dropout_prob))
        self.encoder = nn.ModuleList([EncoderLayer(d_model, h, d_ff, dropout_prob) for _ in range(layers)])
        self.decoder = nn.ModuleList([DecoderLayer(d_model, h, d_ff, dropout_prob) for _ in range(layers)])
        self.generator = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt, src_mask, tgt_mask):
        memory = self.encode(src, src_mask)
        return self.decode(memory, src_mask, tgt, tgt_mask)

    def encode(self, src, mask):
        x = self.src_embed(src)
        for layer in self.encoder:
            x = layer(x, mask)
        return x

    def decode(self, memory, src_mask, tgt, tgt_mask):
        x = self.tgt_embed(tgt)
        for layer in self.decoder:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.generator(x)


d_model = 512
d_ff = 4 * d_model
h = 8
layer = 6
dropout_prob = 0.1
vocal_size = 10000
model = Transformer(vocal_size, vocal_size, d_model, d_ff, h, dropout_prob, layer)
print(cal_params(model))
# (59508496, 0)

运行环境

Package                   Version
------------------------- -----------
torch                     2.5.0

完整的代码:

import torch
import math

from torch import nn

def cal_params(model: nn.Module):
    num_learnable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    num_non_learnable_params = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return num_learnable_params, num_non_learnable_params

class AddAndNorm(nn.Module):
    def __init__(self, d_model, dropout_prob, eps=1e-6):
        super().__init__()
        self.w = nn.Parameter(torch.ones(d_model))
        self.b = nn.Parameter(torch.zeros(d_model))
        self.eps = eps
        self.dropout = nn.Dropout(p=dropout_prob)

    def norm(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.w * (x - mean) / (std + self.eps) + self.b

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

def attention(q, k, v, mask=None, dropout=None):
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        masking_value = -1e9 if scores.dtype == torch.float32 else -1e4
        scores = scores.masked_fill(mask == 0, masking_value)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)

    return torch.matmul(p_attn, v), p_attn

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout_prob=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout_prob)

    def forward(self, q, k, v, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)  # 相同的mask应用于所有的注意力头h
        batch_size = q.size(0)

        # 1) 执行线性变换,将 d_model 维度的 x 分割成 h 个 d_k 维度
        q = self.w_q(q).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)

        # 2) 计算注意力
        x, self.attn = attention(q, k, v, mask=mask, dropout=self.dropout)

        # 3) 通过线性层连接多头注意力计算完的向量
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.w_o(x)

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout_prob=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(p=dropout_prob)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout_prob):
        super().__init__()
        self.attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.ff = FeedForward(d_model, d_ff, dropout_prob)
        self.sublayer = nn.ModuleList([AddAndNorm(d_model, dropout_prob) for _ in range(2)])

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.attn(x, x, x, mask))
        return self.sublayer[1](x, self.ff)

class DecoderLayer(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout_prob):
        super().__init__()
        self.self_attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.src_attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.ff = FeedForward(d_model, d_ff, dropout_prob)
        self.sublayer = nn.ModuleList([AddAndNorm(d_model, dropout_prob) for _ in range(3)])

    def forward(self, x, memory, src_mask, tgt_mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, memory, memory, src_mask))
        return self.sublayer[2](x, self.ff)

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(num_embeddings=vocab, embedding_dim=d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout_prob, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout_prob)

        # 计算位置编码
        pe = torch.zeros(max_len, d_model)  # Shape: max_len x d_model
        position = torch.arange(0, max_len).unsqueeze(1)  # Shape: max_len x 1
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000) / d_model))
        res = position * div_term  # Shape: max_len x d_model/2
        pe[:, 0::2] = torch.sin(res)
        pe[:, 1::2] = torch.cos(res)
        pe = pe.unsqueeze(0)  # Shape: 1 x max_len x d_model
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)].requires_grad_(False)
        return self.dropout(x)


class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model, d_ff, h, dropout_prob, layers):
        super().__init__()
        self.src_embed = nn.Sequential(Embeddings(d_model, src_vocab), PositionalEncoding(d_model, dropout_prob))
        self.tgt_embed = nn.Sequential(Embeddings(d_model, tgt_vocab), PositionalEncoding(d_model, dropout_prob))
        self.encoder = nn.ModuleList([EncoderLayer(d_model, h, d_ff, dropout_prob) for _ in range(layers)])
        self.decoder = nn.ModuleList([DecoderLayer(d_model, h, d_ff, dropout_prob) for _ in range(layers)])
        self.generator = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt, src_mask, tgt_mask):
        memory = self.encode(src, src_mask)
        return self.decode(memory, src_mask, tgt, tgt_mask)

    def encode(self, src, mask):
        x = self.src_embed(src)
        for layer in self.encoder:
            x = layer(x, mask)
        return x

    def decode(self, memory, src_mask, tgt, tgt_mask):
        x = self.tgt_embed(tgt)
        for layer in self.decoder:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.generator(x)

if __name__ == '__main__':
    d_model = 512
    d_ff = 4 * d_model
    h = 8
    layer = 6
    dropout_prob = 0.1
    vocal_size = 10000
    model = Transformer(vocal_size, vocal_size, d_model, d_ff, h, dropout_prob, layer)
    print(cal_params(model))

参考文献

  1. Pytorch 查看PyTorch模型中的总参数数量

标签:transformer,nn,为例,text,dropout,mask,量计算,model,self
From: https://www.cnblogs.com/zh-jp/p/18615421

相关文章

  • LLaMA (以LLaMA2为例,文末附加对比1 2 3 三个版本的变化)
    一、背景LLaMA2和LLaMA2-Chat参数规模:70亿、130亿和700亿数据和训练规模:上下文长度训练资源性能表现:二、预训练pretraining1.预训练数据·训练语料来自公开课用的数据源,不包括Meta的产品或服务数据·在2万亿个数据tokens上进行了训练·对真实的数据源进行上采......
  • 安装Ranger(以CentOS为例)用于统一权限管理的详细步骤
    一、前提准备系统要求操作系统:CentOS7或更高版本(这里以CentOS7为例)。确保系统已经安装并配置好基本的网络设置,能够访问互联网进行软件包下载。软件依赖:需要安装JavaDevelopmentKit(JDK),推荐版本为1.8或更高。可以使用以下命令检查是否安装了JDK:java-version如果没有安......
  • 常见的Linux系统(以Ubuntu为例)中安装Redis的步骤
    一、安装准备更新系统软件包列表在安装Redis之前,先更新系统的软件包列表,以确保可以获取最新版本的Redis及其依赖项。在终端中执行以下命令:sudoapt-getupdate这个命令会从软件源服务器获取最新的软件包信息,包括软件包的版本、依赖关系等更新内容。安装编译工具和依......
  • Linux环境下安装MapReduce(以Hadoop MapReduce为例)的详细步骤
    一、前提条件操作系统准备确保你有一个合适的Linux发行版,如Ubuntu、CentOS等。以CentOS为例,系统应该是比较新的版本,并且已经完成了基本的系统更新。安装好Java运行环境(JDK),因为Hadoop是基于Java开发的。你可以通过以下命令检查Java是否安装:java-version。如果没有安装,在CentO......
  • Linux系统下安装Yarn(以Hadoop Yarn为例)的详细步骤
    一、前提条件安装JavaYarn是基于Java开发的,需要先安装JavaDevelopmentKit(JDK)。你可以从Oracle官方网站(https://www.oracle.com/java/technologies/javase-jdk11-downloads.html)下载适合你系统的JDK版本。安装完成后,设置JAVA_HOME环境变量。例如,在bash环境下,将以下内容添......
  • 好,我们以你的 `euclidolap.proto` 文件为例,调整代码结构,让服务逻辑更清晰,同时将 `eucl
    好,我们以你的euclidolap.proto文件为例,调整代码结构,让服务逻辑更清晰,同时将euclidolap模块分离到独立文件中。假设文件结构调整我们将euclidolap.proto生成的代码放到src/euclidolap模块中,同时将服务端逻辑分开组织。最终文件结构如下:project/├──build.rs......
  • Transformer模型训练参数的逻辑关系就像一棵树的生长系统
    Transformer模型训练参数的逻辑关系就像一棵树的生长系统【核心结论】Transformer模型训练参数构成了一个复杂的系统,从基础配置到模型架构,再到训练策略,各参数之间相互关联,共同影响着模型的性能和训练效果。此图展示了Transformer模型训练参数之间的逻辑关系。基础配置(......
  • Transformer:Attention is all you need
    摘要transformer是一种新的网络架构,它放弃了传统的循环和卷积,提供了一种编码器和解码器网络结构来完成任务,主要用于翻译任务中。它的优点为:更少的训练时间,较好的泛用性。1介绍循环神经网络模型包括长短期记忆(LSTM)和门控制神经网络模型,被确立为序列模型和转导问题,推动了循环语......
  • NLP界大牛讲Transformer自然语言处理的经典书!,466页pdf及代码
    《Transformer自然语言处理实战》本书涵盖了Transformer在NLP领域的主要应用。内容介绍:首先介绍Transformer模型和HuggingFace生态系统。然后重点介绍情感分析任务以及TrainerAPI、Transformer的架构,并讲述了在多语言中识别文本内实体的任务,以及Transformer模型生成......
  • 超强 !顶会创新融合!基于 2D-SWinTransformer 的并行分类网络
    往期精彩内容:Python-凯斯西储大学(CWRU)轴承数据解读与分类处理基于FFT+CNN-BiGRU-Attention时域、频域特征注意力融合的轴承故障识别模型-CSDN博客基于FFT+CNN-Transformer时域、频域特征融合的轴承故障识别模型-CSDN博客Python轴承故障诊断(11)基于VMD+CNN-B......