Tags: Transformers, tokenizer, weights, notes, model, base, usage, output, model

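Transformers is a well-known library of pretrained deep learning models. NLP models make up the largest share, but it also covers CV and other domains. It supports quick use and modification of pretrained models, and models can be moved between deep learning frameworks (PyTorch/TensorFlow/JAX). These notes follow the official HuggingFace tutorial: https://github.com/huggingface/transformers/blob/main/README_zh-hans.md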

Calling a task

Two lines of code are enough to run a wide range of tasks. Here is a sentiment analysis example:

from transformers import pipeline
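# Build a sentiment-analysis pipeline with an explicit checkpoint
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
# The call returns a list with one dict per input, holding 'label' and 'score'
print(classifier('We are very happy to introduce pipeline to the transformers repository.'))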

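The first argument of pipeline is the task type and the second is the name of the pretrained checkpoint. In the checkpoint name, distilbert-base means a smaller model distilled from base BERT; uncased means the model does not distinguish upper and lower case, and its tokenizer lowercases the input itself, so the text does not need to be lowercased by hand; finetuned-sst-2-english means the weights were fine-tuned on the English Stanford Sentiment Treebank in its two-class form (SST-2). If the name resolves to a folder in the current working directory, the files are read from that folder; otherwise the corresponding repository is downloaded from the HuggingFace website. If the automatic download fails, the weights and configuration files of distilbert-base-uncased-finetuned-sst-2-english can be fetched as follows: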

git lfs install
git clone https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
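The "local folder first, otherwise the Hub" rule can be sketched in a few lines. This is only an illustration under simplifying assumptions: describe_source is a hypothetical helper, not part of Transformers, and the real loading code also consults a local download cache and other options.

```python
import os
import tempfile


def describe_source(name_or_path: str) -> str:
    """Hypothetical helper: report where a checkpoint name would be read from."""
    # A name that is an existing directory is treated as a local checkpoint folder.
    if os.path.isdir(name_or_path):
        return "local"
    # Anything else is treated as a repository id on the HuggingFace Hub.
    return "hub"


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the folder produced by git clone.
    local_dir = os.path.join(tmp, "distilbert-base-uncased-finetuned-sst-2-english")
    os.makedirs(local_dir)
    local_result = describe_source(local_dir)

hub_result = describe_source("this-folder-does-not-exist/demo-model")
print(local_result, hub_result)
```

In practice this means the path of the cloned folder can be passed wherever a checkpoint name is expected.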

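The clone produces a folder containing the model structure file config.json, the weight file model.safetensors, the tokenizer configuration tokenizer_config.json, the vocabulary file vocab.txt, and so on. Some repositories also include tokenizer.json, the serialized form of the fast tokenizer. It holds the same token-to-id mapping that vocab.txt expresses through line order (a token's id is its zero-based line number), so for BERT-style models loading still works without tokenizer.json. Beyond the mapping, tokenizer.json usually also records normalization and pre-tokenization settings and the special-token definitions, and these do affect model results.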
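To make the role of vocab.txt concrete, the sketch below writes a tiny stand-in vocabulary to a temporary file and rebuilds the token-to-id mapping from line numbers. The seven tokens are made up for illustration; the real file is far longer, and real tokenizers also split words into subword pieces, which is not shown here.

```python
import os
import tempfile

# A tiny stand-in vocabulary; a real BERT vocab.txt has tens of thousands of lines.
sample_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "world", "!"]

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sample_tokens) + "\n")

    # The id of a token is the zero-based index of its line.
    with open(vocab_path, encoding="utf-8") as f:
        token_to_id = {line.rstrip("\n"): idx for idx, line in enumerate(f)}

# The inverse mapping is what decoding ids back to text relies on.
id_to_token = {idx: tok for tok, idx in token_to_id.items()}

# Tokens missing from the vocabulary fall back to [UNK].
tokens = ["[CLS]", "hello", "world", "!", "[SEP]"]
ids = [token_to_id.get(t, token_to_id["[UNK]"]) for t in tokens]
unknown_id = token_to_id.get("transformers", token_to_id["[UNK]"])
print(ids, unknown_id)
```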

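If cloning with git fails, you can download the individual files from the model page on the website and put them in one folder. Files ending in h5, weights, ckpt, pth, safetensors, or bin are all model weights; pth and bin are common PyTorch suffixes and h5 is the usual TensorFlow one. The storage format itself is not examined here, and one weight file is enough as long as it matches the framework used to load it. Transformers defaults to PyTorch, so the usual choice is the safetensors or bin file.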
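When files are downloaded by hand, it helps to check which weight file actually landed in the folder. The sketch below does that for the standard single-file names Transformers uses. find_weight_file and the priority order are assumptions made for this illustration, and sharded checkpoints (weights split across several files) are not handled.

```python
import os
import tempfile

# Standard weight file names, listed in an assumed priority order.
KNOWN_WEIGHT_FILES = [
    ("model.safetensors", "PyTorch (safetensors)"),
    ("pytorch_model.bin", "PyTorch (pickle)"),
    ("tf_model.h5", "TensorFlow"),
    ("flax_model.msgpack", "Flax/JAX"),
]


def find_weight_file(folder: str):
    """Hypothetical helper: return (file name, framework) for the first weight file found."""
    present = set(os.listdir(folder))
    for file_name, framework in KNOWN_WEIGHT_FILES:
        if file_name in present:
            return file_name, framework
    return None


with tempfile.TemporaryDirectory() as folder:
    # Simulate a folder where only the TensorFlow weights were downloaded.
    for name in ("config.json", "vocab.txt", "tf_model.h5"):
        open(os.path.join(folder, name), "w").close()
    found = find_weight_file(folder)

print(found)
```

A result that names the TensorFlow file is the situation where loading with PyTorch needs the weights to be converted first.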

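The API and the downloaded repository files show that Transformers keeps the pretrained weights, the configuration files, and the tokenizer together in one repository. This lets it build the model structure automatically and load the matching pretrained weights, so there is no need to write PyTorch code for a structure that fits the weights, which speeds up working with pretrained models.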
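The automatic construction can be imitated by hand: a configuration dictionary is turned into a config object, and the model class builds its layers from it. The sketch below uses a shrunken, made-up configuration so that it runs instantly without downloading anything; the field names follow DistilBertConfig, but the values are assumptions, and the resulting weights are random rather than pretrained.

```python
import json

from transformers import DistilBertConfig, DistilBertForSequenceClassification

# A shrunken, made-up config.json; field names follow DistilBertConfig.
config_json = """
{
  "architectures": ["DistilBertForSequenceClassification"],
  "vocab_size": 100,
  "dim": 32,
  "hidden_dim": 64,
  "n_layers": 1,
  "n_heads": 2,
  "max_position_embeddings": 16,
  "num_labels": 2
}
"""

config = DistilBertConfig.from_dict(json.loads(config_json))
# This builds the architecture only; the weights are randomly initialized.
model = DistilBertForSequenceClassification(config)

print(config.architectures)
print(model.pre_classifier.in_features, model.classifier.out_features)
```

from_pretrained performs both steps at once: it reads config.json, builds the structure, and then loads the weight file into it.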

Calling a pretrained model

To study the model's inference itself rather than run a specific task, use code like the following:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") #1
model = AutoModel.from_pretrained("bert-base-uncased") #2
inp = tokenizer("Hello world!", return_tensors="pt") #3
outp = model(**inp)

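Here #1 loads the tokenizer for bert-base-uncased and #2 loads the pretrained weights and builds the model. If only the h5 weights were downloaded but PyTorch is the backend, add from_tf=True to from_pretrained (the conversion needs TensorFlow installed as well). #3 tokenizes the input sentence and returns PyTorch tensors. With return_tensors="tf" the tokenizer returns tensors for a TensorFlow model, and the model should then be instantiated with TFAutoModel.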
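The returned object is easier to work with once its fields are known. The sketch below inspects them on a tiny, randomly initialized BertModel so that nothing has to be downloaded. The configuration sizes and token ids are made up, and the values in the tensors are meaningless; only the field names and shapes carry over to the real bert-base-uncased, whose hidden size is 768.

```python
import torch
from transformers import BertConfig, BertModel

# Made-up small sizes so the model builds instantly; weights are random.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=16,
)
model = BertModel(config).eval()

# One "sentence" of 5 arbitrary token ids, standing in for tokenizer output.
input_ids = torch.tensor([[2, 10, 11, 12, 3]])
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():  # inference only, no gradients needed
    outp = model(input_ids=input_ids, attention_mask=attention_mask)

print(outp.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
print(outp.pooler_output.shape)      # (batch, hidden_size)
```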

To process a batch, pass a list of texts to the tokenizer, for example:

texts = ["Hello world!", "Hello, how are you?"]
inp = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
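Sentences in a batch rarely have the same length, which is why padding=True is needed before they can be stacked into one tensor. The following is a minimal pure-Python sketch of that idea, assuming right padding with id 0; pad_batch is a hypothetical helper, the ids are arbitrary, and truncation to the model's maximum length is not shown.

```python
def pad_batch(sequences, pad_id=0):
    """Hypothetical helper: right-pad id lists to the longest one and build masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Arbitrary ids standing in for two tokenized sentences of different lengths.
batch = pad_batch([[2, 4, 5, 3], [2, 4, 6, 7, 8, 9, 3]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention_mask returned by the tokenizer plays the same role: it tells the model which positions are padding.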

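If two texts are passed to the tokenizer as a pair, it joins them into one sequence and marks which sentence each token belongs to through token_type_ids, as needed by sentence-pair tasks such as next sentence prediction. Tokens of the first sentence are marked 0 and tokens of the second are marked 1: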

texts = ["Hello world!", "Hello, how are you?"]
inp = tokenizer(*texts, return_tensors="pt", padding=True, truncation=True)
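Unpacking the list with * passes the first string as the text and the second as its pair. How the pair is laid out can be sketched without the library. build_pair is a hypothetical helper and the ids for [CLS] and [SEP] are made up; only the layout and the 0/1 segment pattern mirror what a BERT-style tokenizer produces.

```python
def build_pair(ids_a, ids_b, cls_id=2, sep_id=3):
    """Hypothetical helper: join two id lists the way BERT-style tokenizers do."""
    # Layout: [CLS] A [SEP] B [SEP]
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the last [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


# Arbitrary ids standing in for two tokenized sentences.
pair = build_pair([4, 5], [4, 6, 7])
print(pair["input_ids"])
print(pair["token_type_ids"])
```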

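Custom model inference

Look at config.json: the architectures field names the model class that the pretrained weights expect, and the remaining fields are the hyperparameters handed to that class through its configuration object. Together they let Transformers instantiate a structure that matches the weights and then load the weights into it. Using these files and the model classes built into Transformers (which inherit from nn.Module), we can write the model's data path ourselves. The earlier sentiment classification pipeline breaks down as follows: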

import torch
from torch import nn
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

text = "We are very happy to introduce pipeline to the transformers repository."
model_head_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_head_name).to(device)
tokenizer = DistilBertTokenizer.from_pretrained(model_head_name)
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)

with torch.no_grad():  # inference only, no gradients needed
    # Output of the DistilBERT backbone inside the model
    distilbert_output = model.distilbert(**inputs)
    # Use the first token ([CLS]) of the backbone output to compute the sentiment probability
    hidden_state = distilbert_output[0]  # (bs, seq_len, dim)
    pooled_output = hidden_state[:, 0]  # (bs, dim)
    pooled_output = model.pre_classifier(pooled_output)  # (bs, dim)
    pooled_output = nn.ReLU()(pooled_output)  # (bs, dim)
    pooled_output = model.dropout(pooled_output)  # (bs, dim)
    logits = model.classifier(pooled_output)  # (bs, num_labels)
# Index 1 is the positive class of this checkpoint
print("Positive rate: ", nn.Softmax(1)(logits)[0, 1].cpu().numpy())
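One way to gain confidence in a hand-written data path is to compare it with the model's own forward pass. The sketch below does this on a tiny, randomly initialized DistilBertForSequenceClassification so that it runs without a download or a GPU. The configuration sizes and token ids are made up, and the logits themselves are meaningless; the point is only that both paths agree when dropout is disabled.

```python
import torch
from torch import nn
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Made-up small sizes; weights are random, so the logits carry no meaning.
config = DistilBertConfig(
    vocab_size=100, dim=32, hidden_dim=64, n_layers=1, n_heads=2,
    max_position_embeddings=16, num_labels=2,
)
model = DistilBertForSequenceClassification(config).eval()  # eval() disables dropout

# One "sentence" of 5 arbitrary token ids, standing in for tokenizer output.
input_ids = torch.tensor([[2, 10, 11, 12, 3]])
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    # Manual data path, same steps as the decomposition above.
    hidden_state = model.distilbert(input_ids=input_ids, attention_mask=attention_mask)[0]
    pooled = model.pre_classifier(hidden_state[:, 0])
    pooled = model.dropout(nn.ReLU()(pooled))
    manual_logits = model.classifier(pooled)
    # Built-in forward pass for comparison.
    builtin_logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

same = torch.allclose(manual_logits, builtin_logits)
print(same)
```

Once the two agree, individual steps of the manual path can be changed or inspected, for example to read out the [CLS] vector before the classifier.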

From: https://www.cnblogs.com/qizhou/p/17640915.html