Hugging Face NLP Course Study Notes - 2. Using Hugging Face Transformers

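Notes:

First published: 2024-09-19
Official site: https://huggingface.co/learn/nlp-course/zh-CN/chapter2
About: reading notes that keep only the key points. Most of the text is excerpted from the original course and lightly polished.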

2. Using Hugging Face Transformers

Behind the pipeline

Let's start with an example:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

Raw text --> Tokenizer --> Model --> Post-processing (Predictions)
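As a rough illustration of how these stages hand data to one another, here is a toy sketch in plain Python. Everything in it is hypothetical: the functions toy_tokenize, toy_model and toy_postprocess, the tiny vocabulary, and the cue-word scoring rule are stand-ins invented for this example and are unrelated to the real tokenizer and model behind pipeline(). Only the data flow (text to token IDs, token IDs to raw scores, raw scores to a labeled probability) mirrors the diagram. Softmax is used in the last stage as one common way to turn raw scores into probabilities.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer ships with a far larger one.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "course": 5}
POSITIVE_IDS = {VOCAB["love"]}
NEGATIVE_IDS = {VOCAB["hate"]}
LABELS = ["NEGATIVE", "POSITIVE"]


def toy_tokenize(text):
    """Stage 1: raw text -> list of integer token IDs."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in words]


def toy_model(token_ids):
    """Stage 2: token IDs -> one raw score per label (here: cue-word counts)."""
    negative = sum(1.0 for t in token_ids if t in NEGATIVE_IDS)
    positive = sum(1.0 for t in token_ids if t in POSITIVE_IDS)
    return [negative, positive]


def toy_postprocess(scores):
    """Stage 3: raw scores -> softmax probabilities -> most likely label."""
    exps = [math.exp(s) for s in scores]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(texts):
    """Chain the three stages for every input text."""
    return [toy_postprocess(toy_model(toy_tokenize(t))) for t in texts]


results = toy_pipeline(["I love this course.", "I hate this so much!"])
print(results)
```

The real pipeline follows the same outline, but each stage is backed by a pretrained tokenizer and model rather than hand-written rules.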

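Preprocessing with a tokenizer

Like other neural networks, Transformer models cannot process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which is responsible for:

Splitting the input into words, subwords, or symbols (such as punctuation), which are called tokens
Mapping each token to an integer
Adding additional inputs that may be useful to the model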
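The three responsibilities listed above can be sketched with a toy tokenizer. This is a minimal sketch under stated assumptions, not the tokenizer used by any real checkpoint: TOY_VOCAB, split_text, split_word and encode are hypothetical names, the vocabulary is made up, and the greedy longest-match subword rule (with "##" marking a continuation piece) is only one possible splitting strategy.

```python
import re

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "hate": 5, "this": 6, "!": 7, "token": 8, "##izer": 9, "##s": 10,
}


def split_text(text):
    """Step 1a: split into lowercase words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def split_word(word):
    """Step 1b: greedily split one word into the longest pieces in TOY_VOCAB."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in TOY_VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece fits, so the whole word is unknown
        start = end
    return pieces


def encode(text):
    """Run all three steps and return both the tokens and their IDs."""
    tokens = [piece for word in split_text(text) for piece in split_word(word)]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]       # step 3: add special tokens
    ids = [TOY_VOCAB[token] for token in tokens]  # step 2: token -> integer
    return tokens, ids


tokens, ids = encode("I hate tokenizers!")
print(tokens)
print(ids)
```

Here the word "tokenizers" is not in the toy vocabulary as a whole, so it is split into the pieces "token", "##izer" and "##s", each of which has its own integer ID.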

We use the AutoTokenizer class and its from_pretrained() method to load the same tokenizer that was used when the model was trained.

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

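Once we have the tokenizer, we can pass our sentences to it directly and get back a dictionary that is ready to feed to the model. The only thing left to do is convert the lists of input IDs into tensors.

Transformers can use PyTorch, TensorFlow, or Flax as its backend.

Transformers models only accept tensors as input.

To specify the type of tensors we want returned (PyTorch, TensorFlow, or plain NumPy), we use the return_tensors argument: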

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{
    'input_ids': tensor([
        [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172, 2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,     0,     0,     0,     0,     0,     0]
    ]), 
    'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    ])
}

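The output itself is a dictionary with two keys, input_ids and attention_mask. input_ids contains two rows of integers (one per sentence), which are the unique identifiers of the tokens in each sentence. The course explains what attention_mask is later in the chapter.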
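In the printed output, the trailing zeros in the second row of input_ids line up exactly with the zeros in attention_mask. The plain-Python sketch below reproduces that pattern. It is only an illustration: pad_batch is a hypothetical helper rather than a library function, and the padding ID of 0 is an assumption read off the output above. The token IDs are taken from that output with the trailing zeros removed.

```python
# Token IDs from the output above, before padding.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
         17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]


def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

Padding to a common length is what lets the two sentences be stacked into one rectangular tensor, and the mask records which positions hold real tokens.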

Going through the model

We can download our pretrained model the same way we downloaded our tokenizer.

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

This architecture contains only the base Transformer module: given some inputs, it outputs what we’ll call hidden states, also known as features. For each model input, we’ll retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

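A high-dimensional vector?

The vector output by the Transformer module is usually large. It generally has three dimensions:

Batch size: the number of sequences processed at a time (2 in our example).
Sequence length: the length of the numerical representation of the sequence (16 in our example).
Hidden size: the vector dimension of each model input.

It is called high-dimensional because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models it can reach 3072 or more).

We can see this if we feed the inputs we preprocessed to our model: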

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])
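To make the three dimensions concrete without loading a model, the sketch below builds a nested list with the same layout. It is a toy under stated assumptions: toy_hidden_vector and shape_of are hypothetical helpers, and the vectors are pseudo-random numbers seeded by the token ID. A real Transformer produces context-dependent vectors, so the same token would generally get different values in different sentences. Only the shape matches the output above.

```python
import random

HIDDEN_SIZE = 768  # matches the last dimension printed above


def toy_hidden_vector(token_id):
    """Hypothetical stand-in for the model: HIDDEN_SIZE floats per token."""
    rng = random.Random(token_id)  # seeded, so the toy output is repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]


def shape_of(nested):
    """Return the dimensions of a regular nested list, outermost first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# A batch of 2 sequences, each padded to 16 token IDs (from the earlier output).
input_ids = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0],
]
hidden_states = [[toy_hidden_vector(t) for t in seq] for seq in input_ids]
print(shape_of(hidden_states))  # (batch size, sequence length, hidden size)
```

Reading the shape from the outside in gives one entry per sentence, then one entry per token position, then one vector of 768 numbers for that position.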


From： https://www.cnblogs.com/shizidushu/p/18420098
