Hugging Face: the PreTrainedTokenizer class in transformers


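PreTrainedTokenizer is the base class of the tokenizer classes. It is not meant to be instantiated directly; the tokenizers of the pretrained models in transformers (for example BertTokenizer and RobertaTokenizer) all inherit from PreTrainedTokenizer and implement the base-class methods.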

Methods of the base class:

(1) The __call__ function:

__call__(
    text, text_pair, add_special_tokens, padding, truncation,
    max_length, stride, is_split_into_words, pad_to_multiple_of,
    return_tensors, return_token_type_ids,
    return_attention_mask,
    return_overflowing_tokens,
    return_special_tokens_mask,
    return_offsets_mapping,
    return_length,
    verbose,
    **kwargs
)

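This function returns a BatchEncoding object. BatchEncoding behaves like a Python dictionary (it subclasses collections.UserDict), so its fields are read by key, e.g. encodings["input_ids"]. In addition, the class has methods for mapping words and characters to tokens. The following uses BertTokenizer as an example to explain what each parameter means.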

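First is the text parameter: the sequence, or batch of sequences, to encode. The text_pair parameter is the second sequence (or batch of second sequences) of a sentence pair. Each can be a str, a list of str, or a list of lists of str.

If a sequence is passed as a list of already-split words (a list of str for one sequence, a list of lists of str for a batch), is_split_into_words should be set to True; otherwise a list of str is treated as a batch of separate sentences.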

[1] text is a single str:

tokenizer(text='The sailors rode the breeze clear of the rocks.')

The output then has the following form:

{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In input_ids, 101 is the id of [CLS] and 102 is the id of [SEP].

[2] text is a list:

tokenizer(text=["The sailors rode the breeze clear of the rocks."])
# Output:
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

[3] text is a list of lists of str (words that are already split); in this case set is_split_into_words=True:

tokenizer(text=[["The", "sailors", "rode", "the", "breeze", "clear", "of", "the", "rocks"]], is_split_into_words=True)
# Output:
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
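Whether a list is read as a batch or as one pre-split sentence depends on is_split_into_words. The sketch below paraphrases that rule in plain Python. The helper name is_batched is hypothetical and written only for illustration; it is not a function exported by transformers, and the library's actual checks may differ in detail between versions.

```python
def is_batched(text, is_split_into_words=False):
    """Decide whether `text` is a batch of sequences or a single sequence.

    Hypothetical helper for illustration only; it mirrors the rule described
    in the text above, not the exact library code.
    """
    if is_split_into_words:
        # Pre-split input: a batch is a non-empty list whose items are word lists.
        return (
            isinstance(text, (list, tuple))
            and len(text) > 0
            and isinstance(text[0], (list, tuple))
        )
    # Raw strings: any list or tuple is a batch of separate sentences.
    return isinstance(text, (list, tuple))


words = ["The", "sailors", "rode", "the", "breeze"]

print(is_batched("The sailors rode the breeze"))       # a single sentence
print(is_batched(words))                               # read as 5 one-word sentences
print(is_batched(words, is_split_into_words=True))     # one pre-split sentence
print(is_batched([words], is_split_into_words=True))   # batch holding one pre-split sentence
```

This is why forgetting is_split_into_words=True on a word list tends to produce one encoding per word instead of one encoding for the sentence.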

[4] text_pair must have the same format as text: if text is a list, text_pair must be a list as well.

tokenizer(text="The sailors rode the breeze clear of the rocks.",
          text_pair="I demand that the more John eat, the more he pays.")
# Output:
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102, 1045, 5157, 2008,
               1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
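To make the layout of the three fields easier to see, here is a toy reconstruction of the outputs shown above. build_inputs is a hypothetical helper, not part of transformers; the ids 101 and 102 and the word ids are copied from the example outputs, and the layout ([CLS] A [SEP] B [SEP], with token_type_ids 0 for the first segment and 1 for the second) is read off those outputs rather than taken from the library source.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] in the example outputs


def build_inputs(ids_a, ids_b=None):
    """Toy assembly of a BERT-style encoding (illustration only)."""
    first = [CLS_ID] + list(ids_a) + [SEP_ID]
    second = [] if ids_b is None else list(ids_b) + [SEP_ID]
    input_ids = first + second
    return {
        "input_ids": input_ids,
        # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP].
        "token_type_ids": [0] * len(first) + [1] * len(second),
        # Nothing is padded here, so every position is attended to.
        "attention_mask": [1] * len(input_ids),
    }


# Word ids copied from the example outputs, without the special tokens.
sentence_a = [1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012]
sentence_b = [1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012]

single = build_inputs(sentence_a)
pair = build_inputs(sentence_a, sentence_b)
print(single)
print(pair)
```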

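The add_special_tokens parameter controls whether the model-specific special tokens, such as [CLS] and [SEP], are added to the sequence (padding tokens such as [PAD] are controlled by the padding parameter instead). It defaults to True.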

# add_special_tokens=False: no special tokens are added
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.", add_special_tokens=False)
{'input_ids': [1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Encoding and then decoding shows the special tokens directly:

>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks."], add_special_tokens=True)
>>> tokenizer.batch_decode(encodings["input_ids"])
['[CLS] the sailors rode the breeze clear of the rocks. [SEP]']  # the [CLS] and [SEP] tokens appear in the decoded text
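The round trip can be imitated with a toy lookup table. The id-to-token mapping below is inferred from the example outputs, under the assumption that each word of the sentence maps to exactly one id; toy_decode is a hypothetical helper and only a rough stand-in for the real decode, which handles subwords and text clean-up more carefully. skip_special_tokens mirrors the parameter of the same name on the library's decode methods.

```python
# Toy vocabulary read off the example outputs (id -> token); an assumption.
ID_TO_TOKEN = {
    101: "[CLS]", 102: "[SEP]", 1996: "the", 11279: "sailors", 8469: "rode",
    9478: "breeze", 3154: "clear", 1997: "of", 5749: "rocks", 1012: ".",
}
SPECIAL_IDS = {101, 102}


def toy_decode(ids, skip_special_tokens=False):
    """Map ids back to text with the toy vocabulary (illustration only)."""
    tokens = [
        ID_TO_TOKEN[i]
        for i in ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    # Join with spaces, then re-attach the period to the preceding word.
    return " ".join(tokens).replace(" .", ".")


ids = [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
with_special = toy_decode(ids)
without_special = toy_decode(ids, skip_special_tokens=True)
print(with_special)
print(without_special)
```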

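The next parameter is padding, which controls whether sequences are padded. The truncation parameter controls whether they are truncated.

padding takes two kinds of values: (1) a boolean, True or False; (2) a str naming the padding strategy: 'longest', 'max_length' or 'do_not_pad'. True is equivalent to 'longest' and False to 'do_not_pad'.

truncation likewise takes two kinds of values; the str strategies are 'longest_first', 'only_first', 'only_second' and 'do_not_truncate'. True is equivalent to 'longest_first' and False to 'do_not_truncate'.

Both padding and truncation default to False.

If pad_to_multiple_of is set, sequences are padded to a multiple of the given value.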
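The padding strategies can be sketched in a few lines of plain Python. toy_pad is a hypothetical helper, not library code: it pads on the right, assumes the pad id is 0, and requires max_length to be given for 'max_length', all of which are simplifications made for this illustration.

```python
def toy_pad(batch, padding=False, max_length=None, pad_to_multiple_of=None, pad_id=0):
    """Right-pad a batch of id lists (illustration only; pad_id=0 is an assumption)."""
    if padding is False or padding == "do_not_pad":
        target = None
    elif padding is True or padding == "longest":
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("this toy version needs an explicit max_length")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")

    if target is not None and pad_to_multiple_of:
        # Round the target length up to the next multiple.
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of

    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = 0 if target is None else max(0, target - len(ids))
        input_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 1996, 102], [101, 1996, 11279, 8469, 102]]
print(toy_pad(batch, padding=True))
print(toy_pad(batch, padding="max_length", max_length=6))
print(toy_pad(batch, padding=True, pad_to_multiple_of=8))
```

Padding to a multiple is mostly useful for hardware that prefers aligned tensor shapes; the attention mask keeps the padded positions from being attended to.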

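The stride parameter is used together with max_length: when truncation=True and return_overflowing_tokens=True are set, the returned overflowing sequences contain some tokens from the end of the truncated sequence, so that the truncated and overflowing sequences overlap. The value of stride is the number of overlapping tokens.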
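The overlap that stride creates is easiest to see as a sliding window. The following is only a sketch of that idea: split_with_stride is a hypothetical helper, it ignores special tokens, and the library returns the overflow in its own format, which this toy version does not try to reproduce.

```python
def split_with_stride(ids, max_length, stride):
    """Cut `ids` into windows of at most `max_length` that overlap by `stride` ids."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks


chunks = split_with_stride(list(range(10)), max_length=4, stride=2)
for chunk in chunks:
    print(chunk)
```

Here the first window plays the role of the truncated sequence, and every later window starts with the last stride ids of the previous one, so no context is lost at the cut.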

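The return_tensors parameter selects the type of the returned data: 'tf' for TensorFlow tensors, 'pt' for PyTorch tensors, 'np' for NumPy arrays.

The return_offsets_mapping parameter controls whether to return, for each token, its (char_start, char_end) position in the original text. It is only available for Fast tokenizers such as BertTokenizerFast.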
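What an offsets mapping contains can be shown with a naive splitter. toy_offsets is a hypothetical helper that splits on words and punctuation with a regular expression; a real Fast tokenizer works on subword tokens, so its offsets can be finer-grained than the ones produced here.

```python
import re


def toy_offsets(text):
    """Return (token, (char_start, char_end)) pairs for a naive word/punctuation split."""
    return [
        (match.group(), (match.start(), match.end()))
        for match in re.finditer(r"\w+|[^\w\s]", text)
    ]


text = "The sailors rode."
offsets = toy_offsets(text)
for token, (start, end) in offsets:
    # Slicing the original text with an offset pair recovers the token.
    print(token, (start, end), repr(text[start:end]))
```

The end index is exclusive, which is what makes text[start:end] return exactly the token's characters.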

The next function to be introduced is encode().

From： https://www.cnblogs.com/justkeen/p/16651611.html
