Formal notice: for the original paper, see the title. If there is any infringement, please contact the author and this post will be taken down!
Abstract
1 Introduction
2 System Overview
3 Library Design
3.1 Lossless Tokenization
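This section of the paper defines lossless tokenization: the input is handled as a raw Unicode sequence, whitespace is escaped with the meta symbol ▁ (U+2581), and decoding is the exact inverse of encoding, so the original text is always recoverable from the pieces. Below is a minimal round-trip sketch using the official sentencepiece Python package; the corpus, file names, and vocab_size are illustrative choices for the demo, not values from the paper.

```python
import sentencepiece as spm

# A tiny illustrative corpus; real training uses a large raw-text file.
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write(
        "Hello world.\n"
        "This is a test.\n"
        "SentencePiece is a simple and language independent subword tokenizer.\n"
        "It also works as a detokenizer for neural text processing.\n"
        "Subword units help open vocabulary translation.\n"
        "The same text can be segmented in many ways.\n"
    )

# Train a small unigram model (the default model type).
spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="demo", vocab_size=60)

sp = spm.SentencePieceProcessor(model_file="demo.model")

text = "Hello world."
pieces = sp.encode(text, out_type=str)
print(pieces)                      # e.g. ['▁Hello', '▁world', '.'] -- the space survives as ▁
assert sp.decode(pieces) == text   # lossless: decode(encode(text)) == text
```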
3.2 Efficient subword training and segmentation
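The paper's point in this section is algorithmic: the textbook BPE merge loop rescans every adjacent symbol pair after each merge, giving O(N^2) behavior, while SentencePiece keeps pair counts in a binary heap to reach O(N log N), and uses lattice-based training and segmentation for the unigram model. The sketch below is the plain Sennrich-style baseline, written out to make clear what the heap avoids; it is not SentencePiece's implementation.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Naive BPE: `words` maps a word (tuple of symbols) to its corpus count.
    Each iteration rescans every adjacent pair -- the quadratic baseline that
    SentencePiece replaces with incremental binary-heap updates."""
    words = dict(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, count in words.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + count
        words = merged
    return merges

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w"): 6, ("n", "e", "w", "e", "r"): 3}
print(bpe_train(corpus, 4))  # merge list, most frequent pair first
```

Each pass over `pairs` here is the full rescan; the optimization the paper describes replaces it with heap updates that touch only the pairs adjacent to the merged symbol.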
3.3 Vocabulary id management
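SentencePiece manages the piece-to-id mapping itself, including reserved ids for meta symbols such as unknown, BOS, EOS, and padding, rather than delegating this to downstream NMT tooling. The Python API exposes the mapping directly; this sketch reuses the demo.model trained in the 3.1 example.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="demo.model")   # model from the 3.1 sketch

print(sp.get_piece_size())                                 # total vocabulary size
print(sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())  # reserved ids (-1 means disabled)
print(sp.piece_to_id("▁is"))                               # piece -> id (unk id if not in vocab)
print(sp.id_to_piece(5))                                   # id -> piece

ids = sp.encode("This is a test.")   # default out_type is int, i.e. vocabulary ids
print(ids, "->", sp.decode(ids))
```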
3.4 Customizable character normalization
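Normalization in SentencePiece is Unicode NFKC-based by default, precompiled into the model at training time, and customizable: a named rule set can be selected, or arbitrary codepoint-rewrite rules supplied as a TSV file. In the Python trainer these correspond to the normalization_rule_name and normalization_rule_tsv parameters; the rule choice and the full-width example below are illustrative.

```python
import sentencepiece as spm

# The default rule set is 'nmt_nfkc' (NFKC plus NMT-oriented whitespace cleanup).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="norm_demo", vocab_size=60,
    normalization_rule_name="nfkc_cf",  # NFKC + case folding; 'identity' disables normalization
)

sp = spm.SentencePieceProcessor(model_file="norm_demo.model")

# Full-width Latin letters are NFKC-normalized and case-folded before
# segmentation, so this behaves like encoding "hello".
print(sp.encode("ＨＥＬＬＯ", out_type=str))

# Custom rules would instead be compiled in from a TSV of codepoint rewrites:
# spm.SentencePieceTrainer.train(..., normalization_rule_tsv="my_rules.tsv")
```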
3.5 Self-contained models
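Self-contained here means the .model file is a single protocol buffer carrying everything needed to reproduce a segmentation exactly: the vocabulary, the training parameters, and the compiled normalization rules. One practical consequence, sketched below, is that a model can be shipped and loaded as an opaque byte blob; the sketch reuses demo.model from above and assumes the model_proto keyword of the current Python wrapper.

```python
import sentencepiece as spm

# The whole model -- vocab, parameters, normalization rules -- is one protobuf blob.
with open("demo.model", "rb") as f:
    blob = f.read()

# Load straight from bytes, e.g. after fetching the blob over a network.
sp = spm.SentencePieceProcessor(model_proto=blob)
print(sp.encode("Hello world.", out_type=str))
```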
3.6 Library API for on-the-fly processing
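The on-the-fly API matters for subword regularization: instead of pre-segmenting a corpus once, the encoder can be called inside the training loop and asked to sample a segmentation from the n-best lattice for each example. In the Python API this is the enable_sampling / nbest_size / alpha group of encode arguments, shown here with the demo model.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="demo.model")

# Deterministic (best-path) segmentation:
print(sp.encode("This is a test.", out_type=str))

# Sampled segmentations for subword regularization: each call may return a
# different tokenization (nbest_size=-1 samples from the full lattice,
# alpha controls the smoothing of the sampling distribution).
for _ in range(3):
    print(sp.encode("This is a test.", out_type=str,
                    enable_sampling=True, nbest_size=-1, alpha=0.1))
```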
4 Experiments
4.1 Comparison of different preprocessing
4.2 Segmentation performance
5 Conclusions