
Bert_Doc BERT文档中英文对照版


README.md       https://github.com/google-research/bert

...

BERT

 

BERT模型的全称是:Bidirectional Encoder Representations from Transformers。从名字中可以看出,BERT模型的目标是利用大规模无标注语料进行训练,获得包含丰富语义信息的文本表示(Representation),然后在特定的NLP任务中对这一语义表示进行微调,最终应用到该NLP任务上。

***** New March 11th, 2020: Smaller BERT Models *****

2020年3月11日:更小的Bert模型

This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in Well-Read Students Learn Better: On the Importance of Pre-training Compact Models.

这是论文《Well-Read Students Learn Better: On the Importance of Pre-training Compact Models》中提到的24个较小BERT模型的发布版本(仅英语、不区分大小写、使用WordPiece掩码训练)。

We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.

我们已经证明,标准的BERT方案(包括模型架构和训练目标)在很宽的模型规模范围内都是有效的,而不仅限于BERT-Base和BERT-Large。较小的BERT模型适用于计算资源受限的环境,可以用与原始BERT模型相同的方式进行微调。不过,它们在知识蒸馏的场景下效果最好,即由一个更大、更准确的教师模型来产生微调标签。

Our goal is to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity.

You can download all 24 from here, or individually from the table below:

我们的目标是让计算资源较少的机构也能开展研究,并鼓励社区在增大模型容量之外寻找其他创新方向。

你可以从这里下载所有24个文件,也可以从下表中单独下载:

|      | H=128             | H=256             | H=512               | H=768              |
|------|-------------------|-------------------|---------------------|--------------------|
| L=2  | 2/128 (BERT-Tiny) | 2/256             | 2/512               | 2/768              |
| L=4  | 4/128             | 4/256 (BERT-Mini) | 4/512 (BERT-Small)  | 4/768              |
| L=6  | 6/128             | 6/256             | 6/512               | 6/768              |
| L=8  | 8/128             | 8/256             | 8/512 (BERT-Medium) | 8/768              |
| L=10 | 10/128            | 10/256            | 10/512              | 10/768             |
| L=12 | 12/128            | 12/256            | 12/512              | 12/768 (BERT-Base) |

...

Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model.

Here are the corresponding GLUE scores on the test set:

请注意,本次发布中的BERT-Base模型仅为完整性而包含;它是按与原始模型相同的训练方案重新训练的。

以下是测试集上对应的GLUE分数:

| Model | Score | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI-m | MNLI-mm | QNLI(v2) | RTE | WNLI | AX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Tiny | 64.2 | 0.0 | 83.2 | 81.1/71.1 | 74.3/73.6 | 62.2/83.4 | 70.2 | 70.3 | 81.5 | 57.2 | 62.3 | 21.0 |
| BERT-Mini | 65.8 | 0.0 | 85.9 | 81.1/71.8 | 75.4/73.3 | 66.4/86.2 | 74.8 | 74.3 | 84.1 | 57.9 | 62.3 | 26.1 |
| BERT-Small | 71.2 | 27.8 | 89.7 | 83.4/76.2 | 78.8/77.0 | 68.1/87.0 | 77.6 | 77.0 | 86.4 | 61.8 | 62.3 | 28.6 |
| BERT-Medium | 73.5 | 38.0 | 89.6 | 86.6/81.6 | 80.4/78.4 | 69.6/87.9 | 80.0 | 79.1 | 87.7 | 62.2 | 62.3 | 30.5 |

...

For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs:

对于每个任务,我们从下面的列表中选择了最佳的微调超参数,并进行了4个周期的训练:

batch sizes: 8, 16, 32, 64, 128

learning rates: 3e-4, 1e-4, 5e-5, 3e-5

批处理量:8, 16, 32, 64, 128

学习率: 3e-4, 1e-4, 5e-5, 3e-5

If you use these models, please cite the following paper:

@article{turc2019,

  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},

  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},

  journal={arXiv preprint arXiv:1908.08962v2 },

  year={2019}

}

如果您使用这些模型,请引用以下论文:

@article{turc2019,

  标题={博学的学生学得更好:关于预训练紧凑模型的重要性},

  作者={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},

  期刊={arXiv预印本 arXiv:1908.08962v2},
  年份={2019}
}

 

***** New May 31st, 2019: Whole Word Masking Models *****

*****2019年5月31日:整词掩码(Whole Word Masking)模型*****

This is a release of several new models which were the result of an improvement the pre-processing code.

这是改进预处理代码后的几个新模型的版本。

In the original pre-processing code, we randomly select WordPiece tokens to mask. For example:

Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head 

Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head

在原始的预处理代码中,我们随机选择WordPiece词元进行掩码。例如:

输入文本:the man jumped up , put his basket on phil ##am ##mon ' s head

原始掩码输入:[MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head

The new technique is called Whole Word Masking. In this case, we always mask all of the tokens corresponding to a word at once. The overall masking rate remains the same.

这种新技术被称为整词掩码(Whole Word Masking)。在这种情况下,我们总是一次性掩码对应同一个单词的所有词元。总体掩码率保持不变。

Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head

整词掩码输入:the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head

The training is identical -- we still predict each masked WordPiece token independently. The improvement comes from the fact that the original prediction task was too 'easy' for words that had been split into multiple WordPieces.

训练过程不变——我们仍然独立地预测每个被掩码的WordPiece词元。这一改进源于:对于被切分成多个WordPiece的单词,原来的预测任务太“容易”了。

This can be enabled during data generation by passing the flag --do_whole_word_mask=True to create_pretraining_data.py.

这可以在数据生成时通过向create_pretraining_data.py传入--do_whole_word_mask=True标志来启用。

Pre-trained models with Whole Word Masking are linked below. The data and training were otherwise identical, and the models have identical structure and vocab to the original models. We only include BERT-Large models. When using these models, please make it clear in the paper that you are using the Whole Word Masking variant of BERT-Large.

使用整词掩码的预训练模型链接如下。除此之外,数据和训练过程完全相同,模型与原始模型具有相同的结构和词表。我们只提供BERT-Large模型。使用这些模型时,请在论文中明确说明您使用的是BERT-Large的整词掩码(Whole Word Masking)变体。

BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters

BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters

BERT-Large,不区分大小写(整词掩码):24层,1024隐藏维度,16个注意力头,3.4亿参数

BERT-Large,区分大小写(整词掩码):24层,1024隐藏维度,16个注意力头,3.4亿参数

| Model | SQuAD 1.1 F1/EM | MultiNLI Accuracy |
|---|---|---|
| BERT-Large, Uncased (Original) | 91.0/84.3 | 86.05 |
| BERT-Large, Uncased (Whole Word Masking) | 92.8/86.7 | 87.07 |
| BERT-Large, Cased (Original) | 91.5/84.8 | 86.09 |
| BERT-Large, Cased (Whole Word Masking) | 92.9/86.7 | 86.46 |

发布日志

***** New February 7th, 2019: TfHub Module *****

*****2019年2月7日: TfHub模块*****

BERT has been uploaded to TensorFlow Hub. See run_classifier_with_tfhub.py for an example of how to use the TF Hub module, or run an example in the browser on Colab.

BERT已上传到TensorFlow Hub。请参阅run_classifier_with_tfhub.py中如何使用TF Hub模块的示例,或在Colab的浏览器中运行一个示例。

***** New November 23rd, 2018: Un-normalized multilingual model + Thai + Mongolian *****

*****2018年11月23日:非规范化的多语言模型 + 泰语 + 蒙古语*****

We uploaded a new multilingual model which does not perform any normalization on the input (no lower casing, accent stripping, or Unicode normalization), and additionally includes Thai and Mongolian.

我们上传了一个新的多语言模型,它不对输入做任何规范化处理(不做小写转换、重音去除或Unicode规范化),并且额外加入了泰语和蒙古语。

It is recommended to use this version for developing multilingual models, especially on languages with non-Latin alphabets.

This does not require any code changes, and can be downloaded here:

BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters

建议使用这个版本来开发多语言模型,特别是在具有非拉丁字母的语言上。 这不需要任何代码更改,可以在这里下载:

BERT-Base,多语言,区分大小写(Multilingual Cased):104种语言,12层,768隐藏维度,12个注意力头,1.1亿参数

***** New November 15th, 2018: SOTA SQuAD 2.0 System *****

*****2018年11月15日: SOTA SQuAD 2.0系统*****

We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is currently 1st place on the leaderboard by 3%. See the SQuAD 2.0 section of the README for details.

我们发布了代码更改来重现我们83%的F1 SQuAD 2.0系统,它目前以3%的优势排名排行榜第一。有关详细信息,请参阅自述文件中的SQuAD 2.0部分。

***** New November 5th, 2018: Third-party PyTorch and Chainer versions of BERT available *****

*****2018年11月5日:第三方PyTorch和Chainer版本的BERT发布*****

NLP researchers from HuggingFace made a PyTorch version of BERT available which is compatible with our pre-trained checkpoints and is able to reproduce our results. Sosuke Kobayashi also made a Chainer version of BERT available (Thanks!) We were not involved in the creation or maintenance of the PyTorch implementation so please direct any questions towards the authors of that repository.

来自HuggingFace的NLP研究人员提供了一个PyTorch版本的BERT,它与我们的预训练检查点兼容,并能够复现我们的结果。Sosuke Kobayashi还提供了一个Chainer版本的BERT(谢谢!)。我们没有参与PyTorch实现的创建或维护,因此相关问题请直接联系该仓库的作者。

***** New November 3rd, 2018: Multilingual and Chinese models available *****

*****2018年11月3日:多语言和中文模型发布*****

We have made two new BERT models available:

· BERT-Base, Multilingual (Not recommended, use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters

· BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

我们已经提供了两种新的BERT模型:

· BERT-Base,多语言(不推荐,请改用Multilingual Cased版本):102种语言,12层,768隐藏维度,12个注意力头,1.1亿参数

· BERT-Base,中文:简体和繁体中文,12层,768隐藏维度,12个注意力头,1.1亿参数

We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages. Both models should work out-of-the-box without any code changes. We did update the implementation of BasicTokenizer in tokenization.py to support Chinese character tokenization, so please update if you forked it. However, we did not change the tokenization API.

For more, see the Multilingual README.

我们对中文使用基于字符的分词,对其他所有语言使用WordPiece分词。这两个模型都应开箱即用,无需任何代码更改。我们确实更新了tokenization.py中BasicTokenizer的实现以支持中文字符分词,所以如果你fork过该文件,请更新。不过,我们没有改变分词API。欲了解更多信息,请参阅多语言README(Multilingual README)。

***** End new information *****

*****结束*****

 

BERT简介

Introduction

BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

简介

BERT,也称为基于Transformer的双向编码器的文本表示,是一种新的预训练语言表示方法,它在广泛的自然语言处理(NLP)任务上获得最先进的结果。

Our academic paper which describes BERT in detail and provides full results on a number of tasks can be found here: https://arxiv.org/abs/1810.04805.

我们的学术论文详细描述了BERT,并提供了许多任务的完整结果,可以在这里找到: https://arxiv.org/abs/1810.04805。

 

To give a few numbers, here are the results on the SQuAD v1.1 question answering task:

为了给出一些数字,以下是SQuAD v1.1问答任务的结果:

| SQuAD v1.1 Leaderboard (Oct 8th 2018) | Test EM | Test F1 |
|---|---|---|
| 1st Place Ensemble - BERT | 87.4 | 93.2 |
| 2nd Place Ensemble - nlnet | 86.0 | 91.7 |
| 1st Place Single Model - BERT | 85.1 | 91.8 |
| 2nd Place Single Model - nlnet | 83.5 | 90.1 |

And several natural language inference tasks:

以及几个自然语言推理任务:

| System | MultiNLI | Question NLI | SWAG |
|---|---|---|---|
| BERT | 86.7 | 91.1 | 86.3 |
| OpenAI GPT (Prev. SOTA) | 82.2 | 88.1 | 75.0 |

Plus many other tasks.

还有许多其他任务。

Moreover, these results were all obtained with almost no task-specific neural network architecture design.

If you already know what BERT is and you just want to get started, you can download the pre-trained models and run a state-of-the-art fine-tuning in only a few minutes.

此外,这些结果都是在几乎没有特定任务的神经网络结构设计下获得的。

如果你已经知道什么是BERT,你只是想开始,你可以下载预先训练过的模型,并在短短几分钟内运行一个最先进的微调。

What is BERT?

BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.

什么是BERT?

BERT是一种预训练语言表示的方法,这意味着我们在一个大型文本语料库(如维基百科)上训练一个通用的“语言理解”模型,然后将该模型用于我们关心的下游NLP任务(如问题回答)。

BERT优于以前的方法,因为它是第一个无监督的、深度双向的预训练NLP系统。

Unsupervised means that BERT was trained using only a plain text corpus, which is important because an enormous amount of plain text data is publicly available on the web in many languages.

无监督意味着BERT只使用纯文本语料库进行训练,这很重要,因为大量的纯文本数据可以在网络上以许多语言公开获得。

Pre-trained representations can also either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single "word embedding" representation for each word in the vocabulary, so bank would have the same representation in bank deposit and river bank. Contextual models instead generate a representation of each word that is based on the other words in the sentence.

预先训练的表示也可以是上下文无关的或上下文有关的,而上下文表示进一步可以是单向的或双向的。

无上下文模型,如word2vec或GloVe,为词汇表中的每个单词生成一个单一的“单词嵌入”表示,因此bank 在bank deposit(银行存款)和river bank(河岸)中将具有相同的表示。相反,上下文模型基于句子中的其他单词生成每个单词的表示。

BERT was built upon recent work in pre-training contextual representations — including Semi-supervised Sequence LearningGenerative Pre-TrainingELMo, and ULMFit — but crucially these models are all unidirectional or shallowly bidirectional. This means that each word is only contextualized using the words to its left (or right). For example, in the sentence I made a bank deposit the unidirectional representation of bank is only based on I made a but not deposit. Some previous work does combine the representations from separate left-context and right-context models, but only in a "shallow" manner. BERT represents "bank" using both its left and right context — I made a ... deposit — starting from the very bottom of a deep neural network, so it is deeply bidirectional.

BERT建立在近期预训练上下文表示工作的基础上——包括半监督序列学习(Semi-supervised Sequence Learning)、生成式预训练(Generative Pre-Training)、ELMo和ULMFit——但关键在于,这些模型都是单向的或浅层双向的。也就是说,每个单词只利用其左侧(或右侧)的单词来建立上下文。

例如,在句子 I made a bank deposit 中,bank 的单向表示只基于 I made a,而不包括 deposit。此前的一些工作确实把独立的左上下文模型和右上下文模型的表示结合起来,但只是以一种“浅层”的方式。BERT从深度神经网络的最底层开始,就同时使用左右上下文来表示 bank(I made a ... deposit),因此它是深度双向的。

BERT uses a simple approach for this: We mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:

Input: the man went to the [MASK1] . he bought a [MASK2] of milk.

Labels: [MASK1] = store; [MASK2] = gallon

BERT为此使用了一种简单的方法:我们掩码掉输入中15%的单词,把整个序列送入一个深度双向Transformer编码器,然后只预测被掩码的单词。

例如:

输入:the man went to the [MASK1] . he bought a [MASK2] of milk.

标签:[MASK1] = store(商店);[MASK2] = gallon(加仑)
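下面是一个极简的示意代码(仅用于说明思路,并非本仓库的实际实现;真正的掩码逻辑在create_pretraining_data.py中,还包含80%/10%/10%的替换策略等细节),展示“随机选取约15%的词元进行掩码”:

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """随机掩码约mask_prob比例的词元,返回掩码后的序列和被掩码位置的原词。"""
    masked = list(tokens)
    labels = {}
    for i in range(len(tokens)):
        if random.random() < mask_prob:
            labels[i] = tokens[i]      # 记录原词,作为预测目标
            masked[i] = mask_token     # 用[MASK]替换
    return masked, labels

tokens = "the man went to the store . he bought a gallon of milk .".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)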

In order to learn relationships between sentences, we also train on a simple task which can be generated from any monolingual corpus: Given two sentences A and B, is B the actual next sentence that comes after A, or just a random sentence from the corpus?

为了学习句子之间的关系,我们还训练了一个可以从任何单语语料库生成的简单任务:

给定两个句子A和B,B是语料中真实跟在A后面的下一句,还是语料库中随机抽取的一句?

Sentence A: the man went to the store .

Sentence B: he bought a gallon of milk .

Label: IsNextSentence

Sentence A: the man went to the store .

Sentence B: penguins are flightless .

Label: NotNextSentence

句子A:那个人去了商店。

句子B:他买了一加仑牛奶。

标签:B是A的下一句 

句子A:那个人去了商店。

句子B:企鹅是不会飞的。

标签:B不是A的下一句(B和A无关联)
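下面是构造“下一句预测”训练样本的一个示意代码(仅为示意,假设文档已切分为句子列表;实际实现见create_pretraining_data.py):

import random

def make_nsp_example(sentences, idx):
    """以50%概率取真实下一句,否则随机取一句,并给出对应标签。"""
    sent_a = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        sent_b = sentences[idx + 1]
        label = "IsNextSentence"
    else:
        sent_b = random.choice(sentences)   # 简化处理:从语料中随机取一句
        label = "NotNextSentence"
    return sent_a, sent_b, label

sentences = [
    "the man went to the store .",
    "he bought a gallon of milk .",
    "penguins are flightless .",
]
print(make_nsp_example(sentences, 0))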

We then train a large model (12-layer to 24-layer Transformer) on a large corpus (Wikipedia + BookCorpus) for a long time (1M update steps), and that's BERT.

Using BERT has two stages: Pre-training and fine-tuning.

然后,我们在一个大型语料库(维基百科+图书语料库)上训练一个大型模型(12层到24层的转换器)很长时间(100万的更新步骤),这就是BERT。 使用BERT有两个阶段:预训练和微调

Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language (current models are English-only, but multilingual models will be released in the near future). We are releasing a number of pre-trained models from the paper which were pre-trained at Google. Most NLP researchers will never need to pre-train their own model from scratch.

预训练相当昂贵(在4到16块Cloud TPU上需要4天),但对每种语言来说都是一次性的工作(目前发布的模型仅支持英语,多语言模型将在不久的将来发布)。

我们发布了一些在谷歌进行的预训练的模型。

大多数NLP研究人员将永远不需要从头开始预先训练他们自己的模型。

Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.

微调的开销相对较小。从完全相同的预训练模型出发,论文中的所有结果在单块Cloud TPU上最多1小时、在GPU上几个小时即可复现。

例如,SQuAD在单块Cloud TPU上训练大约30分钟即可达到91.0%的Dev F1分数,这是单系统的最先进水平。

The other important aspect of BERT is that it can be adapted to many types of NLP tasks very easily. In the paper, we demonstrate state-of-the-art results on sentence-level (e.g., SST-2), sentence-pair-level (e.g., MultiNLI), word-level (e.g., NER), and span-level (e.g., SQuAD) tasks with almost no task-specific modifications.

BERT的另一个重要方面是,它可以很容易地适应许多类型的NLP任务

在本文中,我们演示了句子级(如SST-2)、句子对级(如MultiNLI)、单词级(如NER)和跨度级(如SQuAD)任务的最新的结果,几乎没有特定任务的修改。

 

仓库内容介绍

 

What has been released in this repository?

在这个仓库中已经发布了什么?

We are releasing the following:

· TensorFlow code for the BERT model architecture (which is mostly a standard Transformer architecture).

· Pre-trained checkpoints for both the lowercase and cased version of BERT-Base and BERT-Large from the paper.

· TensorFlow code for push-button replication of the most important fine-tuning experiments from the paper, including SQuAD, MultiNLI, and MRPC.

我们将发布以下内容:

· BERT模型架构的TensorFlow代码(基本上是标准的Transformer架构)。

· 论文中BERT-Base和BERT-Large的不区分大小写(uncased)和区分大小写(cased)版本的预训练检查点。

· 用于一键复现论文中最重要微调实验的TensorFlow代码,包括SQuAD、MultiNLI和MRPC。

All of the code in this repository works out-of-the-box with CPU, GPU, and Cloud TPU.

这个存储库中的所有代码都可以在CPU、GPU和Cloud TPU上运行。

Pre-trained models

We are releasing the BERT-Base and BERT-Large models from the paper. Uncased means that the text has been lowercased before WordPiece tokenization, e.g., John Smith becomes john smith. The Uncased model also strips out any accent markers. 

Cased means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging).

预训练的模型

我们发布的是论文中的BERT-Base和BERT-Large模型。

Uncased(不区分大小写)表示文本在WordPiece分词之前已转为小写,例如John Smith变为john smith;该模型同时会去掉所有重音符号。

Cased(区分大小写)表示保留原始大小写和重音符号。通常Uncased模型效果更好,除非你确定大小写信息对你的任务很重要(例如命名实体识别或词性标注)。

These models are all released under the same license as the source code (Apache 2.0).

For information about the Multilingual and Chinese model, see the Multilingual README.

这些模型都是在与源代码(Apache 2.0)相同的许可下发布的。 有关多语言和中文模型的信息,请参阅多语言自述文件。

When using a cased model, make sure to pass --do_lower_case=False to the training scripts. (Or pass do_lower_case=False directly to FullTokenizer if you're using your own script.)

The links to the models are here (right-click, 'Save link as...' on the name):

当使用区分大小写(cased)模型时,请确保向训练脚本传入--do_lower_case=False。(如果使用自己的脚本,则直接把do_lower_case=False传给FullTokenizer。)

模型的链接在这里(右键单击名称上的“链接另存为……”进行下载):

版本

BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters

BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters

BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters

BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters

BERT-Base, Cased: 12-layer, 768-hidden, 12-heads , 110M parameters

BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters

BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters

BERT-Base, Multilingual Uncased (Orig, not recommended) (Not recommended, use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters

BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Each .zip file contains three items:

· A TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (which is actually 3 files).

· A vocab file (vocab.txt) to map WordPiece to word id.

· A config file (bert_config.json) which specifies the hyperparameters of the model.

· BERT-Large,不区分大小写(整词掩码):24层,1024隐藏维度,16个注意力头,3.4亿参数

· BERT-Large,区分大小写(整词掩码):24层,1024隐藏维度,16个注意力头,3.4亿参数

· BERT-Base,不区分大小写:12层,768隐藏维度,12个注意力头,1.1亿参数

· BERT-Large,不区分大小写:24层,1024隐藏维度,16个注意力头,3.4亿参数

· BERT-Base,区分大小写:12层,768隐藏维度,12个注意力头,1.1亿参数

· BERT-Large,区分大小写:24层,1024隐藏维度,16个注意力头,3.4亿参数

· BERT-Base,多语言,区分大小写(新,推荐):104种语言,12层,768隐藏维度,12个注意力头,1.1亿参数

· BERT-Base,多语言,不区分大小写(原始版,不推荐,请改用Multilingual Cased):102种语言,12层,768隐藏维度,12个注意力头,1.1亿参数

· BERT-Base,中文:简体和繁体中文,12层,768隐藏维度,12个注意力头,1.1亿参数

每个.zip文件都包含三个文件:

· 一个包含预训练权重的TensorFlow检查点(bert_model.ckpt,实际上是3个文件)。

· 一个把WordPiece映射到词id的词表文件(vocab.txt)。

· 一个指定模型超参数的配置文件(bert_config.json)。

 

 

 

BERT的使用

Fine-tuning with BERT

BERT微调

Important: All results on the paper were fine-tuned on a single Cloud TPU, which has 64GB of RAM. It is currently not possible to re-produce most of the BERT-Large results on the paper using a GPU with 12GB - 16GB of RAM, because the maximum batch size that can fit in memory is too small. We are working on adding code to this repository which allows for much larger effective batch size on the GPU. See the section on out-of-memory issues for more details.

重要提示:论文上的所有结果都是在一个云TPU上进行微调的,它有64GB的内存。

目前还无法用12GB-16GB内存的GPU复现论文中的大多数BERT-Large结果,因为能装进显存的最大批处理量太小了。我们正在向这个仓库中添加代码,以便在GPU上使用更大的有效批处理量。

有关更多细节,请参阅有关内存不足问题的部分。

This code was tested with TensorFlow 1.11.0. It was tested with Python2 and Python3 (but more thoroughly with Python2, since this is what's used internally in Google).

The fine-tuning examples which use BERT-Base should be able to run on a GPU that has at least 12GB of RAM using the hyperparameters given.

该代码用TensorFlow 1.11.0进行了测试。

它是用Python2和Python3进行测试的(但更彻底的是用Python2进行的,因为这是谷歌内部使用的)。

使用BERT-Base的微调示例应该能够在一个使用给出的超参数的至少有12GB RAM的GPU上运行。

Fine-tuning with Cloud TPUs

使用云TPU进行微调

Most of the examples below assumes that you will be running training/evaluation on your local machine, using a GPU like a Titan X or GTX 1080.

However, if you have access to a Cloud TPU that you want to train on, just add the following flags to run_classifier.py or run_squad.py:

  --use_tpu=True \

  --tpu_name=$TPU_NAME

下面的大多数示例都假设您将在本地机器上运行训练/评估,使用Titan X或GTX 1080等GPU。

但是,如果您可以访问您想要训练的云TPU,只需在run_classifier.py或run_squad.py上添加以下标志:

  --use_tpu=True \

  --tpu_name=$TPU_NAME

Please see the Google Cloud TPU tutorial for how to use Cloud TPUs. Alternatively, you can use the Google Colab notebook "BERT FineTuning with Cloud TPUs".

On Cloud TPUs, the pretrained model and the output directory will need to be on Google Cloud Storage. For example, if you have a bucket named some_bucket, you might use the following flags instead:

  --output_dir=gs://some_bucket/my_output_dir/

请参见谷歌Cloud TPU教程,了解如何使用云TPU。或者,你也可以使用谷歌Colab笔记本“BERT微调与云TPU ”。

在云TPU 上,预训练的模型和输出目录需要在谷歌云存储上。例如,如果您有一个名为some_bucket的桶,您可以使用以下标志:

  --output_dir=gs://some_bucket/my_output_dir/

The unzipped pre-trained model files can also be found in the Google Cloud Storage folder gs://bert_models/2018_10_18. For example:

export BERT_BASE_DIR=gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12

解压缩的预训练的模型文件也可以在谷歌云存储文件夹

gs://bert_models/2018_10_18中找到。

例如,export BERT_BASE_DIR=gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12

 

分类任务

Sentence (and sentence-pair) classification tasks

句子(和句子对)分类任务

Before running this example you must download the GLUE data by running this script and unpack it to some directory $GLUE_DIR. Next, download the BERT-Base checkpoint and unzip it to some directory $BERT_BASE_DIR.

This example code fine-tunes BERT-Base on the Microsoft Research Paraphrase Corpus (MRPC) corpus, which only contains 3,600 examples and can fine-tune in a few minutes on most GPUs.

在运行此示例之前,必须通过运行此脚本下载GLUE数据,并将其解包到某个目录$GLUE_DIR。

接下来,下载BERT-Base检查点,并将其解压缩到某个目录$BERT_BASE_DIR中。

这段示例代码在微软研究释义语料库(MRPC)上微调BERT-Base;该语料库只包含3600个样本,在大多数GPU上几分钟即可完成微调。

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \

  --task_name=MRPC \

  --do_train=true \

  --do_eval=true \

  --data_dir=$GLUE_DIR/MRPC \

  --vocab_file=$BERT_BASE_DIR/vocab.txt \

  --bert_config_file=$BERT_BASE_DIR/bert_config.json \

  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \

  --max_seq_length=128 \

  --train_batch_size=32 \

  --learning_rate=2e-5 \

  --num_train_epochs=3.0 \

  --output_dir=/tmp/mrpc_output/

python run_classifier.py 配置信息

任务名称

是否进行训练

是否进行评估

数据路径

词典文件路径

BERT配置文件路径

检查点文件路径

序列最大长度

批处理量

学习率

训练周期数

输出路径

You should see output like this:

***** Eval results *****

  eval_accuracy = 0.845588

  eval_loss = 0.505248

  global_step = 343

  loss = 0.505248

您应该会看到这样的输出:

***** 评估结果 *****

  评估准确率 = 0.845588

  评估损失 = 0.505248

  全局步数 = 343

  损失= 0.505248

This means that the Dev set accuracy was 84.55%. Small sets like MRPC have a high variance in the Dev set accuracy, even when starting from the same pre-training checkpoint. If you re-run multiple times (making sure to point to different output_dir), you should see results between 84% and 88%.

这意味着Dev集的准确率为84.55%。像MRPC这样的小数据集在Dev集精度上有很高的方差,即使从相同的训练前检查点开始。

如果您重新运行多次(确保指向不同的output_dir),您应该会看到84%到88%之间的结果。

A few other pre-trained models are implemented off-the-shelf in run_classifier.py, so it should be straightforward to follow those examples to use BERT for any single-sentence or sentence-pair classification task.

其他一些预先训练过的模型是在run_classifier.py中实现的,所以遵循这些例子来使用BERT进行任何单个句子或句子对分类任务应该是简单的。

Note: You might see a message Running train on CPU. This really just means that it's running on something other than a Cloud TPU, which includes a GPU.

注意:你可能会看到“Running train on CPU”这样的消息。这只是表示它运行在Cloud TPU以外的设备上(这也包括GPU)。

Prediction from classifier

分类器预测

Once you have trained your classifier you can use it in inference mode by using the --do_predict=true command. You need to have a file named test.tsv in the input folder. Output will be created in file called test_results.tsv in the output folder. Each line will contain output for each sample, columns are the class probabilities.

一旦分类器训练完成,你就可以用--do_predict=true参数在推理模式下使用它。

你需要在输入文件夹中放一个名为test.tsv的文件。输出会写到输出文件夹中名为test_results.tsv的文件里,每一行对应一个样本的输出,各列是各个类别的概率(本节末尾给出了读取该文件的示例)。

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue
export TRAINED_CLASSIFIER=/path/to/fine/tuned/classifier

python run_classifier.py \

  --task_name=MRPC \

  --do_predict=true \

  --data_dir=$GLUE_DIR/MRPC \

  --vocab_file=$BERT_BASE_DIR/vocab.txt \

  --bert_config_file=$BERT_BASE_DIR/bert_config.json \

  --init_checkpoint=$TRAINED_CLASSIFIER \

  --max_seq_length=128 \

  --output_dir=/tmp/mrpc_output/

python run_classifier.py的配置信息

任务名称

是否进行预测

数据路径

词典文件路径

BERT配置文件路径

检查点

序列最大长度

输出路径
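下面是一个读取test_results.tsv并取每行概率最大类别的示意脚本(仅为示意:假设该文件为制表符分隔、每行是各类别的概率,标签顺序与任务处理器中定义的一致;路径为占位):

import csv

label_list = ["0", "1"]  # 假设:MRPC任务的标签顺序,请按实际任务调整

with open("/tmp/mrpc_output/test_results.tsv") as f:
    reader = csv.reader(f, delimiter="\t")
    for i, row in enumerate(reader):
        probs = [float(x) for x in row]          # 每列是一个类别的概率
        best = max(range(len(probs)), key=lambda j: probs[j])
        print(i, label_list[best], probs)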

 

 

问答任务

SQuAD 1.1

 

The Stanford Question Answering Dataset (SQuAD) is a popular question answering benchmark dataset. BERT (at the time of the release) obtains state-of-the-art results on SQuAD with almost no task-specific network architecture modifications or data augmentation. However, it does require semi-complex data pre-processing and post-processing to deal with (a) the variable-length nature of SQuAD context paragraphs, and (b) the character-level answer annotations which are used for SQuAD training. This processing is implemented and documented in run_squad.py.

斯坦福问答数据集(SQuAD)是一个流行的问题回答基准数据集。

BERT(在发布时)在SQuAD上获得了最先进的结果,几乎没有针对特定任务的网络体系结构进行修改或数据增强。

然而,它确实需要有些复杂的数据预处理和后处理来处理SQuAD的上下文段落的可变长度性质,以及用于SQuAD训练的字符级答案注释。

此处理在run_squad.py中实现并记录。

To run on SQuAD, you will first need to download the dataset. The SQuAD website does not seem to link to the v1.1 datasets any longer, but the necessary files can be found here:

若要在SQuAD上运行,您将首先需要下载该数据集。

SQuAD网站似乎已经不再链接到v1.1版的数据集了,但必要的文件可以在这里找到:

train-v1.1.json

dev-v1.1.json

evaluate-v1.1.py


Download these to some directory $SQUAD_DIR.

把这些资料下载到指定的目录$SQUAD_DIR。

The state-of-the-art SQuAD results from the paper currently cannot be reproduced on a 12GB-16GB GPU due to memory constraints (in fact, even batch size 1 does not seem to fit on a 12GB GPU using BERT-Large). However, a reasonably strong BERT-Base model can be trained on the GPU with these hyperparameters:

由于内存限制,目前论文中最先进的SQuAD结果不能在12GB-16GB的GPU上复制(即使是批处理量为1,BERT-Large也不适合在12GB GPU上运行)。

然而,一个相当强的BERT-Base模型可以在GPU上进行训练,并使用如下配置参数:

python run_squad.py \

  --vocab_file=$BERT_BASE_DIR/vocab.txt \

  --bert_config_file=$BERT_BASE_DIR/bert_config.json \

  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \

  --do_train=True \

  --train_file=$SQUAD_DIR/train-v1.1.json \

  --do_predict=True \

  --predict_file=$SQUAD_DIR/dev-v1.1.json \

  --train_batch_size=12 \

  --learning_rate=3e-5 \

  --num_train_epochs=2.0 \

  --max_seq_length=384 \

  --doc_stride=128 \

  --output_dir=/tmp/squad_base/

python run_squad.py文件的配置信息

词典文件路径

BERT配置文件路径

检查点文件路径

是否进行训练

训练文件路径

是否进行预测

预测文件路径

训练批处理量

学习率

训练周期

序列的最大长度

文件步距

输出路径

The dev set predictions will be saved into a file called predictions.json in the output_dir:

开发集(dev set)的预测结果会保存到output_dir中名为predictions.json的文件里。

 

python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ./squad/predictions.json

Which should produce an output like this:

{"f1": 88.41249612335034, "exact_match": 81.2488174077578}

python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ./squad/predictions.json

它应该会产生像这样的输出:

{"f1": 88.41249612335034, "exact_match": 81.2488174077578}

You should see a result similar to the 88.5% reported in the paper for BERT-Base.

你应该会看到一个类似于论文中BERT-Base的88.5%的结果。

 

If you have access to a Cloud TPU, you can train with BERT-Large. Here is a set of hyperparameters (slightly different than the paper) which consistently obtain around 90.5%-91.0% F1 single-system trained only on SQuAD:

如果你可以使用Cloud TPU,就可以用BERT-Large进行训练。下面这组超参数(与论文略有不同)可以让仅在SQuAD上训练的单系统稳定地达到约90.5%-91.0%的F1:

 

python run_squad.py \

  --vocab_file=$BERT_LARGE_DIR/vocab.txt \

  --bert_config_file=$BERT_LARGE_DIR/bert_config.json \

  --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \

  --do_train=True \

  --train_file=$SQUAD_DIR/train-v1.1.json \

  --do_predict=True \

  --predict_file=$SQUAD_DIR/dev-v1.1.json \

  --train_batch_size=24 \

  --learning_rate=3e-5 \

  --num_train_epochs=2.0 \

  --max_seq_length=384 \

  --doc_stride=128 \

  --output_dir=gs://some_bucket/squad_large/ \

  --use_tpu=True \

  --tpu_name=$TPU_NAME

python run_squad.py配置信息

词典文件路径

BERT配置文件路径

检查点文件路径

是否进行训练

训练文件路径

是否进行预测

预测文件路径

批处理量

学习率

训练周期数

序列最大长度

步距

输出路径

是否使用TPU

TPU的名字

For example, one random run with these parameters produces the following Dev scores:

{"f1": 90.87081895814865, "exact_match": 84.38978240302744}

例如,使用这些参数进行一次随机运行会产生以下分数:

{"f1": 90.87081895814865, "exact_match": 84.38978240302744}

If you fine-tune for one epoch on TriviaQA before this the results will be even better, but you will need to convert TriviaQA into the SQuAD json format.

如果您在此之前对TriviaQA进行了一个训练周期的微调,结果会更好,但您需要将TriviaQA转换为SQuAD json格式。

 

SQuAD 2.0

This model is also implemented and documented in run_squad.py.

SQuAD 2.0

这个模型也在run_squad.py中实现和记录。

To run on SQuAD 2.0, you will first need to download the dataset. The necessary files can be found here:

要在SQuAD 2.0上运行,您首先需要下载数据集。必要的文件可以在这里找到:

train-v2.0.json

dev-v2.0.json

evaluate-v2.0.py


Download these to some directory $SQUAD_DIR.

将这些资料下载到指定目录$SQUAD_DIR。

On Cloud TPU you can run with BERT-Large as follows:

在Cloud TPU上,您可以使用BERT-Large运行如下:

python run_squad.py \

  --vocab_file=$BERT_LARGE_DIR/vocab.txt \

  --bert_config_file=$BERT_LARGE_DIR/bert_config.json \

  --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \

  --do_train=True \

  --train_file=$SQUAD_DIR/train-v2.0.json \

  --do_predict=True \

  --predict_file=$SQUAD_DIR/dev-v2.0.json \

  --train_batch_size=24 \

  --learning_rate=3e-5 \

  --num_train_epochs=2.0 \

  --max_seq_length=384 \

  --doc_stride=128 \

  --output_dir=gs://some_bucket/squad_large/ \

  --use_tpu=True \

  --tpu_name=$TPU_NAME \

  --version_2_with_negative=True

python run_squad.py文件的配置信息

词典文件路径

BERT配置文件路径

检查点文件路径

是否进行训练

训练文件路径

是否进行预测

预测文件路径

批处理量大小

学习率

训练周期数

序列最大长度

步距

输出路径

是否使用TPU

TPU名称

是否使用版本2

We assume you have copied everything from the output directory to a local directory called ./squad/. The initial dev set predictions will be at ./squad/predictions.json and the differences between the score of no answer ("") and the best non-null answer for each question will be in the file ./squad/null_odds.json

我们假设你已经把输出目录中的所有内容复制到了本地目录./squad/。最初的开发集预测会在./squad/predictions.json中,而每个问题“无答案”("")得分与最佳非空答案得分之间的差值会在./squad/null_odds.json文件中。

Run this script to tune a threshold for predicting null versus non-null answers:

运行此脚本以调整预测空和非空答案的阈值:

python $SQUAD_DIR/evaluate-v2.0.py $SQUAD_DIR/dev-v2.0.json ./squad/predictions.json --na-prob-file ./squad/null_odds.json

Assume the script outputs "best_f1_thresh" THRESH. (Typical values are between -1.0 and -5.0). You can now re-run the model to generate predictions with the derived threshold or alternatively you can extract the appropriate answers from ./squad/nbest_predictions.json.


 

假设脚本输出的“best_f1_thresh”为THRESH(典型值在-1.0和-5.0之间)。现在你可以用得到的阈值重新运行模型来生成预测,也可以直接从./squad/nbest_predictions.json中提取相应的答案。
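下面是后一种做法的示意脚本(仅为示意:假设null_odds.json把每个问题id映射到“空答案得分减去最佳非空答案得分”的差值,nbest_predictions.json把每个问题id映射到按得分排序、带有text字段的候选列表;字段名如有出入,请以run_squad.py的实际输出为准):

import json

THRESH = -1.0  # 假设:由evaluate-v2.0.py得到的best_f1_thresh

with open("./squad/null_odds.json") as f:
    null_odds = json.load(f)
with open("./squad/nbest_predictions.json") as f:
    nbest = json.load(f)

final_predictions = {}
for qid, score_diff in null_odds.items():
    if score_diff > THRESH:
        final_predictions[qid] = ""      # 差值超过阈值:预测“无答案”
    else:
        # 取第一个非空文本作为最佳非空答案
        best = next((e["text"] for e in nbest[qid] if e["text"]), "")
        final_predictions[qid] = best

with open("./squad/predictions_thresh.json", "w") as f:
    json.dump(final_predictions, f)

(或者按照下面的命令,用--null_score_diff_threshold=$THRESH重新运行模型。)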

 

python run_squad.py \

  --vocab_file=$BERT_LARGE_DIR/vocab.txt \

  --bert_config_file=$BERT_LARGE_DIR/bert_config.json \

  --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \

  --do_train=False \

  --train_file=$SQUAD_DIR/train-v2.0.json \

  --do_predict=True \

  --predict_file=$SQUAD_DIR/dev-v2.0.json \

  --train_batch_size=24 \

  --learning_rate=3e-5 \

  --num_train_epochs=2.0 \

  --max_seq_length=384 \

  --doc_stride=128 \

  --output_dir=gs://some_bucket/squad_large/ \

  --use_tpu=True \

  --tpu_name=$TPU_NAME \

  --version_2_with_negative=True \

  --null_score_diff_threshold=$THRESH

python run_squad.py文件的配置信息

词典文件路径

BERT配置文件路径

检查点文件路径

是否进行训练

训练文件路径

是否进行预测

预测文件路径

批处理量

学习率

训练周期

序列最大长度

步距

输出路径

是否使用TPU

TPU名称

是否使用版本2

空得分阈值

 

内存不足

Out-of-memory issues

内存不足的问题

All experiments in the paper were fine-tuned on a Cloud TPU, which has 64GB of device RAM. Therefore, when using a GPU with 12GB - 16GB of RAM, you are likely to encounter out-of-memory issues if you use the same hyperparameters described in the paper.

The factors that affect memory usage are:

论文中的所有实验都是在一个设备内存为64GB的Cloud TPU上进行微调的。因此,当使用具有12GB-16GBRAM的GPU时,如果您使用本文中描述的相同超参数,您可能会遇到内存不足的问题。

影响内存使用量的因素有:

 

max_seq_length: The released models were trained with sequence lengths up to 512, but you can fine-tune with a shorter max sequence length to save substantial memory. This is controlled by the max_seq_length flag in our example code.

max_seq_length:发布的模型训练的序列长度高达512,但你可以用更短的最大序列长度进行微调,以节省大量内存。这是由我们的示例代码中的max_seq_length标志控制的。

train_batch_size: The memory usage is also directly proportional to the batch size.

train_batch_size:内存使用量也与批处理量大小成正比。

Model type, BERT-Base vs. BERT-Large: The BERT-Large model requires significantly more memory than BERT-Base.

模型类型(BERT-Base vs. BERT-Large):BERT-Large比BERT-Base需要多得多的内存。

Optimizer: The default optimizer for BERT is Adam, which requires a lot of extra memory to store the m and v vectors. Switching to a more memory efficient optimizer can reduce memory usage, but can also affect the results. We have not experimented with other optimizers for fine-tuning.

优化器:BERT默认的优化器是Adam,它需要大量额外内存来存储m和v向量。换用内存效率更高的优化器可以减少内存占用,但也可能影响结果。我们没有用其他优化器做过微调实验。
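可以粗略估算一下Adam这两组状态带来的额外显存开销(仅为粗略估算,假设参数和优化器状态均为float32):

params = 340_000_000                         # BERT-Large约3.4亿参数
bytes_per_float = 4                          # 假设float32
adam_state = params * bytes_per_float * 2    # m和v两组状态
print("Adam额外状态约 %.1f GB" % (adam_state / 1024**3))   # 约2.5 GB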

Using the default training scripts (run_classifier.py and run_squad.py), we benchmarked the maximum batch size on a single Titan X GPU (12GB RAM) with TensorFlow 1.11.0:

使用默认的训练脚本(run_classifier.py和run_squad.py),我们在单块Titan X GPU(12GB显存)上用TensorFlow 1.11.0测试了最大批处理量:

 

| System(系统) | Seq Length(序列长度) | Max Batch Size(最大批处理量) |
|---|---|---|
| BERT-Base | 64 | 64 |
| ... | 128 | 32 |
| ... | 256 | 16 |
| ... | 320 | 14 |
| ... | 384 | 12 |
| ... | 512 | 6 |
| BERT-Large | 64 | 12 |
| ... | 128 | 6 |
| ... | 256 | 2 |
| ... | 320 | 1 |
| ... | 384 | 0 |
| ... | 512 | 0 |

...

Unfortunately, these max batch sizes for BERT-Large are so small that they will actually harm the model accuracy, regardless of the learning rate used. We are working on adding code to this repository which will allow much larger effective batch sizes to be used on the GPU. The code will be based on one (or both) of the following techniques:

不幸的是,这些针对BERT-Large的最大批处理大小是如此之小,因此无论使用的学习率如何,它们实际上都会损害模型的准确性。我们正在努力向这个存储库中添加代码,这将允许在GPU上使用更大的有效批处理大小。该代码将基于以下一种(或两种)技术:

Gradient accumulation: The samples in a minibatch are typically independent with respect to gradient computation (excluding batch normalization, which is not used here). This means that the gradients of multiple smaller minibatches can be accumulated before performing the weight update, and this will be exactly equivalent to a single larger update.

梯度积累

在梯度计算方面,一个小批中的样本通常是独立的(不包括批归一化,这里不使用)。这意味着,在执行重量更新之前,可以累积多个较小的小批次的梯度,而这将完全相当于一个较大的更新。
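下面是梯度累积思路的一个极简示意(与框架无关的玩具代码,仅为说明“多个小批次的梯度累加后再做一次更新,近似等价于一个大批次”;实际使用时需改用所用框架的优化器API):

def compute_gradients(params, micro_batch):
    # 占位:真实场景中这里是对一个小批次做前向+反向传播得到的梯度
    target = sum(micro_batch) / len(micro_batch)
    return [2 * (p - target) for p in params]

def train_step(params, big_batch, accumulation_steps, lr=0.01):
    """把一个大批次切成accumulation_steps个小批次,梯度累加后只更新一次权重。"""
    accum = [0.0 for _ in params]
    step = len(big_batch) // accumulation_steps
    for i in range(accumulation_steps):
        micro = big_batch[i * step:(i + 1) * step]
        grads = compute_gradients(params, micro)
        accum = [a + g / accumulation_steps for a, g in zip(accum, grads)]
    return [p - lr * a for p, a in zip(params, accum)]

params = [0.0]
data = [1.0, 2.0, 3.0, 4.0]
print(train_step(params, data, accumulation_steps=2))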

Gradient checkpointing: The major use of GPU/TPU memory during DNN training is caching the intermediate activations in the forward pass that are necessary for efficient computation in the backward pass. "Gradient checkpointing" trades memory for compute time by re-computing the activations in an intelligent way.

梯度检查点(Gradient checkpointing)

DNN训练中GPU/TPU内存的主要用途,是缓存前向传播中的中间激活值,以便在反向传播中高效计算。“梯度检查点”通过有选择地重新计算这些激活值,用计算时间换取内存。

However, this is not implemented in the current release.

Using BERT to extract fixed feature vectors (like ELMo)

但是,这在当前的版本中并没有实现。

使用BERT提取固定的特征向量(如ELMo)

In certain cases, rather than fine-tuning the entire pre-trained model end-to-end, it can be beneficial to obtained pre-trained contextual embeddings, which are fixed contextual representations of each input token generated from the hidden layers of the pre-trained model. This should also mitigate most of the out-of-memory issues.

在某些情况下,与其端到端微调整个预训练的模型,不如获得预训练的上下文嵌入,这是由预训练模型的隐藏层生成的每个输入令牌的固定上下文表示。这也可以减轻大多数内存不足的问题。

As an example, we include the script extract_features.py which can be used like this:

举个例子,脚本extract_features.py可以按照下述方式进行使用:

# Sentence A and Sentence B are separated by the ||| delimiter for sentence
# pair tasks like question answering and entailment.
# For single sentence inputs, put one sentence per line and DON'T use the
# delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

对于句子对任务(例如问答和文本蕴含),句子A和句子B用|||分隔符隔开。

对于单句输入,每行放一个句子,不要使用分隔符。例如上面的命令会把“Who was Jim Henson ? ||| Jim Henson was a puppeteer”写入/tmp/input.txt。

python extract_features.py \

  --input_file=/tmp/input.txt \

  --output_file=/tmp/output.jsonl \

  --vocab_file=$BERT_BASE_DIR/vocab.txt \

  --bert_config_file=$BERT_BASE_DIR/bert_config.json \

  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \

  --layers=-1,-2,-3,-4 \

  --max_seq_length=128 \

  --batch_size=8

python extract_features.py 文件的配置信息

输入文件路径

输出文件路径

词典文件路径

BERT配置文件路径

检查点文件路径

序列最大长度

批处理大小

This will create a JSON file (one line per line of input) containing the BERT activations from each Transformer layer specified by layers (-1 is the final hidden layer of the Transformer, etc.)

这会生成一个JSON文件(输入的每一行对应文件中的一行),其中包含由layers参数指定的各个Transformer层的BERT激活值(-1表示Transformer的最后一个隐藏层,以此类推)。

Note that this script will produce very large output files (by default, around 15kb for every input token).

If you need to maintain alignment between the original and tokenized words (for projecting training labels), see the Tokenization section below.

注意,这个脚本将产生非常大的输出文件(默认情况下,每个输入令牌大约为15kb)。 如果需要保持原始单词和标记化单词之间的对齐(用于投影训练标签),请参阅下面的标记化部分。
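下面是读取该输出文件的一个示意脚本(仅为示意:假设每行JSON中有一个features列表,每个元素带有token字段和layers列表,layers中的每项含index和values;字段名如与实际输出不符,请以extract_features.py生成的文件为准):

import json

with open("/tmp/output.jsonl") as f:
    for line in f:
        example = json.loads(line)
        for feat in example["features"]:
            token = feat["token"]
            # 取index为-1的那一层(最后一个隐藏层)的向量
            last_layer = next(l for l in feat["layers"] if l["index"] == -1)
            vec = last_layer["values"]
            print(token, len(vec))   # 对BERT-Base,向量维度应为768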

Note: You may see a message like Could not find trained model in model_dir: /tmp/tmpuB5g5c, running initialization to predict. This message is expected, it just means that we are using the init_from_checkpoint() API rather than the saved model API. If you don't specify a checkpoint or specify an invalid checkpoint, this script will complain.

注意:正在运行初始化进行预测的时候,您可能会看到“在model_dir: /tmp/tmpuB5g5c中无法找到训练模型”这样的消息。

这条消息是我们所期望的,它只是意味着我们使用init_from_checkpoint()API,而不是保存的模型API。如果没有指定检查点或指定无效检查点,此脚本会报错。

Tokenization

For sentence-level tasks (or sentence-pair) tasks, tokenization is very simple. Just follow the example code in run_classifier.py and extract_features.py. The basic procedure for sentence-level tasks is:

词元化

对于句子级(或句子对)任务,分词非常简单,只需参照run_classifier.py和extract_features.py中的示例代码。句子级任务的基本流程如下(四个步骤之后附有一个完整的小示例):

1.Instantiate an instance of tokenizer = tokenization.FullTokenizer

1.实例化一个tokenizer 对象tokenizer = tokenization.FullTokenizer

 

2.Tokenize the raw text with tokens = tokenizer.tokenize(raw_text).

2.用 tokens = tokenizer.tokenize(raw_text) 对原始文本进行分词。

3.Truncate to the maximum sequence length. (You can use up to 512, but you probably want to use shorter if possible for memory and speed reasons.)

3.截断到最大的序列长度。(最大序列长度512,但由于内存和速度的原因,最好还是用短一些的序列长度。)

4.Add the [CLS] and [SEP] tokens in the right place.

4.在正确的位置添加[CLS]和[SEP]标记。
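把这四个步骤串起来,大致如下(仅为示意;vocab.txt路径为占位,使用区分大小写模型时应传do_lower_case=False):

import tokenization   # 本仓库中的tokenization.py

max_seq_length = 128
tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

raw_text = "the man went to the store ."
tokens = tokenizer.tokenize(raw_text)                 # 第2步:分词
tokens = tokens[:max_seq_length - 2]                  # 第3步:截断,给[CLS]/[SEP]留位置
tokens = ["[CLS]"] + tokens + ["[SEP]"]               # 第4步:加特殊标记
input_ids = tokenizer.convert_tokens_to_ids(tokens)   # 转成词表中的id
print(tokens)
print(input_ids)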

Word-level and span-level tasks (e.g., SQuAD and NER) are more complex, since you need to maintain alignment between your input text and output text so that you can project your training labels. SQuAD is a particularly complex example because the input labels are character-based, and SQuAD paragraphs are often longer than our maximum sequence length. See the code in run_squad.py to show how we handle this.

单词级和跨级任务(例如,SQuAD和NER)更加复杂,因为您需要在输入文本和输出文本之间保持对齐,以便您可以投影训练标签。

SQuAD是一个特别复杂的例子,因为输入标签是基于字符的,而且SQuAD段落通常比我们的最大序列长度要长。

请查看run_squad.py中的代码来说明我们如何处理这个问题。

Before we describe the general recipe for handling word-level tasks, it's important to understand what exactly our tokenizer is doing. It has three main steps:

在描述处理单词级任务的一般流程之前,有必要先弄清楚我们的分词器到底在做什么。它有三个主要步骤(三个步骤之后附有一个小示例):

Text normalization: Convert all whitespace characters to spaces, and (for the Uncased model) lowercase the input and strip out accent markers. E.g., John Johanson's, → john johanson's

文本规范化:

将所有空格字符转换为空格,(对于未区分大小写的模型)将输入转换为小写,并去掉重音标记。

例如,John Johanson's, → john johanson's

Punctuation splitting: Split all punctuation characters on both sides (i.e., add whitespace around all punctuation characters). Punctuation characters are defined as (a) Anything with a P* Unicode class, (b) any non-letter/number/space ASCII character (e.g., characters like $ which are technically not punctuation). E.g., john johanson's, → john johanson ' s

标点符号分割:

分割两侧的所有标点字符(在所有标点字符周围添加空格)。

标点字符被定义为

(a) 任何具有P* Unicode类的东西

(b)任何非字母/数字/空间的ASCII字符(例如,像$这样的字符,在技术上不是标点符号)

例如,john johanson's, → john johanson ' s

WordPiece tokenization: Apply whitespace tokenization to the output of the above procedure, and apply WordPiece tokenization to each token separately. (Our implementation is directly based on the one from tensor2tensor, which is linked). E.g., john johanson ' s , → john johan ##son ' s

WordPiece分词:

对上一步的输出先做空白分词,再对每个词元分别做WordPiece分词。(我们的实现直接基于tensor2tensor中的实现,原文给出了链接。)

例如,john johanson ' s , → john johan ##son ' s
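把上面三个步骤串起来看效果,可以分别调用tokenization.py中的BasicTokenizer和WordpieceTokenizer(以下仅为示意,vocab.txt路径为占位;如接口与实际代码有出入,请以tokenization.py为准):

import tokenization

vocab = tokenization.load_vocab("uncased_L-12_H-768_A-12/vocab.txt")
basic = tokenization.BasicTokenizer(do_lower_case=True)        # 文本规范化 + 标点切分
wordpiece = tokenization.WordpieceTokenizer(vocab=vocab)       # WordPiece切分

text = "John Johanson's,"
pieces = []
for token in basic.tokenize(text):
    pieces.extend(wordpiece.tokenize(token))
print(pieces)   # 预期大致为:['john', 'johan', '##son', "'", 's', ',']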

The advantage of this scheme is that it is "compatible" with most existing English tokenizers. For example, imagine that you have a part-of-speech tagging task which looks like this:

这个方案的优点是它与大多数现有的英语标记化器“兼容”。例如,假设您有一个像这样的词性标记任务:

Input:  John Johanson 's   house

Labels: NNP  NNP      POS NN

The tokenized output will look like this:

Tokens: john johan ##son ' s house

输入:  John Johanson 's   house

标签: NNP  NNP      POS NN

标记后的输出如下:

标记: john johan ##son ' s house

Crucially, this would be the same output as if the raw text were John Johanson's house (with no space before the 's).

If you have a pre-tokenized representation with word-level annotations, you can simply tokenize each input word independently, and deterministically maintain an original-to-tokenized alignment:

至关重要的是,这将是相同的输出,就好像原始文本是John Johanson's house(在s之前没有空格)。

如果您有一个带有单词级注释的预标记化表示,您可以简单地独立地标记每个输入字,并确定地保持原始到标记化的对齐:

### Input
orig_tokens = ["John", "Johanson", "'s",  "house"]
labels      = ["NNP",  "NNP",      "POS", "NN"]

### Output
bert_tokens = []

# Token map will be an int -> int mapping between the `orig_tokens` index and
# the `bert_tokens` index.
# 标记映射orig_to_tok_map是从orig_tokens索引到bert_tokens索引的int -> int映射。
orig_to_tok_map = []

tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file, do_lower_case=True)

bert_tokens.append("[CLS]")
for orig_token in orig_tokens:
  orig_to_tok_map.append(len(bert_tokens))
  bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")

# bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map == [1, 2, 4, 6]

Now orig_to_tok_map can be used to project labels to the tokenized representation.

现在可以使用orig_to_tok_map将标签投射到标记化的表示中。
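例如,可以这样把词级标签投射到分词后的序列上(示意代码,接上面的例子;这里采用“只给每个原词的第一个子词打标签,其余位置标为X”的常见做法,这只是一种约定,并非本仓库规定的唯一方式):

# 接上例:bert_tokens、orig_to_tok_map、labels由上面的代码得到
bert_tokens = ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
orig_to_tok_map = [1, 2, 4, 6]
labels = ["NNP", "NNP", "POS", "NN"]

bert_labels = ["X"] * len(bert_tokens)
for orig_index, bert_index in enumerate(orig_to_tok_map):
    bert_labels[bert_index] = labels[orig_index]   # 只给每个原词的第一个子词打标签
print(list(zip(bert_tokens, bert_labels)))
# 预期:[CLS]/X, john/NNP, johan/NNP, ##son/X, '/POS, s/X, house/NN, [SEP]/X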

There are common English tokenization schemes which will cause a slight mismatch between how BERT was pre-trained. For example, if your input tokenization splits off contractions like do n't, this will cause a mismatch. If it is possible to do so, you should pre-process your data to convert these back to raw-looking text, but if it's not possible, this mismatch is likely not a big deal.

有一些常见的英语分词方案,会与BERT预训练时的处理方式产生轻微的不匹配。

例如,如果你的输入分词把缩写拆成 do n't 这样的形式,就会造成不匹配。

如果可能,你应该预处理数据,把这些还原成接近原始文本的形式;如果做不到,这种不匹配通常也不是大问题。

Pre-training with BERT

BERT预训练

We are releasing code to do "masked LM" and "next sentence prediction" on an arbitrary text corpus. Note that this is not the exact code that was used for the paper (the original code was written in C++, and had some additional complexity), but this code does generate pre-training data as described in the paper.

我们正在发布代码,在任意的文本语料库上进行“掩码语言模型”和“下一个句子预测”。

请注意,这并不是论文中使用的确切代码(原始代码是用C++编写的,并且有一些额外的复杂性),但这段代码确实生成了如论文中所述的训练前数据。

Here's how to run the data generation. The input is a plain text file, with one sentence per line. (It is important that these be actual sentences for the "next sentence prediction" task). Documents are delimited by empty lines. The output is a set of tf.train.Examples serialized into TFRecord file format.

下面是如何运行数据生成方法。输入是一个纯文本文件,每行有一个句子。(重要的是,这些都是“下一个句子预测”任务的实际句子)。文档由空行分隔。输出的是一组tf.train.Examples,序列化为TFRecord文件格式的示例。

You can perform sentence segmentation with an off-the-shelf NLP toolkit such as spaCy. The create_pretraining_data.py script will concatenate segments until they reach the maximum sequence length to minimize computational waste from padding (see the script for more details). However, you may want to intentionally add a slight amount of noise to your input data (e.g., randomly truncate 2% of input segments) to make it more robust to non-sentential input during fine-tuning.

你可以使用现成的NLP工具包(例如spaCy)来做分句。create_pretraining_data.py脚本会把片段拼接起来,直到达到最大序列长度,以尽量减少填充造成的计算浪费(更多细节见脚本)。

但是,您可能希望有意地向输入数据中添加少量的噪声(例如,随机截断2%的输入段),以使其在微调期间对非句子输入更健壮。

This script stores all of the examples for the entire input file in memory, so for large data files you should shard the input file and call the script multiple times. (You can pass in a file glob to run_pretraining.py, e.g., tf_examples.tf_record*.)

The max_predictions_per_seq is the maximum number of masked LM predictions per sequence. You should set this to around max_seq_length * masked_lm_prob (the script doesn't do that automatically because the exact value needs to be passed to both scripts).

这个脚本会把整个输入文件的所有样本保存在内存中,因此对于大数据文件,你应该把输入文件切分成多份并多次调用该脚本。(你可以给run_pretraining.py传入一个文件通配符,例如tf_examples.tf_record*。)

max_predictions_per_seq是每个序列中masked LM预测的最大数量。你应该把它设置为大约max_seq_length * masked_lm_prob(脚本不会自动计算,因为这个确切的值需要同时传给两个脚本)。
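按这个经验公式粗算一下(仅为演示,取值对应下面示例命令中的默认参数):

import math

max_seq_length = 128
masked_lm_prob = 0.15
max_predictions_per_seq = math.ceil(max_seq_length * masked_lm_prob)
print(max_predictions_per_seq)   # 128 * 0.15 = 19.2,向上取整约为20,与下面命令中的--max_predictions_per_seq=20一致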

python create_pretraining_data.py \

  --input_file=./sample_text.txt \

  --output_file=/tmp/tf_examples.tfrecord \

  --vocab_file=$BERT_BASE_DIR/vocab.txt \

  --do_lower_case=True \

  --max_seq_length=128 \

  --max_predictions_per_seq=20 \

  --masked_lm_prob=0.15 \

  --random_seed=12345 \

  --dupe_factor=5

python create_pretraining_data.py 文件的配置信息

输入文件路径

输出文件路径

词典文件路径

是否转换为小写

序列最大长度

每个序列的最大预测数

掩蔽的比例

随机数种子

数据重复次数(dupe_factor,同一条数据用不同掩码重复生成样本的次数)

Here's how to run the pre-training. Do not include init_checkpoint if you are pre-training from scratch. The model configuration (including vocab size) is specified in bert_config_file. This demo code only pre-trains for a small number of steps (20), but in practice you will probably want to set num_train_steps to 10000 steps or more. The max_seq_length and max_predictions_per_seq parameters passed to run_pretraining.py must be the same as create_pretraining_data.py.

以下是如何运行预训练。如果你是从头开始的预训练,请不要包括init_checkpoint。模型配置(包括vocab大小)在bert_config_file中指定。这个演示代码只对少量步骤(20步)进行预训练,但在实践中,您可能希望将num_train_steps设置为10000步或更多。传递给run_pretraining.py的max_seq_length和max_predictions_per_seq参数必须与create_pretraining_data.py相同。

 

python run_pretraining.py \

  --input_file=/tmp/tf_examples.tfrecord \

  --output_dir=/tmp/pretraining_output \

  --do_train=True \

  --do_eval=True \

  --bert_config_file=$BERT_BASE_DIR/bert_config.json \

  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \

  --train_batch_size=32 \

  --max_seq_length=128 \

  --max_predictions_per_seq=20 \

  --num_train_steps=20 \

  --num_warmup_steps=10 \

  --learning_rate=2e-5

python run_pretraining.py文件的配置信息

输入文件路径

输出文件路径

是否进行训练

是否进行评估evaluation

BERT配置文件路径

检查点文件路径

批处理量

序列的最大长度

每个序列最大预测数

训练步数

热身步数

学习率

This will produce an output like this:

***** Eval results *****

  global_step = 20

  loss = 0.0979674

  masked_lm_accuracy = 0.985479

  masked_lm_loss = 0.0979328

  next_sentence_accuracy = 1.0

  next_sentence_loss = 3.45724e-05

这将产生如下输出:

***** Eval results *****

  global_step = 20

  loss = 0.0979674

  masked_lm_accuracy = 0.985479

  masked_lm_loss = 0.0979328

  next_sentence_accuracy = 1.0

  next_sentence_loss = 3.45724e-05

Note that since our sample_text.txt file is very small, this example training will overfit that data in only a few steps and produce unrealistically high accuracy numbers.

请注意,由于我们的sample_text.txt文件非常小,这个示例训练将只在几个步骤中过度拟合这些数据,并产生不切实际的高精度数字。

预训练注意事项

Pre-training tips and caveats

预训练的提示和注意事项

If using your own vocabulary, make sure to change vocab_size in bert_config.json. If you use a larger vocabulary without changing this, you will likely get NaNs when training on GPU or TPU due to unchecked out-of-bounds access.

如果使用你自己的词表,请确保修改bert_config.json中的vocab_size。如果你使用了更大的词表却没有修改它,在GPU或TPU上训练时可能会因为未检查的越界访问而得到NaN。

· If your task has a large domain-specific corpus available (e.g., "movie reviews" or "scientific papers"), it will likely be beneficial to run additional steps of pre-training on your corpus, starting from the BERT checkpoint.

如果您的任务有一个大的特定领域的语料库(例如,“电影评论”或“科学论文”),那么在您的语料库上从BERT检查点开始运行额外的预训练步骤可能是有益的。

· The learning rate we used in the paper was 1e-4. However, if you are doing additional steps of pre-training starting from an existing BERT checkpoint, you should use a smaller learning rate (e.g., 2e-5).

 

我们在论文中使用的学习率是1e-4。但是,如果您正在从现有的BERT检查点开始进行额外的预训练步骤,那么您应该使用更小的学习率(例如,2e-5)。

· Current BERT models are English-only, but we do plan to release a multilingual model which has been pre-trained on a lot of languages in the near future (hopefully by the end of November 2018).

目前的BERT模型只使用英语,但我们计划发布一个多语言模型,在不久的将来(希望在2018年11月底)对许多语言进行了预训练。

· Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. In other words, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128. The fully-connected/convolutional cost is the same, but the attention cost is far greater for the 512-length sequences. Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps with a sequence length of 512. The very long sequences are mostly needed to learn positional embeddings, which can be learned fairly quickly. Note that this does require generating the data twice with different values of max_seq_length.

 

较长序列的开销不成比例地大,因为attention的计算量与序列长度成平方关系。换句话说,一批64条长度512的序列比一批256条长度128的序列开销大得多:全连接/卷积部分的代价相同,但512长度序列的attention代价要大得多。

因此,一个好的做法是先用序列长度128预训练(比如9万步),再用序列长度512额外训练1万步。需要很长的序列主要是为了学习位置嵌入,而这可以很快学会。注意,这确实需要用不同的max_seq_length把数据生成两次。

· If you are pre-training from scratch, be prepared that pre-training is computationally expensive, especially on GPUs. If you are pre-training from scratch, our recommended recipe is to pre-train a BERT-Base on a single preemptible Cloud TPU v2, which takes about 2 weeks at a cost of about $500 USD (based on the pricing in October 2018). You will have to scale down the batch size when only training on a single Cloud TPU, compared to what was used in the paper. It is recommended to use the largest batch size that fits into TPU memory.

 

如果你要从头开始预训练,请做好心理准备:预训练的计算开销非常大,尤其是在GPU上。

如果从头预训练,我们推荐的方案是在单块可抢占(preemptible)的Cloud TPU v2上预训练一个BERT-Base,大约需要2周,成本约500美元(按2018年10月的价格)。

与论文中的设置相比,只在单块Cloud TPU上训练时需要缩小批处理量,建议使用TPU内存能容纳的最大批处理量。

Pre-training data

预训练数据

We will not be able to release the pre-processed datasets used in the paper. For Wikipedia, the recommended pre-processing is to download the latest dump, extract the text with WikiExtractor.py, and then apply any necessary cleanup to convert it into plain text.

我们将无法发布本文中使用的预处理数据集。

对于维基百科,建议的预处理是下载最新的转储,用WikiExtractor.py提取文本,然后应用任何必要的清理来将其转换为纯文本。

Unfortunately the researchers who collected the BookCorpus no longer have it available for public download. The Project Guttenberg Dataset is a somewhat smaller (200M word) collection of older books that are public domain.

不幸的是,收集BookCorpus的研究人员已不再提供公开下载。

Guttenberg 项目数据集是一个较小的(2亿字)的公共领域的旧书集合。

Common Crawl is another very large collection of text, but you will likely have to do substantial pre-processing and cleanup to extract a usable corpus for pre-training BERT.

Common Crawl是另一个非常大的文本集合,但你可能需要做大量预处理和清洗,才能提取出可用于BERT预训练的语料。

Learning a new WordPiece vocabulary

学习新的WordPiece词表

This repository does not include code for learning a new WordPiece vocabulary. The reason is that the code used in the paper was implemented in C++ with dependencies on Google's internal libraries. For English, it is almost always better to just start with our vocabulary and pre-trained models. For learning vocabularies of other languages, there are a number of open source options available. However, keep in mind that these are not compatible with our tokenization.py library:

这个仓库不包含学习新WordPiece词表的代码。

原因是论文中使用的代码是用C++实现的,并且依赖Google的内部库。

对于英语来说,几乎总是直接使用我们的词表和预训练模型更好。

对于其他语言的词表学习,有许多开源选择。但请注意,它们与我们的tokenization.py库并不兼容:

Google's SentencePiece library

tensor2tensor's WordPiece generation script

Rico Sennrich's Byte Pair Encoding library

Google的SentencePiece库

tensor2tensor的WordPiece生成脚本

Rico Sennrich的字节对编码(BPE)库

Using BERT in Colab

在Colab中使用BERT

If you want to use BERT with Colab, you can get started with the notebook "BERT FineTuning with Cloud TPUs". At the time of this writing (October 31st, 2018), Colab users can access a Cloud TPU completely for free. Note: One per user, availability limited, requires a Google Cloud Platform account with storage (although storage may be purchased with free credit for signing up with GCP), and this capability may not longer be available in the future. Click on the BERT Colab that was just linked for more information.

如果你想在Colab中的使用BERT,你可以从“BERT FineTuning with Cloud TPUs”开始。

在撰写本文时(2018年10月31日),Colab用户可以完全免费使用一块Cloud TPU。注意:每位用户一块,数量有限,需要一个带存储的Google Cloud Platform账号(注册GCP赠送的免费额度可用于购买存储),而且这一功能将来可能不再提供。

点击链接BERT Colab以获取更多信息。

 

常见问题解答

FAQ

常见问题解答

Is this code compatible with Cloud TPUs? What about GPUs?

Yes, all of the code in this repository works out-of-the-box with CPU, GPU, and Cloud TPU. However, GPU training is single-GPU only.

此代码与云TPU兼容吗?GPU呢? 

是的,这个仓库中的所有代码在CPU、GPU和Cloud TPU上都可以开箱即用。不过,GPU训练只支持单GPU。

I am getting out-of-memory errors, what is wrong?

See the section on out-of-memory issues for more information.

我的内存溢出了,怎么回事 

有关更多信息,请参阅有关内存不足问题的一节。

Is there a PyTorch version available?

There is no official PyTorch implementation. However, NLP researchers from HuggingFace made a PyTorch version of BERT available which is compatible with our pre-trained checkpoints and is able to reproduce our results. We were not involved in the creation or maintenance of the PyTorch implementation so please direct any questions towards the authors of that repository.

有PyTorch版本的吗?

没有正式的PyTorch实现。然而,来自HuggingFace的NLP研究人员提供了一个PyTorch版本的BERT,它与我们预先训练的检查点兼容,并能够重现我们的结果。我们没有参与PyTorch实现的创建或维护,所以相关问题请联系作者。

Is there a Chainer version available?

There is no official Chainer implementation. However, Sosuke Kobayashi made a Chainer version of BERT available which is compatible with our pre-trained checkpoints and is able to reproduce our results. We were not involved in the creation or maintenance of the Chainer implementation so please direct any questions towards the authors of that repository.

有Chainer版本吗?

没有官方的Chainer实现。

不过,Sosuke Kobayashi提供了一个Chainer版本的BERT,它与我们的预训练检查点兼容,并且能够复现我们的结果。我们没有参与Chainer实现的创建或维护,因此如有问题请联系该仓库的作者。

Will models in other languages be released?

Yes, we plan to release a multi-lingual BERT model in the near future. We cannot make promises about exactly which languages will be included, but it will likely be a single model which includes most of the languages which have a significantly-sized Wikipedia.

会发布其他语言中的模型吗? 

是的,我们计划在不久的将来发布一个多语言的BERT模型。

我们无法承诺具体会包含哪些语言,但它很可能是一个单一模型,涵盖大多数拥有较大规模维基百科的语言。

Will models larger than BERT-Large be released?

So far we have not attempted to train anything larger than BERT-Large. It is possible that we will release larger models if we are able to obtain significant improvements.

会发布比BERT-Large还要大的版本吗 

到目前为止,我们还没有尝试训练任何比BERT-Large更大的东西。如果我们能够获得显著的改进,我们可能会发布更大的模型。

What license is this library released under?

All code and models are released under the Apache 2.0 license. See the LICENSE file for more information.

这个库是在什么许可下发布的? 

所有的代码和模型都是在Apache 2.0许可下发布的。有关更多信息,请参见许可证文件。

How do I cite BERT?

For now, cite the Arxiv paper:

@article{devlin2018bert,

  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},

  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},

  journal={arXiv preprint arXiv:1810.04805},

  year={2018}

}

If we submit the paper to a conference or journal, we will update the BibTeX.

我怎么引用BERT 

现在,请引用Arxiv的论文:

@article{devlin2018bert,

  标题={BERT:用于语言理解的深度双向Transformer预训练},

  作者={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},

  期刊={arXiv预印本 arXiv:1810.04805},
  年份={2018}
}

如果我们向会议或期刊提交论文,我们将更新BibTeX。

Disclaimer

This is not an official Google product.

免责声明 

这不是谷歌的官方产品。

Contact information

For help or issues using BERT, please submit a GitHub issue.

For personal communication related to BERT, please contact Jacob Devlin ([email protected]), Ming-Wei Chang ([email protected]), or Kenton Lee ([email protected]).

联系方式 

如需使用BERT方面的帮助或遇到问题,请提交GitHub issue。如需就BERT进行私下交流,请联系Jacob Devlin([email protected])、Ming-Wei Chang([email protected])或Kenton Lee([email protected])。

About

TensorFlow code and pre-trained models for BERT

关于

针对BERT的TensorFlow 代码和预训练模型

 

arxiv.org/abs/1810.04805


From: https://www.cnblogs.com/zhangdezhang/p/16880154.html

    前文有提到过使用docker来快速拉起一个zabbix监控系统(详见:​​如何使用docker快速部署一个zabbix监控系统​​),但是要一个个执行docker启动命令去将对应的容器启动。如果要配......