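How is BERT pre-trained, and which two tasks are used during pre-training?
This walkthrough fine-tunes BERT on the SQuAD dataset for our question-answering task. Can you fine-tune BERT on the IMDB movie review dataset instead and improve the accuracy of its results?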
I. Introduction to BERT
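BERT stands for Bidirectional Encoder Representations from Transformers. It was proposed by Google in 2018, and its core architecture is a multi-layer Transformer encoder. The model is pre-trained with two tasks: Masked Language Model (MLM), in which some input tokens are hidden and the model predicts them from the surrounding context, and Next Sentence Prediction (NSP), in which the model predicts whether the second sentence actually follows the first. For each downstream task, the model can then be fine-tuned by adding a few extra layers on top of the pre-trained model.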
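To make the MLM task more concrete, here is a minimal sketch of the token-corruption step, assuming the commonly described recipe: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary, and the whitespace tokenization are illustrative assumptions; real BERT pre-training works on WordPiece tokens and skips special tokens such as `[CLS]` and `[SEP]`.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Corrupt a token list in the style of BERT's MLM task (simplified sketch).

    Returns (corrupted, labels). labels[i] holds the original token at the
    positions the model must predict and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # selected: the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the original token
        else:
            corrupted.append(token)
            labels.append(None)  # not selected: ignored by the loss
    return corrupted, labels


# Toy example; the vocabulary here is hypothetical.
tokens = "the capital of china is beijing".split()
vocab = ["the", "capital", "of", "china", "is", "beijing", "city", "river"]
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.15, seed=0)
print(corrupted)
print(labels)
```

Keeping some selected tokens unchanged or swapping in random ones means the model cannot rely on `[MASK]` being present, which matters because `[MASK]` never appears at fine-tuning time.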
Fine-tuning BERT relies on HuggingFace components. It is recommended to read this first: HuggingFace Core Components and Hands-On Applications
II. BERT in Practice: Question Answering with the Original BERT
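We first use the original BERT checkpoint released by Google for the question-answering task, without any fine-tuning, to see how well it does. The steps are:
Download the original BERT with a question-answering head attached
Run inference with BERT directly to answer a question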
from transformers import BertTokenizer, BertForQuestionAnswering
import torch
import numpy as np
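# Set the random seed for PyTorch and NumPy so the randomly initialized QA head is reproducible.
# The seed value below is an arbitrary choice for illustration; the original does not show one,
# so the printed answer may differ from the output shown further down.
torch.manual_seed(42)
np.random.seed(42)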
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
question, text = "What is the capital of China?", "The capital of China is Beijing."
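# Encode question and context as one sequence pair: [CLS] question [SEP] context [SEP]
inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**inputs)
# Most likely start and end token positions (+1 because the slice end is exclusive)
answer_start_index = torch.argmax(outputs.start_logits)
answer_end_index = torch.argmax(outputs.end_logits) + 1
# Slice the answer tokens out of the input and decode them back to text
predict_answer_tokens = inputs['input_ids'][0][answer_start_index:answer_end_index]
predicted_answer = tokenizer.decode(predict_answer_tokens)
print(question, predicted_answer)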
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
What is the capital of China? the
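Here, our question is: What is the capital of China?
The context we provide is: The capital of China is Beijing.
BERT's answer is: the
The answer is wrong, which is consistent with the warning in the output above: the question-answering head (`qa_outputs`) is newly initialized and has not been trained, so its predictions are essentially arbitrary.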
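The code above picks the start and end positions with two independent `argmax` calls, so the end can land before the start and produce an empty answer. A common alternative is to search for the pair that maximizes the summed score subject to start <= end. The following is a minimal sketch of that idea in plain Python; the function name `best_span`, the `max_answer_len` parameter, and the toy logits are illustrative assumptions, not part of the original code or of the `transformers` API.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair, end inclusive, with the highest
    start_logits[start] + end_logits[end] such that start <= end and the
    span covers at most max_answer_len tokens."""
    best_score = float("-inf")
    best = (0, 0)
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy logits where the independent argmaxes disagree:
# argmax(start) = 2 but argmax(end) = 0, which is not a valid span.
start_logits = [0.1, 0.2, 3.0, 0.5]
end_logits = [2.5, 0.1, 0.3, 2.0]
start, end = best_span(start_logits, end_logits)
print(start, end)  # prints "2 3"
```

With the model above, the inputs could plausibly be obtained as `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the answer tokens sliced as `input_ids[start:end + 1]`. This only makes the span well-formed; it does not make an untrained head produce correct answers.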
