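Overview
- BLEU: based on the proportion of overlapping words/phrases (n-grams) between the generated text and the reference; focuses on precision
- ROUGE: based on the proportion of overlapping words/phrases (n-grams) between the generated text and the reference; focuses on recall
- METEOR: based on the proportion of overlapping words/phrases between the generated text and the reference; focuses on an F-score that combines precision and recall
- Distinct:
- Perplexity: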
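To make the precision/recall/F1 distinction in the list above concrete, here is a minimal sketch that scores unigram overlap three ways. The function name `overlap_scores` and the example sentences are illustrative assumptions. The real metrics add more on top of this (higher-order n-grams and a brevity penalty for BLEU, several variants for ROUGE, stemming and synonym matching for METEOR), so this shows only the shared core idea.

```python
from collections import Counter


def overlap_scores(reference, hypothesis):
    """Unigram overlap precision, recall and F1 between two token lists.

    Illustrative helper only; not the exact formula of BLEU, ROUGE or METEOR.
    """
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    # Clipped overlap: a token is credited at most as often as it occurs in the reference.
    overlap = sum((ref_counts & hyp_counts).values())
    precision = overlap / len(hypothesis) if hypothesis else 0.0
    recall = overlap / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat".split()
precision, recall, f1 = overlap_scores(reference, hypothesis)
print(precision, recall, f1)
```

Here every hypothesis token appears in the reference, so precision is perfect, while recall is only one half because the hypothesis covers half of the reference. This is why a precision-oriented metric needs an extra penalty for short outputs.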
BLEU
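BLEU stands for Bilingual Evaluation Understudy. The "understudy" stands in for human judges when evaluating translation output. Although the metric was devised for machine translation, it can be used to evaluate the text generated in a range of natural language processing tasks.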
Formula:
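The standard definition of BLEU is usually written as below. The symbols are the conventional ones and are not taken from these notes: $p_n$ is the modified (clipped) n-gram precision, $w_n$ are the n-gram weights (typically $w_n = 1/N$ with $N = 4$), $c$ is the length of the candidate, $r$ is the effective reference length, and $\mathrm{BP}$ is the brevity penalty.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}

p_n = \frac{\sum_{g \in G_n(\text{candidate})} \min\left( \mathrm{Count}_{\text{cand}}(g),\ \max_{j} \mathrm{Count}_{\text{ref}_j}(g) \right)}
           {\sum_{g \in G_n(\text{candidate})} \mathrm{Count}_{\text{cand}}(g)}
```

Here $G_n(\text{candidate})$ denotes the set of distinct n-grams in the candidate. Clipping in $p_n$ prevents a candidate from being rewarded for repeating a matching n-gram more often than any reference contains it, and $\mathrm{BP}$ compensates for the fact that precision alone favors short outputs.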
Example:
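As a worked example, the following sketch computes an unsmoothed sentence-level BLEU using only the standard library. The helper names (`ngrams`, `modified_precision`, `simple_bleu`) and the example sentences are assumptions made for illustration. It is not the implementation used by nltk or sacrebleu, which add smoothing and their own tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision of the hypothesis against one or more references."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        # Counter union keeps, per n-gram, the maximum count seen in any reference.
        max_ref_counts |= Counter(ngrams(ref, n))
    clipped = sum(min(count, max_ref_counts[g]) for g, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


def simple_bleu(references, hypothesis, max_n=4):
    """Unsmoothed BLEU with uniform weights; returns a value in [0, 1]."""
    precisions = [modified_precision(references, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, a single zero precision zeroes the score
    c = len(hypothesis)
    # Effective reference length: the one closest to c (ties go to the shorter one).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "the cat is on the mat".split()
hypothesis = "the the the cat".split()
p1 = modified_precision([reference], hypothesis, 1)  # "the" is clipped to 2, so 3/4
p2 = modified_precision([reference], hypothesis, 2)  # only ("the", "cat") matches, so 1/3
score = simple_bleu([reference], hypothesis, max_n=2)
print(p1, p2, score)
```

With `max_n=2` the geometric mean of the precisions is sqrt(3/4 × 1/3) = 0.5, and the brevity penalty is exp(1 − 6/4), giving a score of about 0.303. With the default `max_n=4` the same pair scores 0, because no 3-gram matches. This is the situation that smoothing methods are designed to handle.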
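The most widely used Python libraries for computing BLEU are nltk and sacrebleu. Their results can differ, largely because they use different smoothing algorithms. Note also that sacrebleu reports scores on a 0-100 scale, whereas nltk returns values in [0, 1].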
from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh
from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a
from sacrebleu import BLEU
import nltk.translate.bleu_score as nltkbleu


def tokenizer(s, lang):
    # Split a sentence into tokens with sacrebleu's tokenizers
    # (NLTK expects pre-tokenized input).
    if lang == "zh":
        return TokenizerZh()(s).split(" ")
    else:
        return Tokenizer13a()(s).split(" ")


def sacre_bleu(refs, pred, n):
    # sacrebleu tokenizes internally; effective_order=True avoids zero scores
    # on sentences shorter than n tokens.
    bleu = BLEU(lowercase=True, tokenize="zh", max_ngram_order=n, effective_order=True)
    score = bleu.sentence_score(references=refs, hypothesis=pred).score
    print(score)  # on a 0-100 scale


def nltk_bleu(refs, pred, n):
    """
    The default smoothing_function is usually adequate (method7 is used below).
    BLEU conventionally uses n=4.
    """
    refs = [tokenizer(ref, "zh") for ref in refs]
    pred = tokenizer(pred, "zh")
    # Uniform weights over 1-gram .. n-gram precisions.
    weights = [1 / n for _ in range(n)]
    score = nltkbleu.sentence_bleu(
        refs,
        pred,
        smoothing_function=nltkbleu.SmoothingFunction().method7,
        weights=weights
    )
    print(score)  # in [0, 1]


if __name__ == "__main__":
    s = "你好世界"
    sacre_bleu([s], s, 4)
    nltk_bleu([s], s, 4)