We know that a language model (Language Model, LM) is in fact a probability distribution over token sequences. For example, for a sentence \(S = w_1, ... , w_t\), we estimate the probability of the whole sentence as follows:
\[ P(S) = P(w_1, ..., w_t) \\ = P(w_1, ..., w_{t-1}) * P(w_t|w_1, ..., w_{t-1}) \\ = P(w_1, ..., w_{t-2}) * P(w_{t-1}|w_1, ..., w_{t-2}) * P(w_t|w_1, ..., w_{t-1}) \\ = P(w_1, w_2) * P(w_3|w_1, w_2) * ... * P(w_{t-1}|w_1, ..., w_{t-2}) * P(w_t|w_1, ..., w_{t-1}) \\ = P(w_1) * P(w_2|w_1) * P(w_3|w_1, w_2) * ... * P(w_{t-1}|w_1, ..., w_{t-2}) * P(w_t|w_1, ..., w_{t-1}) \]The N-gram model is built on this foundation [1][2]. This article mainly follows the lecture notes of Professor Sun Maosong at Tsinghua University and discusses how N-grams work.
The N-gram model: in an n-gram model, the prediction of \(x_{i}\) depends only on the last \(n-1\) tokens \(x_{i-(n-1):i-1}\), rather than on the entire history:
\[p(x_i \mid x_{1:i-1}) = p(x_i \mid x_{i-(n-1):i-1}). \]
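To make the chain rule and the n-gram truncation concrete, here is a minimal Python sketch; the function `sentence_prob` and its `cond_prob` argument are illustrative assumptions, not part of the course material:

```python
# A minimal sketch: score a sentence with the chain rule, where each
# conditional probability looks only at the last n-1 tokens.
# `cond_prob(word, context)` is a stand-in for whatever estimator
# you have, e.g. one based on corpus counts.

def sentence_prob(tokens, cond_prob, n=2):
    prob = 1.0
    for i, word in enumerate(tokens):
        context = tuple(tokens[max(0, i - (n - 1)):i])  # only the last n-1 tokens
        prob *= cond_prob(word, context)
    return prob
```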
Specifically, the unigram, bigram, and trigram models are defined as follows:
unigram --- the zeroth-order Markov model --- $ P(w_i) $
bigram --- the first-order Markov model --- $ P(w_i | w_{i-1}) $
trigram --- the second-order Markov model --- $ P(w_i | w_{i-1}, w_{i-2}) $
For example, the bigram model is in fact a first-order Markov model: each word is conditioned only on the word immediately preceding it.
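As a rough sketch of how such a model can be estimated from data, the snippet below counts unigrams and bigrams in a tiny made-up corpus and forms the maximum-likelihood estimate \(P(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})\); the toy corpus and the function name `bigram_mle` are assumptions for illustration only:

```python
from collections import Counter

# Tiny illustrative corpus; in practice this would be a large corpus like the one used below.
corpus = ["prepare for leap in the dark", "a leap in the dark"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_mle(prev, word):
    # Maximum-likelihood estimate: Count(prev, word) / Count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_mle("the", "dark"))  # 2/2 = 1.0 in this toy corpus
```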
Take an example from speech recognition: we want to estimate the probabilities of the two sentences "prepare for leap in the dark" and "prepare for lip in the dark", using the BNS corpus, which contains 100,106,008 words.
The counts of each word/word pair in the corpus are as follows:
COUNT(prepare)=3023
COUNT(for)=899331
COUNT(leap)=1045
COUNT(in)=1970532
COUNT(the)=2165569
COUNT(dark)=13489
COUNT(lip)=1592
COUNT(prepare for)=528
COUNT(for leap)=1
COUNT(leap in)=100
COUNT(in the)=535036
COUNT(the dark)=3668
COUNT(for lip)=2
COUNT(lip in)=25
The probabilities computed for individual words are as follows:
P(prepare)= 0.000030
P(for)=0.0090
P(leap)=0.00001
P(in)=0.020
P(the)=0.022
P(dark)=0.00013
P(lip)=0.000016
The probabilities computed for word pairs are as follows:
P(prepare for)= 0.0000053
P(for leap)=0.00000001
P(leap in)=0.000001
P(in the)= 0.0053
P(the dark)=0.000037
P(for lip)= 0.00000002
P(lip in)= 0.00000025
Word pairs generally occur with much lower frequency than individual words, which matches intuition.
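These probabilities are simply the counts above divided by the corpus size \(N = 100{,}106{,}008\). A minimal sketch that reproduces them (the dictionary names `word_counts` / `pair_counts` are mine; the numbers are the counts listed above):

```python
N = 100_106_008  # total number of words in the corpus

word_counts = {
    "prepare": 3023, "for": 899331, "leap": 1045, "in": 1970532,
    "the": 2165569, "dark": 13489, "lip": 1592,
}
pair_counts = {
    ("prepare", "for"): 528, ("for", "leap"): 1, ("leap", "in"): 100,
    ("in", "the"): 535036, ("the", "dark"): 3668,
    ("for", "lip"): 2, ("lip", "in"): 25,
}

# Relative frequencies: count / corpus size
word_probs = {w: c / N for w, c in word_counts.items()}
pair_probs = {p: c / N for p, c in pair_counts.items()}

print(word_probs["prepare"])           # ~0.000030
print(pair_probs[("prepare", "for")])  # ~0.0000053
```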
When a bigram model computes a probability, the current word is conditioned on the preceding word. We therefore first compute the probability of "prepare" being followed by "for":
\[P(for|prepare) = \frac{P(prepare, for)}{P(prepare)} = \frac{Count(prepare, for)/N}{Count(prepare)/N} = \frac{Count(prepare, for)}{Count(prepare)} = \frac{528}{3023} = 0.17 \]In the same way, the bigram model yields the following conditional probabilities for the words in the sentences:
P(for|prepare)=0.17
P(leap|for)=0.0000011
P(in|leap)=0.096
P(the|in)=0.27
P(dark|the)=0.0017
P(lip|for)=0.0000022
P(in|lip)=0.016
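The same computation as a code sketch, reusing the `word_counts` and `pair_counts` dictionaries from the previous snippet:

```python
def bigram_prob(prev, word):
    # P(word | prev) = Count(prev, word) / Count(prev)
    return pair_counts[(prev, word)] / word_counts[prev]

print(bigram_prob("prepare", "for"))  # 528 / 3023       ~ 0.17
print(bigram_prob("leap", "in"))      # 100 / 1045       ~ 0.096
print(bigram_prob("in", "the"))       # 535036 / 1970532 ~ 0.27
```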
Finally, we compute the probability of each sentence with the unigram model and the bigram model respectively:
S1= “prepare for leap in the dark”
S2= “prepare for lip in the dark”
unigram:
\[P(S1) = P(prepare) * P(for) * P(leap) * P(in) * P(the) * P(dark) \\ = 0.000030*0.0090*0.00001*0.02*0.022*0.00013 \\ = 1.54 \times 10^{-19} \]\[P(S2) = P(prepare) * P(for) * P(lip) * P(in) * P(the) * P(dark) \\ = 0.000030*0.0090*0.000016*0.02*0.022*0.00013 \\ = 2.46 \times 10^{-19} \]bigram:
\[P(S1) = P(prepare) * P(for | prepare) * P(leap | for) * P(in | leap) * P(the | in) * P(dark | the) \\ = 0.000030* 0.17*0.0000011*0.096*0.27*0.0017 \\ = 2.47 \times 10^{-16} \]\[P(S2) = P(prepare) * P(for | prepare) * P(lip | for) * P(in | lip) * P(the | in) * P(dark | the) \\ = 0.000030* 0.17*0.0000022*0.016*0.27*0.0017 \\ = 8.24 \times 10^{-17} \]We can observe that:
- Under the unigram estimate \(P(S1) < P(S2)\), whereas under the bigram estimate \(P(S1) > P(S2)\); since S1 is the intended sentence, this shows that the bigram model is more effective than the unigram model.
- \(P_{bigram}(S1)/P_{unigram}(S1) = 1604\), while \(P_{bigram}(S2)/P_{unigram}(S2) = 340\): the bigram model also gives sharper estimates of the sentences, with the larger gain going to the correct sentence S1.
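A short sketch of this final comparison, again reusing the probability tables and `bigram_prob` from the earlier snippets; `unigram_score` and `bigram_score` are illustrative helper names:

```python
import math

s1 = "prepare for leap in the dark".split()
s2 = "prepare for lip in the dark".split()

def unigram_score(tokens):
    # Product of the individual word probabilities
    return math.prod(word_probs[w] for w in tokens)

def bigram_score(tokens):
    # First word uses its unigram probability; the rest are conditioned on the previous word
    score = word_probs[tokens[0]]
    for prev, word in zip(tokens, tokens[1:]):
        score *= bigram_prob(prev, word)
    return score

for s in (s1, s2):
    print(" ".join(s), unigram_score(s), bigram_score(s))
# Up to rounding, these match the hand-computed values above:
# the unigram model scores S2 higher, while the bigram model scores S1 higher.
```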
Ref:
[1] https://stanford-cs324.github.io/winter2022/
[2] https://github.com/datawhalechina/so-large-lm