首页 > 其他分享 >3.Generative_models_for_discrete_data

3.Generative_models_for_discrete_data

时间:2022-11-26 19:46:38浏览次数:53  
标签:plt models cfrac alpha tag theta Generative mathcal data

目录

3 Generative models for discrete data

3.1 Bayesian concept learning

The psychological research has shown that people can learn concepts from positive examples alone.

3.1.1 Prior

The prior is the mechanism by which background knowledge can be brought to bear on a problem. Without this, rapid learning (i.e., from small samples sizes) is impossible.

首先是将背景知识用于解决问题的机制。如果没有这一点,快速学习(即快速学习)(即从较小的样本量)是不可能的。

3.1.2 Posterior

The posterior is the likelihood times the prior, normalized.

\[\begin{align*} p(h|\mathcal{D}) &= \cfrac{p(\mathcal{D}|h)p(h)}{\sum_{h'\in \mathcal{H}}p(\mathcal{D},h')}\\ &=\cfrac{p(h)\mathbb{I}(\mathcal{D}\in h)/|h|^N}{\sum_{h'\in \mathcal{D}}p(h')\mathbb{I}(\mathcal{D}\in h')/|h'|^N} \tag{3.1} \end{align*} \]

In general, when we have enough data, the posterior \(p(h|\mathcal{D})\) becomes peaked on a single concept, namely the Maximum A Posteriori (MAP) estimate.

\[p(h|\mathcal{D}) \rightarrow \delta_{\hat{h}^{MAP}}(h) \tag{3.2} \]

Where \(\hat{h}^{MAP} = {\rm argmax}_h p(h|\mathcal{D})\) is the posterior mode, and \(\delta\) is the Dirac measure defined by

\[\begin{align*} \delta_x(A) = \left\{\begin{matrix} 1 & {\rm if}\ x\in A\\ 0 & {\rm if}\ x\notin A \end{matrix}\right. \end{align*} \tag{3.3} \]

\[\hat{h}^{MAP} = \sideset{}{}{\rm argmax}_h p(\mathcal{D}|h)p(h) = \sideset{}{}{\rm argmax}_h [\log p(\mathcal{D}|h) +\log p(h)] \tag{3.4} \]

Since the likelihood term depends exponentially on N, and the prior stays constant, as we get more and more data, the MAP estimate converges towards the maximum likelihood estimate or MLE:

\[\hat{h}^{MLE} = \sideset{}{}{\rm argmax}_h\ p(\mathcal{D}|h)=\sideset{}{}{\rm argmax}_h\ {\rm log}\ p(\mathcal{D}|h) \tag{3.5} \]

3.2.3 Posterior predictive distribution

The posterior is our internal belief state about the world. The way to test if our beliefs are justified is to use them to predict objectively observable quantities (this is the basis of the scientific method). Specifically, the posterior predictive distribution is given by

\[p(\widetilde{x}\in C|\mathcal{D})=\sum_h p(y=1|\widetilde{x},h)p(h|\mathcal{D}) \tag{3.6} \]

This is just a weighted average of the predictions of each individual hypothesis and is called Bayes model averaging.

image-20221119233038890

Python 实现原理

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec

plt.rc('font', family='Times New Roman')

data = {'powers of 4': [4**(i+1) for i in range(3)], 'powers of 2': [2**(i+1) for i in range(6)], 'ends in 6': [6+i*10 for i in range(10)], 'squares': [(i+1)**2 for i in range(10)], 'even': [2*(i+1) for i in range(50)], 'mult of 8': [8*(i+1) for i in range(12)],'mult ofi 4': [4*(i+1) for i in range(24)], 'all': [i+1 for i in range(100)], 'powers of 2 - {32}': [2,4,8,16,64], 'powers of 2+ {37}': [2,4,8,16,32,37,64]}

gs = gridspec.GridSpec(15,12)
fig = plt.figure()
ax1 = plt.subplot(gs[0:4, 0:10])
ax2 = plt.subplot(gs[4:15, 0:10])
ax3 = plt.subplot(gs[4:15, 10:12])

plt.sca(ax2)
i=9
for key in data:

    plt.plot(data[key], [i for j in range(len(data[key]))],'.', color = '#374888')
    i-=1
plt.grid(linestyle='-', axis='y')
plt.axis([0,100.5,-0.5,9.5])
plt.yticks([i for i in range(10)],list(data.keys())[::-1])

plt.sca(ax3)
p_h = [0,0,0.05,0.1,0.2,0.25,0.28,0.3,0.38,0.5]
y_h = [i for i in range(10)]
plt.plot(p_h,y_h,'-',color='#374888')
plt.plot(p_h,y_h,'.',color='#374888')
plt.axis([-0.05,1,-0.5,9.5])
plt.yticks([])
ax3.spines['top'].set_visible(False)
ax3.spines['right'].set_visible(False)

plt.sca(ax1)
x_dist = np.arange(1,101,1)
y_pp = []
ph = p_h.reverse()
for i in range(len(x_dist)):
    y_pp_temp = 0
    j=0
    for key in data:
        if x_dist[i] in data[key]:
            y_pp_temp += 1/len(data[key])* (p_h[j])
        j+=1
    y_pp.append(y_pp_temp)
plt.bar(x_dist,y_pp, color='#282A65')
plt.xticks([])
plt.ylim([0,0.5])
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)

plt.tight_layout()
plt.show()
3.4 Posterior predictive distribution

When we have a small and/or ambiguous dataset, the posterior \(p(h|\mathcal{D})\) is vague, which induces a broad predictive distribution. However, once we have “figured things out”, the posterior becomes a delta function centered at the MAP estimate. In this case, the predictive distribution

\[p(\widetilde{x}\in C|\mathcal{D})= \sum_{h}p(\widetilde{x}|h)\delta_{\hat{h}}(h) = p(\widetilde{x}|\hat{h}) \tag{3.7} \]

This is called a plug-in approximation to the predictive density and is very widely used, due to its simplicity.

3.2 The beta-binomial model

3.2.1 Likelihood

Suppose \(X_i \sim {\rm Ber}(\theta)\), where \(X_i={1,0}\) and \(\theta \in [0,1]\) is the rate parameter. If the data are iid, the likelihood has the form

\[p(\mathcal{D|\theta}) = \theta^{N_1}(1-\theta)^{N_0} \tag{3.8} \]

Where we have \(N_1 = \sum_{i=1}^N \mathbb{I}(x_i=1)\) and \(N_0 = \sum_{i=1}^N \mathbb{I}(x_i=0)\)

Suppose the data consists of the count of the number of \(N_1\) observed in a fixed number \(N = N_1 + N_0\) of trials. In this case, we have \(N_1 ∼ Bin(N, θ)\), where Bin represents the binomial distribution, which has the following pmf:

\[{\rm Bin}(k|n,\theta) \triangleq \binom{n}{k}\theta^{k}(1-\theta)^{n-k} \tag{3.9} \]

3.2.2 Prior

\[p(\theta) \propto p(\mathcal{D}|\theta)p(\theta) = \theta^{N_1}(1-\theta)^{N_0}\theta^{\gamma_1}(1-\theta)^{\gamma_2} = \theta^{N_1+\gamma_1}(1-\theta)^{N_0+\gamma_2} \tag{3.10} \]

When the prior and the posterior have the same form, we say that the prior is a conjugate prior for the corresponding likelihood.

In the case of the Bernoulli, the conjugate prior is the beta distribution

\[{\rm Beta}(\theta|a,b) \propto \theta^{a-1}(1-\theta)^{b-1} \tag{3.11} \]

The parameters of the prior are called hyper-parameters.


Beta Distribution

The beta distribution has support over the interval \([0, 1]\)

\[{\rm Beta}(x|a,b) = \cfrac{1}{B(a,b)}x^{a-1}(1-x)^{b-1} \tag{3.12} \]

\[B(a,b) \triangleq \cfrac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} \tag{3.13} \]

mean: \(\cfrac{a}{a+b}\), mode: \(\cfrac{a-1}{a+b-2}\), var: \(\cfrac{ab}{(a+b)^2(a+b+1)}\)

Where \(\Gamma(x)\) is the gamma function.

\[\Gamma(x) \triangleq \int_0^\infty u^{x-1}e^{-u}du \tag{3.14} \]


3.2.3 Posterior

If we multiply the likelihood by the beta prior we get the following posterior:

\[p(\theta| \mathcal{D}) \propto {\rm Bin}(N_1|\theta,N_0+N_1){\rm Beta}(\theta|a,b) \propto {\rm Beta}(\theta|N_1+a,N_0+b) \tag{3.15} \]

image-20221120232058151

Python 实现原理

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
plt.rc('font', family='Times New Roman')

x= np.linspace(0,1,100)
prior_Be = stats.beta(a=2,b=2).pdf(x)
# Ber -> N_1 = 3 , N2 = 17
likeli_Be = stats.beta(a=4,b=18).pdf(x)
posteri_Be = stats.beta(a=5,b=19).pdf(x)

prior_Be2 = stats.beta(a=5,b=2).pdf(x)
# Ber -> N_1 = 11 , N2 = 13
likeli_Be2 = stats.beta(a=12,b=14).pdf(x)
posteri_Be2 = stats.beta(a=16,b=15).pdf(x)

fig,ax = plt.subplots(1,2, figsize=(10,4),)

plt.sca(ax[0])
plt.plot(x,prior_Be, color='#C73834',label='prior Be(2,2)')
plt.plot(x,likeli_Be, 'k--', label='lik Be(4,18)')
plt.plot(x,posteri_Be,'-.', color='#374888', label='post Be(5,19)')
plt.legend()
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
plt.axis([0.01,1,0.01,6])

plt.sca(ax[1])
plt.plot(x,prior_Be2, color='#C73834',label='prior Be(5,2)')
plt.plot(x,likeli_Be2, 'k--', label='lik Be(12,14)')
plt.plot(x,posteri_Be2,'-.', color='#374888', label='post Be(16,15)')
plt.legend()
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
plt.axis([0.01,1,0.01,4.5])

plt.tight_layout()
plt.savefig('./Figure/3.6Beta-binomial分布.png', dpi=600)
plt.show()
3.6Beta-binomial分布

Updating the posterior sequentially is equivalent to updating in a single batch. Suppose we have two data sets \(\mathcal{D_a}\) and \(\mathcal{D_b}\) with sufficient statistics \(N_1^a\), \(N_0^a\) and \(N_1^b\) , \(N_0^b\) . Let \(N_1 = N_1^a + N_1^b\) and \(N_0 = N_0^a + N_0^b\) be the sufficient statistics of the combined datasets. In batch mode we have

\[p(\theta| \mathcal{D_a,D_b}) \propto {\rm Bin}(N_1|\theta, N_1+N_0){\rm Beta}(\theta|a,b) \propto {\rm Beta}(\theta|N_1+a,N_0+b) \tag{3.16} \]

In sequential mode, we have

\[\begin{align*} p(\theta|\mathcal{D_a,D_b}) &\propto p(\mathcal{D_b|\theta})p(\mathcal{\theta|D_a}) \\ & \propto {\rm Bin}(N_1^b|\theta, N_1^b+N_0^b){\rm Beta}(\theta|N_1^a+a,N_0^a+b) \\ & \propto {\rm Beta}(\theta|N_1^a+N_1^b+a, N_0^a+N_0^b+b) \tag{3.17} \end{align*} \]

3.2.3.1 Posterior mean and mode

The MAP estimate is given by

\[\hat{\theta}_{MAP} = \cfrac{a+N_1-1}{a+b+N-2} \tag{3.18} \]

If we use a uniform prior, then the MAP estimate reduces to the MLE, which is just the empirical fraction of heads:

\[\hat{\theta}_{MLE} = \cfrac{N_1}{N} \tag{3.19} \]

The posterior mean is given by,

\[\overline{\theta} = \cfrac{a+N_1}{a+b+N} \tag{3.20} \]

Let \(\alpha_0 = a+b\), the posterior mean is given by

\[\mathbb{E}[\theta|\mathcal{D}] = \cfrac{\alpha_0m_1+N_1}{N+\alpha_0} = \cfrac{\alpha_0}{N+\alpha_0}m_1+\cfrac{N}{N+\alpha_0}\cfrac{N_1}{N} = \lambda m_1 + (1-\lambda)\hat{\theta}_{MLE} \tag{3.21} \]

where \(\lambda = \cfrac{\alpha_0}{N+\alpha_0}\) is the ratio of the prior to posterior equivalent sample size. The weaker the prior, the smaller is \(λ\), and hence the closer the posterior mean is to the MLE.

3.2.3.2 Posterior variance

The variance of the Beta posterior is given by

\[{\rm var}[\theta|\mathcal{D}] = \cfrac{(a+N_1)(b+N_0)}{(a+N_1+b+N_0)^2(a+N_1+b+N_0+1)} \tag{3.22} \]

We can simplify this formidable expression in the case that \(N ≫ a, b\), to get

\[{\rm var}[\theta|\mathcal{D}] \approx \cfrac{N_1+N_0}{NNN} = \cfrac{\hat{\theta}(1-\hat{\theta})}{N} \tag{3.23} \]

where \(\hat{\theta}\) is the MLE. Hence the “error bar” in our estimate (i.e., the posterior standard deviation), is given by

\[\sigma = \sqrt{{\rm var}[\theta|\mathcal{D}]} \approx \sqrt{\cfrac{\hat{\theta}(1-\hat{\theta})}{N}} \tag{3.24} \]

3.2.4 Posterior predictive distribution

Consider predicting the probability of heads in a single future trial under a \({\rm Beta}(a, b)\) posterior. We have

\[\begin{align*} p(\widetilde{x}=1|\mathcal{D}) &= \int_0^1 p(x=1|\theta)p(\theta|\mathcal{D})d\theta \\ & = \int_0^1 \theta {\rm Beta}(\theta|a,b)d\theta = \mathbb{E}[\theta|\mathcal{D}] = \cfrac{a}{a+b} \tag{3.25} \end{align*} \]

The mean of the posterior predictive distribution is equivalent (in this case) to plugging in the posterior mean parameters: \(p(\widetilde{x}|\mathcal{D}) = {\rm Ber}(\widetilde{x}|\mathbb{E}[\theta|\mathcal{D}])\)

3.2.4.1 Overfitting and the black swan paradox

Suppose instead that we plug-in the \(MLE\), i.e., we use \(p(\widetilde{x}|\mathcal{D}) ≈ {\rm Ber}(\widetilde{x}|\hat{\theta}_{MLE})\). Unfortunately, this approximation can perform quite poorly when the sample size is small. This is called the zero count problem or the sparse data problem, and frequently occurs when estimating counts from small amounts of data.

The zero-count problem is analogous to a problem in philosophy called the black swan paradox.


举个栗子

假定我们使用均匀分布作为先验,那么在先验的Beta分布中, \(a=b=1\),在这种情况下,插入后验平均值给出了拉普拉斯的继承规则(Laplace's rule of succession

\[p(\widetilde{x}=1|\mathcal{D}) = \cfrac{N_1+1}{N_1+N_0+2} \tag{3.26} \]

This justifies the common practice of adding 1 to the empirical counts, normalizing and then plugging them in, a technique known as add-one smoothing.

Plugging in the MAP parameters would not have this smoothing effect, since the mode has the form \(\hat{\theta} = \cfrac{N_1+a-1}{N+a+b-2}\), which becomes the MLE if \(a=b=1\)


3.2.4.2 Predicting the outcome of multiple future trials

Suppose now we were interested in predicting the number of heads, \(x\), in \(M\) future trials. This is given by

\[\begin{align*} p(x|\mathcal{D},M) &= \int_0^1{\rm Bin}(x|\theta,M){\rm Beta}(\theta|a,b)d\theta \\ &=\binom{M}{x}\cfrac{1}{B(a,b)}\int_0^1 \theta^x(1-\theta)^{M-x}\theta^{a-1}(1-\theta)^{b-1}d\theta \tag{3.27} \end{align*} \]

We recognize the integral as the normalization constant for a \({\rm Beta}(a+x, M−x+b)\) distribution.

\[\int_0^1\theta^x(1-\theta)^{M-x}\theta^{a-1}(1-\theta)^{b-1}d\theta = B(x+a,M-x+b) \tag{3.28} \]

Thus we find that the posterior predictive is given by the following, known as the (compound) beta-binomial distribution:

\[Bb(x|a,b,M) \triangleq \binom{M}{x} \cfrac{B(x+a,M-x+b)}{B(a,b)} \tag{3.29} \]

This distribution has the following mean and variance

\[\mathbb{E}[x] = M\cfrac{a}{a+b}, \quad {\rm var}[x]= \cfrac{Mab}{(a+b)^2} \cfrac{(a+b+M)}{a+b+1} \tag{3.30} \]


Given \(M=10\), we start with a \({\rm Beta}(2,2)\) prior, and (b) plot the posterior predictive density after seeing \(N_1 = 3\) heads and \(N_0 = 17\) tails. (b) plots a plug-in approximation using a MAP estimate. We see that the Bayesian prediction has longer tails, spreading its probablity mass more widely, and is therefore less prone to overfitting and blackswan type paradoxes.

image-20221121154613905


Python 实现原理

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
plt.rc('font', family='Times New Roman')

'''
prior ~ Beta(2,2)
observation ~ N1=3 ,N0=17
p(\theta | D) ~ Beta(a+N1,b+N0) -> Beta(5,19)
'''
n, a, b = 10,5,19
x = np.arange(0,10,1)
ppd_1=stats.betabinom(n, a, b).pmf(x)

'''
\theta_MAP = \frac{a+N1-1}{a+b+N-2} = 4/23
x ~ Bin(n,theta)
'''
n, p= 10, 4/23
ppd_2 = stats.binom.pmf(x,n=n,p=p)

_, ax = plt.subplots(1,2,figsize=(10,4))

plt.sca(ax[0])
plt.bar(x, ppd_1, color='#282A65')
plt.axis([-0.8,10.8,0,0.35])
plt.xlabel('',fontsize=11)
plt.ylabel('',fontsize=11)
plt.title('Posterior Predictive', fontsize=12)
plt.grid(linestyle='--')
plt.xticks(np.arange(0,11,1))

plt.sca(ax[1])
plt.bar(x, ppd_2, color='#282A65')
plt.axis([-0.8,10.8,0,0.35])
plt.xlabel('',fontsize=11)
plt.ylabel('',fontsize=11)
plt.title('Plug-in Predictive', fontsize=12)
plt.grid(linestyle='--')
plt.xticks(np.arange(0,11,1))

plt.tight_layout()
plt.savefig('./Figure/3.7后验预测分布.png',dpi=600)
plt.show()
3.7后验预测分布

3.3 The Dirichlet- multinomial model

3.3.1 Likelihood

Suppose we observe \(N\) dicerte rolls, \(\mathcal{D} = \{x_1,...,x_N\}\), where \(x_i ∈ \{1,\ldots,K\}\). If we assume the data is iid, the likelihood has the form

\[p(\mathcal{D|\theta}) = \prod_{k=1}^K\theta_k^{N_k} \tag{3.31} \]

Where \(N_k = \sum_{i=1}^N\mathbb{I}(y_i=k)\) is the number of times event k occured.

3.3.2 Prior

Since the parameter vector lives in the \(K\)-dimensional probability simplex, we need a prior that has support over this simplex. Ideally it would also be conjugate. Fortunately, the Dirichlet distribution satisfies both criteria. So we will use the following prior:

\[\mathrm{Dir}(\boldsymbol{\theta}|\alpha) = \cfrac{1}{B(\alpha)} \prod_{k=1}^K\theta_k^{\alpha_k-1}\mathbb{I}(\mathrm{x}\in S_k) \tag{3.32} \]


Dirichlet distribution

A multivariate generalization of the beta distribution is the Dirichlet distribution , which has support over the probability simplex,

\[S_K = {{\rm x}: 0\le x_k \le 1, \sum_{k=1}^Kx_k=1} \tag{3.33} \]

\[{\rm Dir}({\rm x}|\alpha) \triangleq \cfrac{1}{B(\alpha)}\prod_{k=1}^{K}x_k^{\alpha_k-1}\mathbb{I}(x \in S_K) \tag{3.34} \]

where \(\alpha_0 \triangleq \sum_{k=1}^K \alpha_k\)

\(\mathbb{E}[x_k]=\cfrac{\alpha_k}{\alpha_0}\), \({\rm mode}[x_k] = \cfrac{\alpha_k-1}{\alpha_0-K}\), \({\rm var}[x_k] = \cfrac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}\)


3.3.3 Posterior

Multiplying the likelihood by the prior, we find that the posterior is also Dirichlet:

\[\begin{align*} p(\boldsymbol{\theta}|\mathcal{D}) &\propto p(\mathcal{D}|\boldsymbol{\theta})p(\boldsymbol{\theta}) \\ & \propto \prod_{k=1}^K\theta_k^{N_k}\theta_k^{\alpha_k-1} = \prod_{k=1}^K \theta_k^{N_k+\alpha_k-1} \\ & = \mathrm{Dir}(\boldsymbol{\theta}|\alpha_1+N_1,\ldots,\alpha_K+N_K) \tag{3.35} \end{align*} \]

We can derive the mode of this posterior (i.e., the MAP estimate) by using calculus. However, we must enforce the constraint that \(\sum_k\theta_k = 1.^2\). We can do this by using a Lagrange multiplier. The constrained objective function, or Lagrangian, is given by the log likelihood plus log prior plus the constraint:

\[\ell(\boldsymbol{\theta},\lambda) = \sum_k N_k \mathrm{log}\theta_k + \sum_k(\alpha_k-1)\mathrm{log}\theta_k +\lambda\left(1-\sum_{k}\theta_k \right) \tag{3.36} \]

To simplify notation, we define \(N_k^{'} = \triangleq N_k+\alpha_k-1\), Taking derivatives with respect to \(λ\) yields the original constraint:

\[\cfrac{\partial \ell}{\partial \lambda} = \left( 1-\sum_k\theta_k \right) =0 \tag{3.37} \]

Taking derivatives with respect to \(θ_k\) yields

\[\cfrac{\partial \ell}{\partial \theta_k} = \cfrac{N_k^{'}}{\theta_k}-\lambda=0 \tag{3.38} \]

\[N_k^{'} = \lambda\theta_k \tag{3.39} \]

We can solve for \(λ\) using the sum-to-one constraint:

\[\sum_k N_k^{'} = \lambda\sum_k\theta_k \tag{3.40} \]

\[N+\alpha_0-K = \lambda \tag{3.41} \]

where \(\alpha_0 \triangleq \sum_{k=1}^{K}\alpha_k\) is the equivalent sample size of the prior. Thus the MAP estimate is given by

\[\hat{\theta}_k = \cfrac{N_k+\alpha_k-1}{N+\alpha_0-K} \tag{3.42} \]

If we use a uniform prior, \(\alpha_k=1\), we recover the MLE:

\[\hat{\theta}_k = \cfrac{N_k}{N} \tag{3.43} \]

3.3.4 Posterior predictive

The posterior predictive distribution for a single multinoulli trial is given by the following expression:

\[\begin{align*} p(X=j|\mathcal{D}) &=\int p(X=j|\boldsymbol{\theta})p(\boldsymbol{\theta}|\mathcal{D})d\boldsymbol{\theta} \\ & = \int p(X=j|\theta_j)\left[ \int p(\boldsymbol{\theta}_{-j},\theta_j|\mathcal{D})d\boldsymbol{\theta}_{-j} \right]d\theta_j \\ & = \int \theta_j p(\theta_j|\mathcal{D})d\theta_j = \mathbb{E}[\theta_j|\mathcal{D}] = \cfrac{\alpha_j+N_j}{\sum_k(\alpha_k+N_k)} = \cfrac{\alpha_j+N_j}{\alpha_0+N} \tag{3.44} \end{align*} \]

where \(\theta_{−j}\) are all the components of \(\boldsymbol{\theta}\) except \(θ_j\).

3.4 Naive Bayes classifiers

Specifing the class conditional distribution,\(p{(\mathbf{x}|y=c)}\). The simplest approach is to assume the features are conditionally independent given the class label. This allows us to write the class conditional density as a product of one dimensional densities:

\[p(\mathbf{x}|y=c,\boldsymbol{\theta}) = \prod_{j=1}^{\mathcal{D}}p(x_j|y=c,\boldsymbol{\theta}_{jc}) \tag{3.45} \]

Where \(\mathcal{D}\) is the number of features.

The resulting model is called a naive Bayes classifier (NBC). The model is called “naive” since we do not expect the features to be independent, even conditional on the class label.

The form of the class-conditional density depends on the type of each feature.

  • In the case of real-valued features, we can use the Gaussian distribution: \(p(\mathbf{x}|y=c,\boldsymbol{\theta}) = \prod_{j=1}^D \mathcal{N}(x_j|\mu_{jc},\sigma^2_{jc})\), where \(\mu_{jc}\) is the mean of feature \(j\) in objects of class \(c\), and \(\sigma^2_{jc}\) is its variance.
  • In the case of binary features, \(x_j=\{0,1\}\), we can use the Bernoulli distribution: \(p(\mathbf{x}|y=c,\boldsymbol{\theta}) = \prod_{j=1}^\mathcal{D}\mathrm{Ber}(x_j|\mu_{jc})\), where \(\mu_{jc}\) is the probability that feature \(j\) occurs in class \(c\). This is sometimes called the multivariate Bernoulli naive Bayes model.
  • In the case of categorical features, \(x_j \in \{1,\ldots, K\}\), we can model use the multinoulli distribution: \(p(\mathbf{x}|y=c,\boldsymbol{\theta}) = \prod_{j=1}^{\mathcal{D}} \mathrm{Cat}(x_j|\mu_{jc})\), where \(\mu_{jc}\) is a histogram over the \(K\) possible values for \(x_j\) in class \(c\).

分类分布、Categorical 分布

\[P(X=x_k|\theta_1,\theta_2,\ldots,\theta_K) = \prod_{k=1}^K\theta_k^{x_k} \]

\[\sum_{k=1}^K \theta_k=1, \quad x_k\in\{0,1\}, \quad k\in\{1,2,\ldots,K\} \]


3.4.1 Model fitting

3.4.1.1 MLE for NBC

The probability for a single data case is given by

\[p(\mathbf{x}_i,y_i|\boldsymbol{\theta}) = p(y_i|\pi)\prod_{j}p(x_{ij}|\boldsymbol{\theta}_j) = \prod_c \pi_c^{\mathbb{I}(y_i=c)} \prod_j \prod_c p(x_{ij}|\boldsymbol{\theta}_{jc})^{\mathbb{I}(y_i=c)} \tag{3.46} \]

Hence the log-likelihood is given by

\[\log p(\mathcal{D}|\boldsymbol{\theta}) = \sum_{c=1}^C N_c \log \pi_c + \sum_{j=1}^{\mathcal{D}}\sum_{c=1}^C \sum_{i:y_i=c}\log p(x_{ij}|\boldsymbol{\theta}_{jc}) \tag{3.47} \]

We see that this expression decomposes into a series of terms, one concerning \(\textcolor{red}{π}\), and DC terms containing the \(θ_{jc}\)’s. Hence we can optimize all these parameters separately.

the MLE for the class prior is given by

\[\hat{\pi}_c = \cfrac{N_c}{N} \tag{3.48} \]

where \(N_c \triangleq \sum_i \mathbb{I}(y_i=c)\) is the number of examples in class \(c\).

The MLE for the likelihood depends on the type of distribution we choose to use for each feature. For simplicity, let us suppose all features are binary, so \(x_j|y = c \sim \mathrm{Ber}(\theta_{jc})\), In this case, the MLE becomes

\[\hat{\theta}_{jc} = \cfrac{N_{jc}}{N_c} \tag{3.49} \]

3.4.1.2 Bayesian navie Bayes

The trouble with maximum likelihood is that it can overfit. ==> \(\theta_{jc}=1\)

The anthor problem is that balck swan paradox. ==> \(p(y=c|\mathbf{x},\hat{\boldsymbol{\theta}})\)

A simple solution to overfitting is to be Bayesian. For simplicity, we will use a factored prior:

\[p(\boldsymbol{\theta}) = p(\pi)\prod_{j=1}^D\prod_{c=1}^Cp(\theta_{jc}) \tag{3.50} \]

We will use a \(\mathrm{Dir}(\boldsymbol{\alpha})\) prior for \(π\) and a \(\mathrm{Beta}(β_0,β_1)\) prior for each \(\theta_{jc}\). Often we just take \(\boldsymbol{\alpha} = 1\) and \(\boldsymbol{\beta} = 1\), corresponding to add-one or Laplace smoothing.

\[p(\boldsymbol{\theta}|\mathcal{D}) = p(\boldsymbol{\pi}|\mathcal{D})\prod_{j=1}^D\prod_{c=}^{C}p(\theta_{jc}|\mathcal{D}) \tag{3.51} \]

\[p(\boldsymbol{\pi}|\mathcal{D}) = \mathrm{Dir}(N_1+\alpha_1,\ldots,N_C+\alpha_C) \tag{3.52} \]

\[p(\theta_{jc}|\mathcal{D}) = \mathrm{Beta}((N_c-N_{jc})+\beta_0, N_{jc}+\beta_1) \tag{3.53} \]

3.4.2 Using the model for prediction

\[p(y=c|\mathbf{x},\mathcal{D}) \propto p(y=c|\mathcal{D})\prod_{j=1}^{D}p(x_j|y=c,\mathcal{D}) \tag{3.54} \]

The correct Bayesian procedure is to integrate out the unknown parameters:

\[\begin{align*} p(y=c|\mathbf{x},\mathcal{D}) &\propto \left[\int \mathrm{Cat(y=c|\boldsymbol{\pi})}p(\boldsymbol{\pi}|\mathcal{D})d\boldsymbol{\pi} \right] \prod_{j=1}^{D}\left[ \int \mathrm{Ber}(x_j|y=c,\theta_{jc})p(\theta_{jc}|\mathcal{D})d\theta_{jc} \right] \tag{3.55} \end{align*} \]

\[p(y=c|\mathbf{x},\mathcal{D}) \propto \overline{\pi}_c\prod_{j=1}^D(\overline{\theta}_{jc})^{\mathbb{I}(x_j=1)}(1-\theta_{jc})^{\mathbb{I}(x_j=0)} \tag{3.56} \]

\[\overline{\theta}_{jc} = \cfrac{N_{jc}+\beta_1}{N_c+\beta_0+\beta_1} \tag{3.57} \]

\[\overline{\pi}_c = \cfrac{N_c+\alpha_c}{N+\alpha_0} \tag{3.58} \]

where \(\alpha_0 = \sum_{c}\alpha_c\)

If we have approximated the posterior by a single point, \(p(\boldsymbol{\theta}|\mathcal{D}) \approx \delta_{\hat{\boldsymbol{\theta}}}(\boldsymbol{\theta})\), where \(\hat{\boldsymbol{\theta}}\) may be the ML or MAP estimate, then the posterior predictive density is obtained by simply plugging in the parameters.

\[p(y=c|\mathbf{x},\mathcal{D}) \propto \hat{\pi}_c \prod_{j=1}^{D}\hat{\theta}_{jc}^{\mathbb{I}(x_j=1)}(1-\hat{\theta_{jc}})^{\mathbb{I}(x_j=0)} \tag{3.59} \]

3.4.3 The log-sum-exp trick

The problem is that$ p(\mathbb{x}|y = c) $ is often a very small number, especially if \(\mathbf{x}\) is a high-dimensional vector. This is because we require that \(\sum_{\mathbf{x}} p(\mathbf{x}|y) = 1\), so the probability of observing any particular high-dimensional vector is small. The obvious solution is to take logs when applying Bayes rule.

\[p(y=c|\mathbf{x}) = \cfrac{p(\mathbf{x}|y=c)p(y=c)}{\sum_{c'=1}^{C}p(y=c',\mathbf{x})} \tag{3.60} \]

\[\begin{align*} \log p(y=c|\mathbf{x}) &= \log p(\mathbf{x}|y=c)+\log p(y=c) - \log \sum_{c'=1}^C p(\mathbf{x},y=c')\\ \rm{where} \quad b_c &= \log p(\mathbf{x}|y=c)+\log p(y=c) \\ \log p(y=c|\mathbf{x}) &= b_c - \log \left [ \sum_{c'=1}^C e^{b_{c'}} \right ] \tag{3.61} \end{align*} \]

We can’t add up in the log domain. Fortunately, we can factor out the largest term, and just represent the remaining numbers relative to that.

\[\log \sum_c{e^{b_c}} = \log \left[ (\sum_c{e^{b_c-B}})e^B \right] = \log \left[ (\sum_c{e^{b_c-B}})\right]+B \tag{3.62} \]

Where \(B=\max_cb_c\), This is called the log-sum-exp trick, and is widely used.

3.4.4 Feature selection using mutual information

Since an NBC is fitting a joint distribution over potentially many features, it can suffer from overfitting. In addition, the run-time cost is \(O(D)\), which may be too high for some applications. One common approach to tackling both of these problems is to perform feature selection. The simplest approach to feature selection is to evaluate the relevance of each feature separately, and then take the top K, where K is chosen based on some tradeoff between accuracy and complexity. This approach is known as variable ranking, filtering, or screening.

\[I(X,Y) = \sum_{x_j}\sum_{y}p(x_j,y)\log \cfrac{p(x_j,y)}{p(x_j)p(y)} \tag{3.63} \]

If the features are binary, the MI can be computed as follows

\[I_j = \sum_{c} \left[ \theta_{jc}\pi_c\log\cfrac{\theta_{jc}}{\theta_j} + (1-\theta_{jc})\pi_c\cfrac{1-\theta_{jc}}{1-\theta_j} \right] \tag{3.64} \]

where \(\pi_c = p(y=c)\), \(\theta_{jc} = p(x_j=1|y=c)\), and \(\theta_j =p(x_j=1) = \sum_c\pi_c\theta_{jc}\)

标签:plt,models,cfrac,alpha,tag,theta,Generative,mathcal,data
From: https://www.cnblogs.com/zhangjianhua0728/p/16928129.html

相关文章

  • DataFrame的级联and合并操作
    DataFrame的级联and合并操作级联操作(横向或纵向的拼接)pd.concatpd.appendimportpandasaspdimportnumpyasnp pandas使用pd.concat函数,与np.concatena......
  • 使用了外键的models 结合 Modelviewset的写法
    Serializers    重写create方法即可classMemberSerializers(serializers.ModelSerializer):defcreate(self,validate):users=Member.objects.......
  • 使用Docker恢复Mysql8备份的Data数据 - 诚哥博客
    前段时间公司数据库服务器崩溃启动不起来,经过我初步排查是/home目录挂载不上导致启动不起来,虽然通过注释/etc/fstab中的挂载信息,启动了,但是发现mysql数据有缺失,还好有例......
  • 【Datawhale】动手学数据分析
    动手学数据分析第一章:数据载入及初步观察载入数据任务一:导入numpy和pandasimportnumpyasnpimportpandasaspd任务二:载入数据train_data=pd.read_csv("train......
  • java出现javax/xml/bind/DatatypeConverter错误
     今天在使用Jwt生成token的时候出现的错误,错误如下:Exceptioninthread"main"java.lang.NoClassDefFoundError:javax/xml/bind/DatatypeConverter atio.jsonwebtoken......
  • SpringDataJpa源码理解
    SpringDataJpa源码理解上一篇讲解了SpringDataJpa的基本使用,下面简单说一下源码我们以其中的一个test为案例进行分析: 我们可以发现resumeDao它是一个代理对象,类型是Si......
  • datatable 转list
    C#DataTable转List方法,网上有好多,之前也收集了,感觉这个也不错,重要是自己要领会这里面的代码含义。接不来我就把代码贴出来分享一下,大家觉得如果不好,请留言我,我来改进。u......
  • 利用pandas的read_csv()读取data文件
    1.pycharm新建一个项目,如Python_paper_experiment2.下载一个.data数据集,如UCI数据集之iris.data(150个样本,五个属性:"sepallength","sepalwidth","petallength","pet......
  • 调试micro python WiFi 通信,用网络调试助手发送中文接时,运行到recv_data.decode('utf-
    问题:调试micropythonWiFi通信,用网络调试助手发送中文接时,运行到recv_data.decode('utf-8')出错了解决:用祥哥的写的Python代码发送中文没问题......
  • linear jump Markov models
    只有当雅可比矩阵存在时,才能应用线性化。但是,情况并非总是如此。一些系统包含不连续性(例如,过程模型可能是跳跃线性的[14],其中参数可以突然变化[1],或者传感器可能返回高度量......