Markov Chain & Stationary Distribution
Def(Finite Markov Chain). Let \(\Omega\) be a finite set of states and let \(P(x,y)\ge 0\), \(x,y\in\Omega\), be a transition function, i.e., \(\sum\limits_{y\in\Omega} P(x,y)=1\) for every \(x\in\Omega\). A finite Markov chain is a sequence of random variables
\[(X_0,X_1,X_2,...) \]such that for all \(t\) and all \(x_0,x_1,...,x_t\in\Omega\) we have
\[\mathbb{P}(X_{t+1}=x_{t+1}\mid X_0=x_0,X_1=x_1,...,X_t=x_t)=\mathbb{P}(X_{t+1}=x_{t+1}\mid X_t=x_t)=P(x_t,x_{t+1}). \]Given the initial distribution \(\mu_0\) over \(\Omega\) and the transition matrix \(P\), we can calculate the distribution after \(t\) transitions
\[\mu_t=\mu_{t-1}P=\mu_0P^t. \]Def. A distribution \(\pi\) is called stationary if \(\pi P=\pi\).
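A minimal numerical sketch of the evolution \(\mu_t=\mu_0P^t\); the 3-state transition matrix below is an arbitrary toy example, not taken from the notes:

```python
import numpy as np

# Toy 3-state transition matrix (rows sum to 1); an arbitrary example.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

mu0 = np.array([1.0, 0.0, 0.0])    # start deterministically in state 0

# Evolve the distribution: mu_t = mu_{t-1} P = mu_0 P^t (row-vector convention).
mu = mu0
for t in range(50):
    mu = mu @ P

print(mu)                          # approaches a stationary distribution
print(np.allclose(mu @ P, mu))     # pi P = pi (approximately)
```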
Lem(Brouwer's fixed point theorem). Every continuous map \(f\) from a nonempty convex, closed, bounded set \(K\subseteq\mathbb{R}^n\) to itself has a fixed point \(x\in K\) with \(f(x)=x\).
Thm(Perron-Frobenius). Any Markov chain over a finite state space has a stationary distribution.
Pf. Let \(|\Omega|=n\) and \(\Delta^{n-1}=\{\mu\in\mathbb{R}^n\mid \mu_i\ge 0,\sum_i\mu_i=1\}\). The simplex \(\Delta^{n-1}\) is convex, closed, and bounded. The map \(\mu\mapsto\mu P\) is continuous (linear) and maps \(\Delta^{n-1}\) to itself, since the entries of \(\mu P\) are nonnegative and sum to \(\sum_x\mu(x)\sum_y P(x,y)=1\). By Brouwer's fixed point theorem there is a stationary distribution \(\pi\in\Delta^{n-1}\) s.t. \(\pi P=\pi\).
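One way to compute such a \(\pi\) numerically is to treat \(\pi P=\pi\) as a left-eigenvector problem; a sketch with the same arbitrary toy matrix as above:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# pi P = pi  <=>  P^T pi^T = pi^T: take the eigenvector of P^T for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = v / v.sum()                    # normalize to a probability distribution

print(pi)
print(np.allclose(pi @ P, pi))      # True: pi is stationary
```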
Def. A finite Markov chain is irreducible (connected) if \(\forall x,y\in\Omega,\exists t>0\) s.t. \(P^t(x,y)>0\).
Thm(Fundamental Theorem of Markov Chains). For a connected Markov chain there is a unique stationary distribution \(\pi\). Moreover, if the chain is in addition aperiodic (e.g., \(P(x,x)>0\) for some \(x\)), any initial distribution converges to \(\pi\).
Pf. Suppose \(Ph=h\) for some \(h\in\mathbb{R}^n\). Let \(i_0=\arg\max\limits_i h(i)\) and \(M=h(i_0)\). Since \(h(i_0)=\sum_j P(i_0,j)h(j)\) is a convex combination of values that are all at most \(M\), every \(j\in\Omega\) with \(P(i_0,j)>0\) must satisfy \(h(j)=M\). Repeating this argument and using irreducibility, \(h(i)=M\) for all \(i\). Thus \(Ph=h\) implies \(h=M\vec{1}\), so the right null space of \(P-I\) is one-dimensional and \(\text{rank}(P-I)=n-1\). Hence the left null space is also one-dimensional, which gives the uniqueness of \(\pi\).
Def(Reversed Markov Chain). Given a Markov chain \((P,\Omega)\), let \(\pi\) be a stationary distribution (with \(\pi(x)>0\) for all \(x\)). Let $$\hat{P}(x,y)=\frac{\pi(y)P(y,x)}{\pi(x)}.$$ Then \((\hat{P},\Omega)\) is the reversed Markov chain.
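A small sketch of this construction, using the same toy \(P\) as before, checking that \(\hat{P}\) is a valid transition matrix and that \(\pi\) is also stationary for it:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Stationary distribution of P (left eigenvector for eigenvalue 1).
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = v / v.sum()

# Reversed chain: P_hat(x, y) = pi(y) P(y, x) / pi(x).
P_hat = (P.T * pi[None, :]) / pi[:, None]

print(np.allclose(P_hat.sum(axis=1), 1.0))   # rows sum to 1
print(np.allclose(pi @ P_hat, pi))           # pi is stationary for P_hat too
```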
Markov Chain Monte Carlo
Given a distribution \(\pi\) and a function \(f\) over \(\Omega\) (usually \(\Omega=\{0,1,...,n-1\}^d\)), estimate
\[\underset{\pi}{\mathbb{E}}(f)=\sum\limits_{x\in\Omega}f(x)\pi(x). \]MCMC method:
- Design \(P\) s.t. \(\pi P=\pi\).
- Generate \(X_0,X_1,...,X_T\), where \(X_t\) is sampled from \(X_{t-1}\) according to \(P\).
- Calculate \(\frac{1}{T}\sum\limits_{t=1}^Tf(X_t)\) as an estimate of \(\underset{\pi}{\mathbb{E}}(f)\).
Let \(\hat{\mu}=\frac{1}{T}\sum\limits_{t=1}^T\mu_t\). Since \(X_t\sim \mu_t\), the estimator \(\frac{1}{T}\sum_{t=1}^T f(X_t)\) has expectation \(\underset{\hat{\mu}}{\mathbb{E}}(f)\). If \(\lim\limits_{t\rightarrow\infty}\mu_t=\pi\), then \(\lim\limits_{T\rightarrow\infty}\underset{\hat{\mu}}{\mathbb{E}}(f)=\underset{\pi}{\mathbb{E}}(f)\), so the estimate is asymptotically unbiased.
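A generic sketch of this loop; the transition matrix `P`, the function `f`, the starting state, and the run length `T` are placeholder toy choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain on Omega = {0, 1, 2} and a toy function f; both are placeholders.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
f = np.array([0.0, 1.0, 2.0])

T = 100_000
x = 0                                   # X_0
total = 0.0
for _ in range(T):
    x = rng.choice(3, p=P[x])           # X_t sampled from P(X_{t-1}, .)
    total += f[x]

print(total / T)                        # estimate of E_pi[f]
```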
Consider a graph \(G=(\Omega, E)\), where \(E=\{(x,y)\mid P(x,y)>0\}.\)
To keep each transition cheap to simulate, the degree of vertices in \(G\) should be small.
To ensure fast convergence, the diameter of \(G\) should be small (or \(G\) should satisfy some other structural property that guarantees rapid mixing).
For example, if \(\Omega=\{0,1\}^d\), a possible choice is to let \(G\) be the hypercube, whose vertices have degree \(d\) and whose diameter is \(d\).
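As a tiny illustration of moving along hypercube edges (the dimension \(d\) below is an arbitrary assumption), a step to a uniformly random neighbor just flips one coordinate:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x = np.zeros(d, dtype=int)            # a state in {0,1}^d

def hypercube_step(x):
    """Move to a uniformly random neighbor on the hypercube: flip one bit."""
    y = x.copy()
    y[rng.integers(d)] ^= 1
    return y

print(x, hypercube_step(x))
```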
Metropolis-Hastings Algorithm
\[P(x,y)= \begin{cases} \Phi(x,y)(1\wedge\frac{\pi(y)}{\pi(x)}) & x\neq y\\ 1-\sum\limits_{z\neq x}\Phi(x,z)(1\wedge\frac{\pi(z)}{\pi(x)}) & x=y \end{cases}\]where \(\Phi\) is a symmetric proposal transition matrix (\(\Phi(x,y)=\Phi(y,x)\)) and \(a\wedge b=\min(a,b)\).
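A sketch of one Metropolis-Hastings step for an unnormalized target on \(\{0,1\}^d\), with the symmetric single-bit-flip proposal playing the role of \(\Phi\); the target `pi_unnorm` and the dimension are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

def pi_unnorm(x):
    """Unnormalized target weight pi(x); placeholder: favors states with many 1s."""
    return 2.0 ** x.sum()

def mh_step(x):
    """One Metropolis-Hastings step with the symmetric flip-one-bit proposal."""
    y = x.copy()
    y[rng.integers(d)] ^= 1                     # propose y ~ Phi(x, .)
    # Accept with probability 1 ∧ pi(y)/pi(x); the normalizing constant cancels.
    if rng.random() < min(1.0, pi_unnorm(y) / pi_unnorm(x)):
        return y
    return x

x = np.zeros(d, dtype=int)
for _ in range(1000):
    x = mh_step(x)
print(x)
```

Note that only the ratio \(\pi(y)/\pi(x)\) is ever needed, so \(\pi\) may be known only up to a normalizing constant.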
Lem(Detailed balance equation). For a given transition matrix \(P\), if there exists a distribution \(\pi\) s.t.
\[\pi (x) P(x,y)=\pi(y) P(y,x),\quad\forall x,y \in \Omega , \]then \(\pi\) is a stationary distribution and \(P\) is reversible (i.e., \(\hat{P}=P\)).
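For the Metropolis-Hastings chain above, detailed balance can be checked directly: for \(x\neq y\) (and \(\pi>0\)),
\[\pi(x)P(x,y)=\pi(x)\,\Phi(x,y)\Big(1\wedge\frac{\pi(y)}{\pi(x)}\Big)=\Phi(x,y)\big(\pi(x)\wedge\pi(y)\big),\]
which is symmetric in \(x\) and \(y\) since \(\Phi\) is symmetric; hence \(\pi\) is stationary for \(P\).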
Ex. Given a graph \(G=(V,E)\), let \(f:\{0,1\}^{|V|}\rightarrow\mathbb{Z}^{+}\) map each vertex bipartition \(x\) to the size of the corresponding cut. Find \(\max\limits_x f(x)\) (the MAX-CUT problem).
Solution: Define \(\pi_{\lambda}(x)\propto\lambda^{f(x)}\) for some \(\lambda>1\); the larger \(\lambda\) is, the more \(\pi_\lambda\) concentrates on maximum cuts. Apply the MH algorithm with the symmetric proposal \(\Phi(x,y)=\frac{1}{|V|}\left[\sum\limits_i[x(i)\neq y(i)]=1\right]\), i.e., move a uniformly random vertex to the other side of the cut.
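A sketch of this chain on a small random graph; the graph, \(\lambda\), and the number of steps are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder random graph on n vertices.
n = 20
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < 0.2]

def cut_size(x):
    """f(x): number of edges crossing the cut defined by the 0/1 labeling x."""
    return sum(x[i] != x[j] for i, j in edges)

lam = 2.0                                   # lambda > 1; larger favors bigger cuts
x = rng.integers(0, 2, size=n)
best = cut_size(x)

for _ in range(20_000):
    i = rng.integers(n)                     # proposal: move one random vertex
    y = x.copy()
    y[i] ^= 1
    # MH acceptance: 1 ∧ pi(y)/pi(x) = 1 ∧ lambda^(f(y)-f(x)).
    if rng.random() < min(1.0, lam ** (cut_size(y) - cut_size(x))):
        x = y
        best = max(best, cut_size(x))

print(best, "out of", len(edges), "edges")
```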
Gibbs Sampling
\[P(x,y)= \begin{cases} \frac{1}{d}\cdot\frac{\pi(y)}{\sum\limits_a\pi(x_1,...,x_{i-1},a,x_{i+1},...,x_d)} & \{j\mid x_j\neq y_j\}=\{i\}\\ 0 & otherwise \end{cases}\]Intuitively, choose a coordinate \(i\in\{1,...,d\}\) uniformly at random and resample \(x_i\) according to the conditional distribution of \(\pi\) given the other coordinates (if the resampled value equals \(x_i\), the chain stays at \(x\)). An alternative scheme is to choose \(x_i\) by sequentially scanning from \(x_1\) to \(x_d\).
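A sketch of the random-scan update for a target on \(\{0,1\}^d\); the unnormalized target `pi_unnorm` and the dimension are placeholder assumptions, and for a larger alphabet one would sum the weights over all values \(a\):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

def pi_unnorm(x):
    """Unnormalized target pi(x); placeholder: neighboring coordinates like to agree."""
    return np.exp(np.sum(x[:-1] == x[1:]))

def gibbs_step(x):
    """Random-scan Gibbs: pick coordinate i, resample x_i from pi(. | x_{-i})."""
    i = rng.integers(d)
    weights = []
    for a in (0, 1):                      # conditional over the alphabet {0, 1}
        y = x.copy()
        y[i] = a
        weights.append(pi_unnorm(y))
    p = np.array(weights) / sum(weights)  # conditional distribution of x_i
    x = x.copy()
    x[i] = rng.choice(2, p=p)
    return x

x = rng.integers(0, 2, size=d)
for _ in range(1000):
    x = gibbs_step(x)
print(x)
```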