
Information Theory and Coding



Review: Probability and Statistics

A probability model (or probability experiment or probability system) has three ingredients

  • A sample space \(\Omega\), the set of all possible outcomes (sample points) of the experiment;
  • A set of events \(\mathcal{F}\)
    (1) \(\Omega \in \mathcal{F}\);
    (2) \(A \in \mathcal{F} \Rightarrow \bar{A} \in \mathcal{F}\);
    (3) \(A_i \in \mathcal{F} \Rightarrow \bigcup_i A_i \in \mathcal{F}\).
  • A rule \(P(\cdot)\) for assigning probabilities to events.
    (1) \(P(A) \geqslant 0\);
    (2) \(P(\Omega)=1\);
    (3) \(P\left(\biguplus_i A_i\right)=\sum_i P\left(A_i\right)\) for disjoint events \(A_i\);
    Remark. Probability can be viewed as a set function whose arguments are events and whose values fall in the interval \([0,1]\).

Probability Laws

  • \(P(\varnothing)=0\);
  • \(P(A \cup B)=P(A)+P(B)-P(A B) ; P(A \cup B) \leqslant P(A)+P(B)\); for disjoint events, we have \(P(A \cup B)=P(A)+P(B)\);
  • For any two events \(A\) and \(B\) with \(P(B)>0\), the conditional probability of \(A\), conditional on \(B\), is defined by \(P(A \mid B) \triangleq \frac{P(A B)}{P(B)}\);
  • \(P(A B)=P(A) P(B \mid A)=P(B) P(A \mid B)\); for independent events, we have \(P(A B)=P(A) P(B)\)
  • the law of total probability: \(P(A)=\sum_i P\left(B_i\right) P\left(A \mid B_i\right)\) where \(B_i\) is a partition of \(\Omega\);
  • Bayes' equation: \(P\left(B_i \mid A\right)=\frac{P\left(B_i\right) P\left(A \mid B_i\right)}{P(A)}=\frac{P\left(B_i\right) P\left(A \mid B_i\right)}{\sum_j P\left(B_j\right) P\left(A \mid B_j\right)}\);
  • the sum rule and the product rule;
  • When calculating probabilities, it is strongly suggested to clarify the considered probability model and draw a probability tree.

Random Variable

A general r.v. \(X\) can be specified by its cumulative distribution function (cdf) \(F_X(x) \triangleq \operatorname{Pr}\{X \leq x\}, x \in \mathbb{R}\)
(1) for discrete \(X, F_X(x)=\sum_{x_i \leqslant x} P_X\left(x_i\right)\);
(2) for continuous \(X, F_X(x)=\int_{y \leqslant x} f_X(y) \mathrm{d} y\);
(3) \(F_X(x)\) is a non-decreasing function ranging from 0 to 1 ; \(\lim _{x \rightarrow-\infty} F_X(x)=0\) and \(\lim _{x \rightarrow+\infty} F_X(x)=1\)
(4) \(F_X(x)\) has a step at \(y\) iff \(\operatorname{Pr}\{X=y\}>0\); in this case, the height of the step is \(\operatorname{Pr}\{X=y\}\).

Convergence

Definition 1
A sequence \(X_1, X_2, \ldots\) of random variables is said to converge in distribution to a random variable \(X\) if

\[\lim _{n \rightarrow \infty} F_n(x)=F(x), \]

for each \(x \in \mathbb{R}\) at which \(F\) is continuous. Here \(F_n\) and \(F\) are the cumulative distribution functions of the random variables \(X_n\) and \(X\), respectively.
Definition 2
A sequence \(X_1, X_2, \ldots\) of random variables is said to converge in mean square to \(X\) if

\[\lim _{n \rightarrow \infty} \mathbf{E}\left[\left(X_n-X\right)^2\right]=0, \]

where the operator \(\mathbf{E}\) denotes the expected value.

Definition 3
A sequence \(X_n\) of random variables converges in probability towards \(X\) if for all \(\varepsilon>0\)

\[\lim _{n \rightarrow \infty} P\left\{\left|X_n-X\right|>\varepsilon\right\}=0 \]

Definition 4
To say that the sequence \(X_n\) converges almost surely or almost everywhere or with probability 1 or strongly towards \(X\) means that

\[P\left\{\lim _{n \rightarrow \infty} X_n=X\right\}=1 . \]

Remark.

convergence with probability \(1 \Rightarrow\) convergence in probability;

convergence in mean square \(\Rightarrow\) convergence in probability;

convergence in probability \(\Rightarrow\) convergence in distribution;

Problems
(1) Could two disjoint events \(A\) and \(B\) be independent?

If two events \(A\) and \(B\) with positive probability are disjoint, they cannot be independent; they must be dependent. Why? Disjoint means that \(A\) and \(B\) cannot happen together: if \(A\) occurs then \(B\) does not, and vice versa, so \(P(AB)=0<P(A)P(B)\). (If one of the events has probability zero, disjoint events are trivially independent.)

(2) Give an example to show that \(X, Y\) and \(Z\) are pairwise independent but not mutually independent.

(3) Let \(\underline{X}=\left(X_1, X_2, \cdots, X_n\right)\) be a binary random vector with independent, identically distributed components, where \(P\left(X_i=1\right)=p\) and \(P\left(X_i=0\right)=1-p\). Find an algorithm to compute the probability that \(\underline{X}\) contains at least \(m\) consecutive ones. For example, with \(n=5\) and \(m=2\), the vector \((0,1,1,1,0)\) contains two consecutive ones, whereas \((0,0,0,1,0)\) does not. Of course, we assume that \(m \leq n\). (A dynamic-programming sketch is given after this list.)
(4) How to bound \(P\left(X_1 \geqslant y_1, X_2 \leqslant y_2\right)\) ?
If \(X_1\) and \(X_2\) are independent, we have \(P\left(X_1 \geqslant y_1, X_2 \leqslant y_2\right) \leqslant\) \(\mathbf{E}\left[\exp \left(s_1 X_1\right)\right] \exp \left(-s_1 y_1\right) \cdot \mathbf{E}\left[\exp \left(s_2 X_2\right)\right] \exp \left(-s_2 y_2\right)\) for \(s_1 \geqslant 0\) and \(s_2 \leqslant 0\). If \(X_1\) and \(X_2\) are not independent, \(P\left(X_1 \geqslant y_1, X_2 \leqslant y_2\right) \leqslant \mathbf{E}\left[\exp \left(s_1 X_1+s_2 X_2\right)\right] \exp \left(-s_1 y_1\right) \cdot \exp \left(-s_2 y_2\right)\) for \(s_1 \geqslant 0\) and \(s_2 \leqslant 0\). Why?
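
For problem (3), here is a minimal dynamic-programming sketch, assuming the question asks for the probability of a run of at least \(m\) consecutive ones; the function names and the brute-force check are illustrative, not part of the original notes.

```python
from itertools import product

def prob_m_consecutive_ones(n, m, p):
    """Probability that an i.i.d. Bernoulli(p) vector of length n
    contains a run of at least m consecutive ones (dynamic programming)."""
    # state[j] = P{no run of length m so far, trailing run of ones has length j}
    state = [0.0] * m
    state[0] = 1.0
    for _ in range(n):
        nxt = [0.0] * m
        for j, prob in enumerate(state):
            nxt[0] += prob * (1 - p)          # next symbol is 0: run resets
            if j + 1 < m:
                nxt[j + 1] += prob * p        # next symbol is 1: run grows
            # if j + 1 == m, a run of length m appears -> absorbed (success)
        state = nxt
    return 1.0 - sum(state)

def prob_brute_force(n, m, p):
    """Exhaustive check for small n."""
    total = 0.0
    for x in product([0, 1], repeat=n):
        if '1' * m in ''.join(map(str, x)):
            total += p ** sum(x) * (1 - p) ** (n - sum(x))
    return total

print(prob_m_consecutive_ones(5, 2, 0.3))   # e.g. n = 5, m = 2
print(prob_brute_force(5, 2, 0.3))          # should agree
```

The DP tracks only the length of the trailing run of ones and absorbs all probability mass once a run of length \(m\) appears, so it runs in \(O(nm)\) time instead of enumerating all \(2^n\) vectors.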

Self-information

Let \(X\) be a discrete random variable with pmf \(P_X(x), x \in \mathcal{X}\). For an outcome \(x \in \mathcal{X}\), its self-information is defined by

\[I(x)=\log \frac{1}{P_X(x)} . \]

Is \(I(X)\) a random variable?

About self-information

(1) The smaller the probability, the bigger the self-information.
(2) The surprise you get increases as the probability decreases.
(3) The more surprising the outcome, the more information you get.

Entropy

We consider a simple but very important source, the discrete memoryless source (DMS), \(\mathbf{X}=\left(X_1, X_2, \cdots, X_n, \cdots\right)\), satisfying

\[\begin{aligned} \text { memoryless } & P_{\mathbf{X}}(\mathbf{x})=\prod_{1 \leq t \leq n} P_{X_t}\left(x_t\right) \text { for } n>1 \\ \text { stationary } & P_{X_t}(x) \equiv P_{X_1}(x) \triangleq P_X(x) \text { for } t>1 \end{aligned} \]

Such a random process \(\mathbf{X}\) is also referred to as an independent, identically distributed (i.i.d.) sequence, which is characterized completely by its one-dimensional marginal distribution.
Let \(X\) be a discrete random variable with \(\operatorname{pmf} P_X(x), x \in \mathcal{X}\). The entropy of \(X\), denoted by \(H(X)\) is defined by

\[H(X)=\sum_{x \in \mathcal{X}} P_X(x) \log \frac{1}{P_X(x)}=\mathbf{E}\left[\log \frac{1}{P_X(X)}\right] \]

which is the average of the self-information.

We first introduce the concept of entropy, which is a measure of the uncertainty of a random variable. Let \(X\) be a discrete random variable with alphabet \(\mathcal{X}\) and probability mass function \(p(x)=\operatorname{Pr}\{X=x\}, x \in \mathcal{X}\).
We denote the probability mass function by \(p(x)\) rather than \(p_X(x)\), for convenience. Thus, \(p(x)\) and \(p(y)\) refer to two different random variables and are in fact different probability mass functions, \(p_X(x)\) and \(p_Y(y)\), respectively.

Definition The entropy \(H(X)\) of a discrete random variable \(X\) is defined by

\[H(X)=-\sum_{x \in \mathcal{X}} p(x) \log p(x) \]

remarks.
(1) bits/symbol, nats/symbol; bits/second, nats/second;
(2) The smaller the probability of \(x\), the larger its self-information;
(3) \(H(X)=\mathbf{E}[I(X)]\) is an expectation, which can have at least three intuitive interpretations

  • How difficult to resolve (guess) X?
  • How complex to represent \(X\) ?
  • How surprising to observe \(X\) ?

(4) \(0 \log 0 \triangleq 0\);
(5) \(H(X) \geq 0\)
(6) \(H(X) \leq \log |\mathcal{X}|\). [Hint: use \(\ln z \leq z-1\) ]
(7) \(H(p) \triangleq-p \log p-(1-p) \log (1-p)\);
(8) \(H(X)\) depends not on the values of \(X\) but on \(P_X\).
(9) \(H(X)\) is a concave function of \(P_X\).
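
A small numerical sketch of remarks (5)-(7) above, assuming base-2 logarithms so that entropies are in bits/symbol; the helper `entropy` is illustrative, not part of the notes.

```python
import math

def entropy(pmf):
    """H(X) = sum_x p(x) log2 (1/p(x)), with the convention 0 log 0 = 0."""
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

# (6) H(X) <= log |X|, with equality for the uniform distribution
pmf = [0.5, 0.25, 0.125, 0.125]
print(entropy(pmf))                  # 1.75 bits/symbol
print(math.log2(len(pmf)))           # 2.0 = log |X|, the maximum

# (7) binary entropy function H(p)
H = lambda p: entropy([p, 1 - p])
print(H(0.5), H(0.11))               # 1.0 and about 0.5 bits
```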

Joint Entropy and Conditional Entropy

Definition The joint entropy \(H(X, Y)\) of a pair of discrete random variables \((X, Y)\) with a joint distribution \(p(x, y)\) is defined as

\[H(X, Y)=\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x, y)}, \]

which can also be written as

\[H(X, Y)=\mathbf{E}[-\log p(X, Y)] . \]

Definition If \((X, Y) \sim p(x, y)\), the conditional entropy \(H(Y \mid X)\) is defined as

\[H(Y \mid X)=\mathbf{E}[-\log p(Y \mid X)] \]

According to the definitions, we have

\[H(X, Y)=H(X)+H(Y \mid X) . \]

Total uncertainty \(=\) marginal uncertainty \(+\) conditional uncertainty

More generally, we have

Theorem 1 (Chain rule for entropy)
Let \(X_1, X_2, \cdots, X_n\) be drawn according to \(p\left(x_1, x_2, \cdots, x_n\right)\). Then

\[H\left(X_1, X_2, \cdots, X_n\right)=\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \cdots, X_1\right) \]

Also note that \(I(x, y)=I(x)+I(y \mid x)\)
Prove \(H(X, Y \mid Z)=H(X \mid Z)+H(Y \mid X, Z)\)
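
A quick numerical check of \(H(X, Y)=H(X)+H(Y \mid X)\), using an arbitrary joint pmf chosen only for illustration:

```python
import math
from collections import defaultdict

log2 = math.log2
# an arbitrary joint pmf p(x, y) on {0,1} x {0,1,2}
p_xy = {(0, 0): 0.25, (0, 1): 0.25, (0, 2): 0.0,
        (1, 0): 0.125, (1, 1): 0.125, (1, 2): 0.25}

p_x = defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p

H_XY = -sum(p * log2(p) for p in p_xy.values() if p > 0)
H_X = -sum(p * log2(p) for p in p_x.values() if p > 0)
# H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)
H_Y_given_X = -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

print(H_XY, H_X + H_Y_given_X)   # the two numbers coincide (chain rule)
```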

Theorem 2 (Conditioning reduces entropy)

\[H(X \mid Y) \leq H(X) \]

Proof:

\[\begin{aligned} H(X \mid Y)-H(X) &=-\sum_{x, y} p(x, y) \ln p(x \mid y)+\sum_{x} p(x) \ln p(x) \\ &=\sum_{x, y} p(x, y) \ln \frac{p(x)}{p(x \mid y)} \\ &=\sum_{x, y} p(x, y) \ln \frac{p(x) p(y)}{p(x, y)} . \end{aligned} \]

Applying \(\ln z \leq z-1\) to the terms with \(p(x, y)>0\), we have

\[H(X \mid Y)-H(X) \leq \sum_{x, y} p(x, y)\left(\frac{p(x) p(y)}{p(x, y)}-1\right) \leq 0 \]

Question: under what conditions does equality hold?

Theorem 3
Entropy \(H(X)\) is a concave function of the distribution \(P_X\).
Proof: Let \(X_1 \sim p_1\) and \(X_2 \sim p_2\). Let \(\theta\) be a random variable, independent of \(X_1\) and \(X_2\), such that \(P(\theta=1)=\lambda\) and \(P(\theta=2)=1-\lambda\).
Define the random variable \(Z=X_\theta\).
Then \(Z=X_1\) with probability \(\lambda\) and \(Z=X_2\) with probability \(1-\lambda\), so the distribution of \(Z\) is \(\lambda p_1+(1-\lambda) p_2\).
Since conditioning reduces entropy, we have

\[H(Z) \geq H(Z \mid \theta), \]

or equivalently,

\[H\left(\lambda p_1+(1-\lambda) p_2\right) \geq \lambda H\left(p_1\right)+(1-\lambda) H\left(p_2\right) \]

which proves the concavity of the entropy as a function of the distribution.

Definition 4
The relative entropy or Kullback-Leibler distance between two probability mass functions \(p(x)\) and \(q(x)\) is defined as

\[D(p \| q) \triangleq \sum_x p(x) \log \frac{p(x)}{q(x)} . \]

Equivalently, \(D(p \| q)=\mathbf{E}_p\left[\log \frac{p(X)}{q(X)}\right]\).
[Show that \(D(p \| q)\) is non-negative]
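
A brief numerical sketch related to the bracketed exercise: \(D(p \| q) \geq 0\) with equality only for \(p=q\), and the KL "distance" is not symmetric. The two pmfs below are arbitrary examples.

```python
import math

def kl(p, q):
    """D(p||q) = sum_x p(x) log2 (p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(kl(p, q))          # > 0
print(kl(q, p))          # > 0, but different: D is not a true distance
print(kl(p, p))          # 0, equality iff p = q
```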

Particularly, we can define

\[D(p(x, y) \| p(x) p(y))=\sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}, \]

which is called the mutual information between \(X\) and \(Y\) and is denoted as \(I(X ; Y)\).
Obviously, we have \(I(X ; Y) \geq 0\)

By the definition, we have

\[I(X ; Y)=H(X)-H(X \mid Y) \quad \text{and} \quad I(X ; Y)=H(Y)-H(Y \mid X) . \]

  • Mutual information \(I(X ; Y)\) is the reduction in the uncertainty of \(X\) due to the knowledge of \(Y\).
  • Similarly, mutual information \(I(X ; Y)\) is the reduction in the uncertainty of \(Y\) due to the knowledge of \(X\).
  • How about \(I(X ; X)\) ?
  • If \(X\) and \(Y\) are independent, how about \(I(X ; Y)\) ?
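
A numerical sketch for the bullets above: the hypothetical helper `mutual_information` computes \(I(X ; Y)\) via the KL form \(D(p(x, y) \| p(x) p(y))\), and the checks confirm that \(I(X ; X)=H(X)\) and that \(I(X ; Y)=0\) for independent variables. The joint pmfs are arbitrary examples.

```python
import math
from collections import defaultdict

log2 = math.log2

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        p_x[x] += p
        p_y[y] += p
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

# dependent pair
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(mutual_information(p_xy))                       # > 0

# I(X;X) = H(X): take Y = X
p_xx = {(0, 0): 0.3, (1, 1): 0.7}
print(mutual_information(p_xx), entropy([0.3, 0.7]))  # equal

# independent pair: I(X;Y) = 0
p_ind = {(x, y): px * py for x, px in enumerate([0.3, 0.7])
                         for y, py in enumerate([0.6, 0.4])}
print(mutual_information(p_ind))                      # 0 (up to rounding)
```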

Theorem 2.5.2 (Chain rule for information)

\[I\left(X_1, X_2, \ldots, X_n ; Y\right)=\sum_{i=1}^n I\left(X_i ; Y \mid X_{i-1}, X_{i-2}, \ldots, X_1\right) \]

Proof

\[\begin{aligned} &I\left(X_1, X_2, \ldots, X_n ; Y\right) \\ &\quad=H\left(X_1, X_2, \ldots, X_n\right)-H\left(X_1, X_2, \ldots, X_n \mid Y\right) \\ &\quad=\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1\right)-\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1, Y\right) \\ &\quad=\sum_{i=1}^n I\left(X_i ; Y \mid X_1, X_2, \ldots, X_{i-1}\right) . \end{aligned} \]

Properties

  • Nonnegativity: \(I(X ; Y) \geq 0\).
  • Convex in \(p(y \mid x)\) for fixed \(p(x)\).
  • Concave in \(p(x)\) for fixed \(p(y \mid x)\).
  • Data-processing inequality (see below).

Jensen’s inequality

Theorem 5
If \(f\) is a convex function and \(X\) is a random variable, then

\[\mathbf{E}[f(X)] \geq f(\mathbf{E}[X]) \]

Moreover, if \(f\) is strictly convex, the equality implies that \(X=\mathbf{E}[X]\) with probability 1 (i.e., \(X\) is a constant).

Commonly used convex functions include \(x^2\), \(|x|\), \(\exp (x)\), \(x \log x\) (for \(x>0\)), and \(-\log x\).

Theorem 6
Let \(p\) and \(q\) be two probability mass functions. Then

\[D(p \| q) \geq 0 \]

with equality iff \(p(x)=q(x)\) for all \(x \in \mathcal{X}\).
Theorem 7

\[H(X) \leq \log (|\mathcal{X}|) \]

Convexity of \(\mathrm{KL}\)-distance

Theorem 8
\(D(p \| q)\) is convex in the pair \((p, q)\).
Proof: A direct use of the log sum inequality.

Theorem 2.7.1 (Log sum inequality) For nonnegative numbers, \(a_1, a_2, \ldots, a_n\) and \(b_1, b_2, \ldots, b_n\),

\[\sum_{i=1}^n a_i \log \frac{a_i}{b_i} \geq\left(\sum_{i=1}^n a_i\right) \log \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i} \]

with equality if and only if \(\frac{a_i}{b_i}=\) const.
We again use the convention that \(0 \log 0=0, a \log \frac{a}{0}=\infty\) if \(a>0\) and \(0 \log \frac{0}{0}=0\). These follow easily from continuity.

Proof: Assume without loss of generality that \(a_i>0\) and \(b_i>0\). The function \(f(t)=t \log t\) is strictly convex, since \(f^{\prime \prime}(t)=\frac{1}{t} \log e>0\) for all positive \(t\). Hence by Jensen's inequality, we have

\[\sum \alpha_i f\left(t_i\right) \geq f\left(\sum \alpha_i t_i\right) \]

for \(\alpha_i \geq 0, \sum_i \alpha_i=1\). Setting \(\alpha_i=\frac{b_i}{\sum_{j=1}^n b_j}\) and \(t_i=\frac{a_i}{b_i}\), we obtain

\[\sum_i \frac{a_i}{\sum_j b_j} \log \frac{a_i}{b_i} \geq\left(\sum_i \frac{a_i}{\sum_j b_j}\right) \log \left(\sum_i \frac{a_i}{\sum_j b_j}\right), \]

which is the log sum inequality.
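
A tiny numerical check of the log sum inequality on arbitrary nonnegative numbers:

```python
import math

a = [0.2, 1.5, 3.0]
b = [0.5, 0.5, 2.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
print(lhs >= rhs, lhs, rhs)   # True; equality would require a_i / b_i constant
```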

Theorem 11
Let \((X, Y) \sim p(x, y)=p(x) p(y \mid x)\). The mutual information \(I(X ; Y)\) is a concave function of \(p(x)\) for fixed \(p(y \mid x)\) and a convex function of \(p(y \mid x)\) for fixed \(p(x)\)

Data Processing Inequality

Definition Random variables \(X, Y, Z\) are said to form a Markov chain in that order (denoted by \(X \rightarrow Y \rightarrow Z\) ) if the conditional distribution of \(Z\) depends only on \(Y\) and is conditionally independent of \(X\). Specifically, \(X, Y\), and \(Z\) form a Markov chain \(X \rightarrow Y \rightarrow Z\) if the joint probability mass function can be written as

\[p(x, y, z)=p(x) p(y \mid x) p(z \mid y) . \]

Some simple consequences are as follows:

  • \(X \rightarrow Y \rightarrow Z\) if and only if \(X\) and \(Z\) are conditionally independent given \(Y\). Markovity implies conditional independence because

\[p(x, z \mid y)=\frac{p(x, y, z)}{p(y)}=\frac{p(x, y) p(z \mid y)}{p(y)}=p(x \mid y) p(z \mid y) . \]

This is the characterization of Markov chains that can be extended to define Markov fields, which are \(n\)-dimensional random processes in which the interior and exterior are independent given the values on the boundary.

  • \(X \rightarrow Y \rightarrow Z\) implies that \(Z \rightarrow Y \rightarrow X\). Thus, the condition is sometimes written \(X \leftrightarrow Y \leftrightarrow Z\).
  • If \(Z=f(Y)\), then \(X \rightarrow Y \rightarrow Z\).

Theorem 13
If \(X \rightarrow Y \rightarrow Z\) forms a Markov chain, then \(I(X ; Y) \geq I(X ; Z)\).
Proof: We have

\[I(X ; Y, Z)=I(X ; Z)+I(X ; Y \mid Z)=I(X ; Y)+I(X ; Z \mid Y) \]

Since \(I(X ; Z \mid Y)=0\) (Markovity) and \(I(X ; Y \mid Z) \geq 0\) (non-negativity of mutual information), we obtain \(I(X ; Y) \geq I(X ; Z)\).

Proof: By the chain rule, we can expand mutual information in two different ways:

\[\begin{aligned} I(X ; Y, Z) & =I(X ; Z)+I(X ; Y \mid Z) \\ & =I(X ; Y)+I(X ; Z \mid Y) . \end{aligned} \]

Since \(X\) and \(Z\) are conditionally independent given \(Y\), we have \(I(X ; Z \mid Y)=0\). Since \(I(X ; Y \mid Z) \geq 0\), we have

\[I(X ; Y) \geq I(X ; Z) . \]

We have equality if and only if \(I(X ; Y \mid Z)=0\) (i.e., \(X \rightarrow Z \rightarrow Y\) forms a Markov chain). Similarly, one can prove that \(I(Y ; Z) \geq I(X ; Z)\).
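
A numerical sketch of the data-processing inequality on a toy Markov chain \(X \rightarrow Y \rightarrow Z\): \(X\) is Bernoulli(1/2), \(Y\) is \(X\) passed through a binary symmetric channel, and \(Z\) is \(Y\) passed through a second one. All parameters and helper names are illustrative assumptions.

```python
import math
from collections import defaultdict

def mutual_information(p_xy):
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        p_x[x] += p
        p_y[y] += p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

def bsc(eps):
    """Binary symmetric channel: p(b|a) with crossover probability eps."""
    return {(a, b): (1 - eps if a == b else eps) for a in (0, 1) for b in (0, 1)}

p_x = {0: 0.5, 1: 0.5}
ch1, ch2 = bsc(0.1), bsc(0.2)          # X -> Y and Y -> Z

# joint pmfs under the Markov chain p(x, y, z) = p(x) p(y|x) p(z|y)
p_xy = {(x, y): p_x[x] * ch1[x, y] for x in p_x for y in (0, 1)}
p_xz = defaultdict(float)
for x in p_x:
    for y in (0, 1):
        for z in (0, 1):
            p_xz[x, z] += p_x[x] * ch1[x, y] * ch2[y, z]

print(mutual_information(p_xy))        # I(X;Y)
print(mutual_information(dict(p_xz)))  # I(X;Z), smaller: processing loses information
```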

Fano's Inequality

Theorem 2.10.1 (Fano's Inequality) For any estimator \(\hat{X}\) such that \(X \rightarrow Y \rightarrow \hat{X}\), with \(P_e=\operatorname{Pr}(X \neq \hat{X})\), we have

\[H\left(P_e\right)+P_e \log |\mathcal{X}| \geq H(X \mid \hat{X}) \geq H(X \mid Y) . \]

This inequality can be weakened to

\[1+P_e \log |\mathcal{X}| \geq H(X \mid Y) \]

or

\[P_e \geq \frac{H(X \mid Y)-1}{\log |\mathcal{X}|} . \]

Proof: Define an error random variable,

\[E= \begin{cases}1 & \text { if } \hat{X} \neq X \\ 0 & \text { if } \hat{X}=X\end{cases} \]

Then, using the chain rule for entropies to expand \(H(E, X \mid \hat{X})\) in two different ways, we have

\[\begin{aligned} H(E, X \mid \hat{X}) & =H(X \mid \hat{X})+\underbrace{H(E \mid X, \hat{X})}_{=0} \\ & =\underbrace{H(E \mid \hat{X})}_{\leq H\left(P_e\right)}+\underbrace{H(X \mid E, \hat{X})}_{\leq P_e \log |\mathcal{X}|} . \end{aligned} \]

Since conditioning reduces entropy, \(H(E \mid \hat{X}) \leq H(E)=H\left(P_e\right)\). Now since \(E\) is a function of \(X\) and \(\hat{X}\), the conditional entropy \(H(E \mid X, \hat{X})\) is equal to 0 . Also, since \(E\) is a binary-valued random variable, \(H(E)=\) \(H\left(P_e\right)\). The remaining term, \(H(X \mid E, \hat{X})\), can be bounded as follows:

\[\begin{aligned} H(X \mid E, \hat{X}) & =\operatorname{Pr}(E=0) H(X \mid \hat{X}, E=0)+\operatorname{Pr}(E=1) H(X \mid \hat{X}, E=1) \\ & \leq\left(1-P_e\right) 0+P_e \log |\mathcal{X}| \end{aligned} \]

since given \(E=0, X=\hat{X}\), and given \(E=1\), we can upper bound the conditional entropy by the log of the number of possible outcomes. Combining these results, we obtain

\[H\left(P_e\right)+P_e \log |\mathcal{X}| \geq H(X \mid \hat{X}) . \]

By the data-processing inequality, we have \(I(X ; \hat{X}) \leq I(X ; Y)\) since \(X \rightarrow Y \rightarrow \hat{X}\) is a Markov chain, and therefore \(H(X \mid \hat{X}) \geq H(X \mid Y)\). Thus, we have

\[H\left(P_e\right)+P_e \log |\mathcal{X}| \geq H(X \mid \hat{X}) \geq H(X \mid Y) \]

A slightly stronger version: when the alphabet of \(X\) is small, the following provides a tighter bound.

\[\begin{aligned} H(X \mid E, \hat{X}) & =\operatorname{Pr}(E=0) H(X \mid \hat{X}, E=0)+\operatorname{Pr}(E=1) H(X \mid \hat{X}, E=1) \\ & \leq\left(1-P_e\right) 0+P_e \log (|\mathcal{X}| -1) \end{aligned} \]

\[H\left(P_e\right)+P_e \log ( |\mathcal{X}|-1) \geq H(X \mid \hat{X}) . \]

\[H\left(P_e\right)+P_e \log (|\mathcal{X}|-1) \geq H(X \mid \hat{X}) \geq H(X \mid Y) \]
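
A small numerical check of the sharper bound \(H\left(P_e\right)+P_e \log (|\mathcal{X}|-1) \geq H(X \mid \hat{X})\) on an arbitrary joint pmf of \((X, \hat{X})\) over a ternary alphabet; all numbers are illustrative, and this first inequality of Fano holds for any such joint pmf.

```python
import math
from collections import defaultdict

log2 = math.log2
# arbitrary joint pmf of (X, X_hat) on {0,1,2} x {0,1,2}
p = {(0, 0): 0.30, (0, 1): 0.05, (0, 2): 0.05,
     (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.05,
     (2, 0): 0.05, (2, 1): 0.05, (2, 2): 0.15}

P_e = sum(pr for (x, xh), pr in p.items() if x != xh)          # Pr{X != X_hat}

p_xh = defaultdict(float)
for (x, xh), pr in p.items():
    p_xh[xh] += pr
H_X_given_Xh = -sum(pr * log2(pr / p_xh[xh]) for (x, xh), pr in p.items() if pr > 0)

H_Pe = -P_e * log2(P_e) - (1 - P_e) * log2(1 - P_e)
bound = H_Pe + P_e * log2(3 - 1)                               # |X| = 3
print(bound >= H_X_given_Xh, bound, H_X_given_Xh)              # True
```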

Sufficient Statistics

  • Suppose we have a family of probability mass functions \(\left\{f_\theta(x)\right\}\) indexed by \(\theta\).
  • Let \(X\) be a sample from a distribution in this family.
  • Let \(T(X)\) be any statistic, i.e., a function of \(X\) (e.g., the sample mean or the sample variance).
    We have the following Markov chain

\[\theta \rightarrow X \rightarrow T(X) \]

According to the data processing inequality, we have

\[I(\theta ; T(X)) \leq I(\theta ; X) \]

If equality holds, there is no information loss; such a \(T(X)\) is called a sufficient statistic for \(\theta\).

Theorem 16 (Coding Theorem for DMS)
Let \(R>H(X)\). Then there exist fixed-length codes \(\left(\phi_n, \psi_n\right)\) such that \(R_n \leq R\) but \(\epsilon_n \rightarrow 0\). In the case when variable-length codes are allowed, we can make \(\epsilon_n=0\).

Theorem 17 (Weak Law of Large Numbers)
If \(X_1, X_2, \ldots\) are i.i.d. \(\sim P_X(x), x \in \mathcal{X}\), then \(-\frac{1}{n} \log P_{\mathbf{X}}(\mathbf{X}) \rightarrow H(X)\) in probability.

Definition 18 (Typical Set)
The typical set \(A_\epsilon^{(n)}\) with respect to \(P_X(x)\) is the set of sequences \(\mathbf{x}=\left(x_1, x_2, \ldots, x_n\right) \in \mathcal{X}^n\) with the following property:

\[2^{-n(H(X)+\epsilon)} \leq P_{\mathbf{X}}(\mathbf{x}) \leq 2^{-n(H(X)-\epsilon)} \]

"The typical set has probability nearly 1 , all elements of the typical set are nearly equiprobable, and the number of elements in the typical set is nearly \(2^{n H}\) ".
Almost all events are almost equally surprising!
Theorem 19 (Asymptotic Equipartition Property (AEP))

  • If \(\mathbf{x} \in A_\epsilon^{(n)}\), then \(H(X)-\epsilon \leq-\frac{1}{n} \log P_{\mathbf{X}}(\mathbf{x}) \leq H(X)+\epsilon\)
  • For sufficiently large \(n, \operatorname{Pr}\left\{A_\epsilon^{(n)}\right\} \geq 1-\epsilon\).
  • \(\left|A_\epsilon^{(n)}\right| \leq 2^{n(H(X)+\epsilon)}\), where \(|A|\) denotes the number of elements in the set \(A\).
  • For sufficiently large \(n,\left|A_\epsilon^{(n)}\right| \geq(1-\epsilon) 2^{n(H(X)-\epsilon)}\).
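
A small numerical illustration of the AEP for a Bernoulli(\(p\)) DMS: enumerate all \(2^n\) sequences, collect the typical set, and compare its total probability and size with the bounds above. The parameters \(n\), \(p\), \(\epsilon\) are arbitrary, and at such a short block length the typical set's probability is still far from 1.

```python
import math
from itertools import product

p, n, eps = 0.2, 16, 0.1
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)     # entropy of the source

typical_prob, typical_count = 0.0, 0
for x in product([0, 1], repeat=n):
    k = sum(x)
    prob = p ** k * (1 - p) ** (n - k)
    if 2 ** (-n * (H + eps)) <= prob <= 2 ** (-n * (H - eps)):
        typical_prob += prob
        typical_count += 1

print(H)                                    # about 0.722 bits/symbol
print(typical_prob)                         # Pr{A_eps^(n)}: approaches 1 only for larger n
print(typical_count, 2 ** (n * (H + eps)))  # |A_eps^(n)| and its upper bound 2^{n(H+eps)}
```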

Outline of the proof:
Encoding:
Firstly, we divide all sequences in \(\mathcal{X}^n\) into two sets: the typical set \(A_\epsilon^{(n)}\) and its complements \(\mathcal{X}^n-A_\epsilon^{(n)}\).
Secondly, we order all sequences in each set according to some fixed order (e.g., lexicographic). Then we can represent each sequence in \(A_\epsilon^{(n)}\) by giving its index in the set, which requires no more than \(\lceil n(H(X)+\epsilon)\rceil\) bits.
Similarly, we can represent each sequence in \(\mathcal{X}^n-A_\epsilon^{(n)}\) by no more than \(\lceil n \log |\mathcal{X}|\rceil\) bits.
Thirdly, prefix all sequences in the typical set \(A_\epsilon^{(n)}\) and its complement by a 0 and a 1 , respectively.

Outline of the proof:
Decoding: How to decode?
Error analysis:
Obviously, this is a one-to-one mapping from the set \(\mathcal{X}^n\) into binary strings. Therefore the mapping is invertible and the decoding error \(\epsilon_n\) is zero.

Coding rate:

\[\begin{aligned} R_n &=\frac{1}{n} \sum_{\mathbf{x}} P_{\mathbf{X}}(\mathbf{x}) \ell\left(\phi_n(\mathbf{x})\right) \\ &=\frac{1}{n}\left[\sum_{\mathbf{x} \in A_\epsilon^{(n)}} P_{\mathbf{X}}(\mathbf{x}) \ell\left(\phi_n(\mathbf{x})\right)+\sum_{\mathbf{x} \in \mathcal{X}^n-A_\epsilon^{(n)}} P_{\mathbf{X}}(\mathbf{x}) \ell\left(\phi_n(\mathbf{x})\right)\right] \\ & \leqslant \frac{1}{n}\left[\sum_{\mathbf{x} \in A_\epsilon^{(n)}} P_{\mathbf{X}}(\mathbf{x})(n(H(X)+\epsilon)+2)+\sum_{\mathbf{x} \in \mathcal{X}^n-A_\epsilon^{(n)}} P_{\mathbf{X}}(\mathbf{x})(n \log |\mathcal{X}|+2)\right] \\ &=\frac{1}{n}\left[\operatorname{Pr}\left\{A_\epsilon^{(n)}\right\}(n(H(X)+\epsilon)+2)+\operatorname{Pr}\left\{\mathcal{X}^n-A_\epsilon^{(n)}\right\}(n \log |\mathcal{X}|+2)\right] \\ & \leqslant \frac{1}{n}[n(H(X)+\epsilon)+\epsilon n \log |\mathcal{X}|+2]=H(X)+\epsilon^{\prime}, \end{aligned} \]

where \(\epsilon^{\prime}=\epsilon+\epsilon \log |\mathcal{X}|+\frac{2}{n}\). Since \(R>H(X)\), we can choose \(\epsilon\) and \(n\) so that \(H(X)+\epsilon^{\prime} \leq R\); therefore \(R_n \leq H(X)+\epsilon^{\prime} \leq R\).

Problems:
(1) Encoding: How to encode? Variable-length or fixed-length?
(2) Rate: What is the rate? Does it depend on the block-length of the source?
(3) Decoding: How to decode?
(4) Error analysis: Is it zero or not?
(5) What knowledge do we need for the source?

Outline of the proof:
Imagine that we have \(2^{nR}\) bins indexed by \(\left\{1,2, \ldots, 2^{nR}\right\}\) (\(nR\) bits). All the sequences \(\mathbf{x} \in \mathcal{X}^n\) are thrown at random into these bins. The bin assignment is then disclosed to the decoder. That is, \(\phi_n: \mathcal{X}^n \rightarrow\left\{1,2, \ldots, 2^{nR}\right\}\).
Encoding:
Having observed a sequence \(\mathbf{x} \in \mathcal{X}^n\), we send \(\phi_n(\mathbf{x})\) (a many-to-one mapping!).
Decoding:
To decode \(\phi_n(\mathbf{x})\), we look for a typical sequence in the bin indexed by \(\phi_n(\mathbf{x})\). If there is one and only one typical sequence \(\hat{\mathbf{x}}\) in this bin, we declare \(\hat{\mathbf{x}}\) to be the estimate of \(\mathbf{x}\). That is, \(\psi_n:\left\{1,2, \ldots, 2^{nR}\right\} \rightarrow \mathcal{X}^n\).
Coding rate:

\[R_n=R \text {. } \]

Decoding error:
Let \(\epsilon>0\) be an arbitrarily small real number such that \(R>H(X)+\epsilon\).

\[\begin{aligned} \epsilon_n & =\operatorname{Pr}\left\{\psi_n\left(\phi_n(\mathbf{X})\right) \neq \mathbf{X}\right\} \\ & \leqslant \operatorname{Pr}\left\{\mathbf{X} \notin A_\epsilon^{(n)}\right\}+\sum_{\mathbf{x} \in A_\epsilon^{(n)}} P_{\mathbf{X}}(\mathbf{x}) \sum_{\substack{\mathbf{x}^{\prime} \in A_\epsilon^{(n)} \\ \mathbf{x}^{\prime} \neq \mathbf{x}}} \operatorname{Pr}\left\{\phi_n\left(\mathbf{x}^{\prime}\right)=\phi_n(\mathbf{x})\right\} \\ & \leqslant \epsilon+\sum_{\mathbf{x}^{\prime} \in A_\epsilon^{(n)}} \sum_{\mathbf{x} \in A_\epsilon^{(n)}} P_{\mathbf{X}}(\mathbf{x}) 2^{-n R} \\ & =\epsilon+\sum_{\mathbf{x}^{\prime} \in A_\epsilon^{(n)}} 2^{-n R} \sum_{\mathbf{x} \in A_\epsilon^{(n)}} P_{\mathbf{X}}(\mathbf{x}) \\ & \leqslant \epsilon+\sum_{\mathbf{x}^{\prime} \in A_\epsilon^{(n)}} 2^{-n R} \\ & \leqslant \epsilon+2^{-n[R-(H(X)+\epsilon)]} \leq 2 \epsilon \end{aligned} \]

for sufficiently large \(n\). Therefore, \(\epsilon_n \rightarrow 0\).
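
A Monte Carlo style sketch of random binning for a Bernoulli(\(p\)) source, estimating the two terms of the union bound separately; the block length, rate, and helper names are illustrative assumptions.

```python
import math, random
from itertools import product

random.seed(0)
p, n, eps = 0.2, 14, 0.1
R = 0.95                 # rate, chosen so that R > H(X) + eps (H(0.2) ~ 0.72)
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
num_bins = 2 ** math.ceil(n * R)

def prob(x):
    return p ** sum(x) * (1 - p) ** (n - sum(x))

def is_typical(x):
    return 2 ** (-n * (H + eps)) <= prob(x) <= 2 ** (-n * (H - eps))

# random binning phi_n: every sequence is thrown into a random bin
phi = {x: random.randrange(num_bins) for x in product([0, 1], repeat=n)}

typical = [x for x in phi if is_typical(x)]
bins_of_typical = {}
for x in typical:
    bins_of_typical.setdefault(phi[x], []).append(x)

# the two terms of the union bound on the decoding error
p_not_typical = 1 - sum(prob(x) for x in typical)
p_collision = sum(prob(x) for xs in bins_of_typical.values() if len(xs) > 1
                  for x in xs)
print(p_not_typical)   # still large at this toy n; vanishes as n grows (WLLN)
print(p_collision)     # already tiny, because R > H(X) + eps
```

At this toy block length the dominant term is \(\operatorname{Pr}\left\{\mathbf{X} \notin A_\epsilon^{(n)}\right\}\), which vanishes only as \(n\) grows, while the collision term (two typical sequences in the same bin) is already negligible because \(R>H(X)+\epsilon\).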

Problems:
(1) Encoding: How to encode? Variable-length or fixed-length?
(2) Rate: What is the rate? Does it depend on the block-length of the source?
(3) Decoding: How to decode?
(4) Error analysis: Is it zero or not?
(5) What knowledge do we need for the source?

Lecture 02 ends here; Chapters 2 and 3 of Elements of Information Theory are covered.

Lecture 03 begins.

Theorem 1
Let \(R>H(X)\). Then there exist fixed-length codes \(\left(\phi_n, \psi_n\right)\) such that \(R_n \leqslant R\) but \(\epsilon_n \rightarrow 0\). In the case when variable-length codes are allowed, we can make \(\epsilon_n=0\).
We prove this theorem by the following three methods:
(1) typical sequence,
(2) random binning,
(3) types.

Some Exercises

2.2 Entropy of functions.

Let \(X\) be a random variable taking on a finite number of values. What is the (general) inequality relationship of \(H(X)\) and \(H(Y)\) if
(a) \(Y=2^X\) ?
(b) \(Y=\cos X\) ?

\(H(X) \geq H(Y)\) if \(Y\) is a function of \(X\). This holds because a function can lose information when it is not a one-to-one mapping. In (a), \(Y=2^X\) is one-to-one, so \(H(Y)=H(X)\); in (b), \(\cos X\) may map different values of \(X\) to the same value, so in general \(H(Y) \leq H(X)\).

2.6 Conditional mutual information vs. unconditional mutual information.

Give examples of joint random variables \(X, Y\), and \(Z\) such that
(a) \(I(X ; Y \mid Z)<I(X ; Y)\).
(b) \(I(X ; Y \mid Z)>I(X ; Y)\).

(a) Let \(X=Y=Z\), all uniform binary. Then \(I(X ; Y \mid Z)=H(X \mid Z)-H(X \mid Y, Z)=0\), while \(I(X ; Y)=H(X)-H(X \mid Y)=1-0=1>0\).

A point that may seem counter-intuitive: \(I(X ; Y \mid Z)=0\) should indeed be interpreted as "given \(Z\), \(X\) and \(Y\) are independent," just as \(I(X ; Y)=0\) if and only if \(X\) and \(Y\) are independent.

Interpretation by ChatGPT

In the case where \(X=Y=Z\), the mutual information \(I(X;Y\mid Z)\) is zero because the two random variables \(X\) and \(Y\) are completely determined by the value of the random variable \(Z\). In other words, knowing the value of \(Z\) gives us complete information about both \(X\) and \(Y\), so there is no additional information to be gained by considering \(X\) and \(Y\) together. This means that the two random variables are independent given the value of \(Z\), so \(I(X;Y\mid Z)=0\).

(b) Let \(X\) and \(Y\) be independent and \(Z=X+Y\). Intuitively, conditioning on \(Z\) destroys the independence between \(X\) and \(Y\): once \(Z\) is known, \(Y\) is determined by \(X\), so \(I(X ; Y \mid Z)>0=I(X ; Y)\). (A numerical check is given below.)
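
A numerical check of (b), with \(X, Y\) i.i.d. Bernoulli(1/2) and \(Z=X+Y\): it gives \(I(X ; Y)=0\) but \(I(X ; Y \mid Z)=0.5\) bit. The helper functions are a minimal sketch, not from the notes.

```python
import math
from collections import defaultdict

log2 = math.log2

def mutual_information(p_xy):
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), pr in p_xy.items():
        p_x[x] += pr
        p_y[y] += pr
    return sum(pr * log2(pr / (p_x[x] * p_y[y]))
               for (x, y), pr in p_xy.items() if pr > 0)

# X, Y i.i.d. Bernoulli(1/2), Z = X + Y
p_xyz = {(x, y, x + y): 0.25 for x in (0, 1) for y in (0, 1)}

# I(X;Y): marginalize out Z
p_xy = defaultdict(float)
for (x, y, z), pr in p_xyz.items():
    p_xy[x, y] += pr
print(mutual_information(dict(p_xy)))          # 0.0

# I(X;Y|Z) = sum_z p(z) I(X;Y | Z = z)
p_z = defaultdict(float)
for (x, y, z), pr in p_xyz.items():
    p_z[z] += pr
I_cond = 0.0
for z0, pz in p_z.items():
    cond = {(x, y): pr / pz for (x, y, z), pr in p_xyz.items() if z == z0}
    I_cond += pz * mutual_information(cond)
print(I_cond)                                  # 0.5 bit > 0
```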

2.28 Mixing increases entropy.

Show that the entropy of the probability distribution \(\left(p_1, \ldots, p_i, \ldots, p_j, \ldots, p_m\right)\) is less than the entropy of the distribution \(\left(p_1, \ldots, \frac{p_i+p_j}{2}, \ldots, \frac{p_i+p_j}{2}, \ldots, p_m\right)\). Show that in general any transfer of probability that makes the distribution more uniform increases the entropy.

2.30 Maximum entropy.

Find the probability mass function \(p(x)\) that maximizes the entropy \(H(X)\) of a nonnegative integer-valued random variable \(X\) subject to the constraint

\[E X=\sum_{n=0}^{\infty} n p(n)=A \]

for a fixed value \(A>0\). Evaluate this maximum \(H(X)\).

Solution. We tackle this problem using Lagrange multipliers.

\[\nabla f = \lambda \nabla g + \mu \nabla h \]

where

\[f=\sum_i p_i \log p_i, \quad g=\sum_i p_i, \quad h=\sum_i i p_i, \]

which becomes

\[\left\{\begin{array}{l} 1+\log p_i=\lambda+i \mu \\ \sum p_{i}=1 \\ \sum i p_{i}=A \end{array}\right. \]

we can rewrite \(1+\log p_i=\lambda+i \mu\) in the form

\[p_i=e^{\lambda-1}\left(e^{\mu}\right)^{i} . \]

To simplify the notation, let

\[\alpha=e^{\lambda-1}, \quad \beta=e^{\mu}, \]

so the remaining constraints become

\[\left\{\begin{array}{l} \sum_{i=0}^{\infty} \alpha \beta^{i}=1 \\ \sum_{i=0}^{\infty} i \alpha \beta^{i}=A \end{array}\right. \]

which implies that,

\[\begin{aligned} \beta & =\frac{A}{A+1} \\ \alpha & =\frac{1}{A+1} \end{aligned} \]

So the entropy maximizing distribution is,

\[p_i=\frac{1}{A+1}\left(\frac{A}{A+1}\right)^i . \]

Plugging these values into the expression for the maximum entropy,

\[-\log \alpha-A \log \beta=(A+1) \log (A+1)-A \log A \]
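
A quick numerical check of this result for an arbitrary \(A\): build the maximizing pmf, confirm its mean is \(A\), compare its entropy with \((A+1) \log (A+1)-A \log A\), and compare against another pmf with the same mean.

```python
import math

A = 3.0
alpha, beta = 1 / (A + 1), A / (A + 1)
N = 2000                                       # truncate the infinite sum

p = [alpha * beta ** i for i in range(N)]
mean = sum(i * pi for i, pi in enumerate(p))
H = -sum(pi * math.log2(pi) for pi in p if pi > 0)
H_formula = (A + 1) * math.log2(A + 1) - A * math.log2(A)
print(mean)                                    # ~3.0 = A
print(H, H_formula)                            # both ~3.245 bits

# any other nonnegative-integer pmf with mean A has smaller entropy,
# e.g. the uniform distribution on {0, 1, ..., 6} (mean 3):
q = [1 / 7] * 7
print(-sum(qi * math.log2(qi) for qi in q))    # log2(7) ~ 2.81 < 3.245
```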

2.33 Fano's inequality.

Let \(\operatorname{Pr}(X=i)=p_i, i=1,2, \ldots, m\), and let \(p_1 \geq p_2 \geq p_3 \geq \cdots \geq p_m\). The minimal probability of error predictor of \(X\) is \(\hat{X}=1\), with resulting probability of error \(P_e=\) \(1-p_1\). Maximize \(H(\mathbf{p})\) subject to the constraint \(1-p_1=P_e\) to find a bound on \(P_e\) in terms of \(H\). This is Fano's inequality in the absence of conditioning.

Solution (Thomas M. Cover Joy A. Thomas): (Fano's Inequality.) The minimal probability of error predictor when there is no information is \(\hat{X}=1\), the most probable value of \(X\). The probability of error in this case is \(P_e=1-p_1\). Hence if we fix \(P_e\), we fix \(p_1\). We maximize the entropy of \(X\) for a given \(P_e\) to obtain an upper bound on the entropy for a given \(P_e\). The entropy,

\[\begin{aligned} H(\mathbf{p}) & =-p_1 \log p_1-\sum_{i=2}^m p_i \log p_i \\ & =-p_1 \log p_1-\sum_{i=2}^m P_e \frac{p_i}{P_e} \log \frac{p_i}{P_e}-P_e \log P_e \\ & =H\left(P_e\right)+P_e H\left(\frac{p_2}{P_e}, \frac{p_3}{P_e}, \ldots, \frac{p_m}{P_e}\right) \\ & \leq H\left(P_e\right)+P_e \log (m-1) \end{aligned} \]

since the maximum of \(H\left(\frac{p_2}{P_e}, \frac{p_3}{P_e}, \ldots, \frac{p_m}{P_e}\right)\) is attained by a uniform distribution. Hence any \(X\) that can be predicted with a probability of error \(P_e\) must satisfy

\[H(X) \leq H\left(P_e\right)+P_e \log (m-1) \]

which is the unconditional form of Fano's inequality. We can weaken this inequality to obtain an explicit lower bound for \(P_e\)

\[P_e \geq \frac{H(X)-1}{\log (m-1)} \]

Solution (alternative argument): Since \(\hat{X}=1\) with probability 1, conditioning on \(\hat{X}\) reveals nothing about \(X\):

\[\begin{aligned} H(X \mid \hat{X}) & =\operatorname{Pr}\{\hat{X}=1\} \cdot H(X \mid \hat{X}=1) \\ & =1 \cdot H(X \mid \hat{X}=1) \\ & =H(X). \end{aligned} \]

Equivalently,

\[I(X ; \hat{X})=H(\hat{X})-H(\hat{X} \mid X)=0-0=0, \]

where \(H(\hat{X})=0\) because \(\hat{X}=1\) with probability 1, so \(H(X \mid \hat{X})=H(X)-I(X ; \hat{X})=H(X)\).

Then

\[H(X \mid \hat{X})=H(X) \leq H\left(P_e\right)+P_e \log (m-1), \]

which agrees with the bound obtained above.

2.46 Axiomatic definition of entropy (Difficult).

If we assume certain axioms for our measure of information, we will be forced to use a logarithmic measure such as entropy. Shannon used this to justify his initial definition of entropy. In this book we rely more on the other properties of entropy rather than its axiomatic derivation to justify its use. The following problem is considerably more difficult than the other problems in this section.
If a sequence of symmetric functions \(H_m\left(p_1, p_2, \ldots, p_m\right)\) satisfies the following properties:

  • Normalization: \(H_2\left(\frac{1}{2}, \frac{1}{2}\right)=1\),
  • Continuity: \(H_2(p, 1-p)\) is a continuous function of \(p\),
  • Grouping: \(H_m\left(p_1, p_2, \ldots, p_m\right)=H_{m-1}\left(p_1+p_2, p_3, \ldots, p_m\right)+\left(p_1+p_2\right) H_2\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\right)\),

then prove that \(H_m\) must be of the form

\[H_m\left(p_1, p_2, \ldots, p_m\right)=-\sum_{i=1}^m p_i \log p_i, \quad m=2,3, \ldots \]

There are various other axiomatic formulations which result in the same definition of entropy. See, for example, the book by Csiszár and Körner [149].

Reference

  • Cover, T. M. and Thomas, J. A., Elements of Information Theory, 2nd ed., Wiley-Interscience, 2006.
