UNIT II
• Part I Introduction to Information Theory; Discrete Memoryless Sources; Information Measures; Source Coding Theorem;
• Source Coding Techniques; Channel Capacity; Channel Coding and Channel Capacity Theorems
References
• John G. Proakis and Masoud Salehi, Digital Communications, 5th Edition [Chapter 6]
Other References:
• Thomas M. Cover and Joy A. Thomas, Elements of Information Theory
• Simon Haykin, Communication Systems, 4th Edition [Chapter 9]
Introduction
• Information Theory: deals with the fundamental limits of communication:
Compression & Transmission
• Compression: Source Coding
• Transmission: Communication channel modelling and channel capacity
Introduction
• The objective of source coding is to provide an encoding scheme that transmits the information with a reduced data size
• Question: What are the information sources?
• Question: What is the information content of a source?
• Question: What is the measure of information?
Source
• Communication systems are designed to transmit the information generated by a source to some destination
• Information sources may take a variety of different forms
• For example, in radio broadcasting, the source is generally an audio source (voice or music)
• In TV broadcasting, the information source is a video source whose output is a moving image
• Outputs of these sources are analog signals and, hence, the sources are called analog sources
• In contrast, computers and storage devices, such as magnetic or optical disks, produce discrete outputs (usually binary or ASCII characters), and hence are called discrete sources
Mathematical Models for Information Sources
• Any information source produces an output that is random; i.e., the source output is characterized in statistical terms.
• Otherwise, if the source output were known exactly, there would be no need to transmit it.
• Both discrete and analog information sources, as well as mathematical models for each type of source, will be discussed
Mathematical Models for Information Sources
• The simplest type of a discrete source is one that emits a sequence of letters selected from a finite alphabet.
• For example, a binary source emits a binary sequence of the form 100101110 . . . , where the alphabet consists of the two letters {0, 1}
Mathematical Models for Information Sources
• To construct a mathematical model for a discrete source, we assume that each letter in the alphabet {x1, x2, . . . , xL} has a given probability pk of occurrence.
• Try to relate {x1, x2, . . . , xL} to the context of a microphone output, an image, a keyboard output, etc.
Mathematical Models for Information Sources
• Discrete Memoryless Source (DMS):
• Assume that the output sequence from the source is statistically independent.
• That is, the current output letter is statistically independent of all past and future outputs.
• A source whose output satisfies the condition of statistical independence among output letters is said to be memoryless
• If the source is discrete, it is called a Discrete Memoryless Source (DMS)
• The mathematical model for a DMS is a sequence of iid random variables {Xi}
Mathematical Models for Information Sources
• If the output of the discrete source is statistically dependent, such as English text, we may construct a mathematical model based on statistical stationarity
• By definition, a discrete source is said to be stationary if the joint probabilities for any arbitrary length sequence of source outputs are invariant under a shift in the time origin.
Mathematical Models for Information Sources
• Analog Sources: An analog source has an output waveform x(t) that is a sample function of a stochastic process X(t).
• We assume that X(t) is a stationary stochastic process with defined autocorrelation function
• By applying the sampling theorem, we may convert the output of an analog source to an equivalent discrete-time source
• We note that the output samples from the stationary sources are generally continuous, and hence they cannot be represented in digital form without some loss of precision
Measure of Information
• What is the measure of information?
Hartley's Measure (1928)
• The information provided by observation of a discrete random variable X is
$I(X) = \log_b n$
where n is the number of possible values of X
• Limitation of Hartley's measure: it ignores the probability of each event in X
Measure of Information: Shannon's Measure
• If the i-th possible value $x_i$ of X has probability $p_i$, then the Hartley information $\log(1/p_i)$ should be weighted by $p_i$, giving
$\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$
as the amount of information provided by X.
• This is Shannon's measure
Entropy
• The uncertainty (entropy) of a discrete random variable X is given by
$H(X) = -\sum_{x \in \mathrm{supp}(P_X)} P_X(X = x) \log_b P_X(x)$
$H(X) = E\left[-\log_b P_X(X)\right]$
• Now relate information and uncertainty
Entropy
• Example: Suppose that X has two possible values x1, x2 with
P(x1) = p
P(x2) = 1 − p
• Then the uncertainty of X in bits, provided that 0 < p < 1, is
$H(X) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
Entropy
• This is the binary entropy function, denoted H(p); a short sketch evaluating it is given below
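A minimal Python sketch (not from the slides) that evaluates the binary entropy function H(p) at a few points; it shows that H(p) peaks at 1 bit at p = 0.5 and falls toward 0 as p approaches 0 or 1.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits, for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"H({p}) = {binary_entropy(p):.3f} bits")
# Maximum of 1 bit occurs at p = 0.5; H(p) -> 0 as p -> 0 or 1.
```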
Entropy
The message chosen by an information source will simply be a random variable X that takes on one of the values x1, x2, . . . , xn with probabilities p1, p2, . . . , pn respectively
Definition:
$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$
Entropy
• H(X): the entropy of the random variable X, a measure of the uncertainty or ambiguity in X
• Since knowledge of X completely removes this uncertainty,
• H(X) is also a measure of the information that is acquired by knowledge of X, or the information content of X per source output.
• The unit of entropy is bits (or nats) per symbol, or per source output.
Entropy
• We use the binary logarithm here
• Entropy is then measured in units of "bits"
Example:
• The entropy of throwing a fair coin is exactly 1.0 bit.
• Throwing a two-headed coin gives 0.0 bits of entropy
• The entropy of throwing a six-sided unbiased die is approximately 2.58 bits
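The three values above can be rechecked with a short script; this is an illustrative sketch, not part of the slides.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution (zero-probability terms contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1.0, 0.0]))   # two-headed coin: 0.0 bits
print(entropy([1/6] * 6))    # fair six-sided die: ~2.585 bits
```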
Entropy
• Properties:
If the discrete random variable X has n possible values, then
$0 \le H(X) \le \log_2 n$
Conditional Entropy
• Let us say that we have two random variables X and Y
• and we know that Y = y
• H(X|Y = y): the conditional entropy of X given that Y = y
• Definition:
$H(X \mid Y = y) = -\sum_{i} P(X = x_i \mid Y = y) \log_2 P(X = x_i \mid Y = y)$
where P(X = xi | Y = y) is the conditional probability that X = xi given that Y = y
Conditional Entropy
• Conditional entropy of X given Y, where Y is a discrete random variable that takes on one of the values y1, y2, . . . , ym
• Definition:
$H(X \mid Y) = \sum_{j=1}^{m} P(Y = y_j)\, H(X \mid Y = y_j)$
where P(Y = yj) is the probability that Y = yj
Joint Entropy
• Joint entropy of X and Y:
$H(X, Y) = -\sum_{x} \sum_{y} P(x, y) \log_2 P(x, y)$
Logarithmic Measure of Mutual Information
• Assume that we have two random variables X and Y,
• the values of which are both unknown
• If the value of Y is revealed to us, how much will that tell us about X?
• We define the amount of information conveyed by doing so to be the corresponding decrease in entropy
A Logarithmic Measure of Information
• The information content provided by the occurrence of the event Y = y about the event X = x is defined as
$I(x; y) = \log_2 \frac{P(x \mid y)}{P(x)}$
• I(x; y) is called the mutual information between x and y.
• The mutual information between random variables X and Y is defined as the average of I(x; y) and is given by
$I(X; Y) = \sum_{x} \sum_{y} P(x, y)\, I(x; y)$
A Logarithmic Measure of Information
• Definition: The amount of information revealed about X when given Y is
I(X; Y) = H(X) − H(X|Y)
= H(Y) − H(Y|X)
$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{X,Y}(x, y)\, \log_2 \frac{P_{X,Y}(x, y)}{P_X(x)\, P_Y(y)}$
A Logarithmic Measure of Information
• Some of the most important properties of the mutual information: I(X; Y) = I(Y; X); I(X; Y) ≥ 0, with equality if and only if X and Y are independent; I(X; X) = H(X)
• The most important properties of the entropy, joint entropy, and conditional entropy functions: 0 ≤ H(X) ≤ log2 n; H(X|Y) ≤ H(X); H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
• Prove these properties
Example: A measure of information. Let (X, Y) have the following joint distribution:
Calculate H(X), H(X|Y), H(X, Y), and I(X; Y)
Numerical Example Solution:
• Then H(X) = H(1/8, 7/8) = 0.544 bits, H(X|Y = 1) = 0 bits, and H(X|Y = 2) = 1 bit
• We calculate H(X|Y) = (3/4) H(X|Y = 1) + (1/4) H(X|Y = 2) = 0.25 bits.
• Thus, the uncertainty in X is increased if Y = 2 is observed and decreased if Y = 1 is observed,
• but the uncertainty decreases on the average (see the sketch below)
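The joint-distribution table itself is not reproduced in these slides; the Python sketch below assumes the standard table consistent with the stated solution (P(X=1, Y=1) = 0, P(X=2, Y=1) = 3/4, P(X=1, Y=2) = P(X=2, Y=2) = 1/8) and recomputes the quoted values.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed joint distribution p[x][y], consistent with the solution above.
p = {1: {1: 0.0, 2: 1/8}, 2: {1: 3/4, 2: 1/8}}
px = {x: sum(p[x].values()) for x in p}               # P(X=1)=1/8, P(X=2)=7/8
py = {y: sum(p[x][y] for x in p) for y in (1, 2)}     # P(Y=1)=3/4, P(Y=2)=1/4

HX = H(px.values())                                   # ~0.544 bits
HX_given_Y = sum(py[y] * H([p[x][y] / py[y] for x in p]) for y in py)  # 0.25 bits
I = HX - HX_given_Y                                   # mutual information I(X;Y)
print(HX, HX_given_Y, I)
```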
Example: A measure of information. Let (X, Y) have the following joint distribution:
Calculate H(X|Y), H(Y|X), and H(X, Y)
A Logarithmic Measure of Information
Example:
• A single unbiased die is tossed once. If the face of the die is 1, 2, 3, or 4, an unbiased coin is tossed once. If the face of the die is 5 or 6, the coin is tossed twice. Find the information about the face of the die provided by the number of heads obtained.
Example: A measure of information
• Let X be a random variable that denotes whether the face of the die is 1, 2, 3, or 4 OR 5, 6.
• Let Y be the random variable that denotes the number of heads.
• X takes two values:
x1: when the face of the die is 1, 2, 3, or 4
x2: when the face of the die is 5 or 6
• Y takes three values y0, y1, y2 depending upon whether there are no heads, one head, or two heads
Example: A measure of information
P(x1) = 2/3; P(x2) = 1/3
P(y0|x1) = P(y1|x1) = 1/2; P(y2|x1) = 0
P(y0|x2) = P(y2|x2) = 1/4; P(y1|x2) = 1/2
Thus we have
P(y0) = 5/12, P(y1) = 1/2, P(y2) = 1/12
Example: A measure of information
H(X) = H(2/3, 1/3) = 0.918 bits
H(Y) = 1.325 bits, H(Y|X) = 1.167 bits, H(X|Y) = 0.760 bits
I(X; Y) = 0.158 bits (these values are rechecked in the sketch below)
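A short sketch (not from the slides) that recomputes these values from the probabilities given above.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

px = {"x1": 2/3, "x2": 1/3}
py_given_x = {"x1": [1/2, 1/2, 0], "x2": [1/4, 1/2, 1/4]}   # P(y0|x), P(y1|x), P(y2|x)
py = [sum(px[x] * py_given_x[x][k] for x in px) for k in range(3)]  # [5/12, 1/2, 1/12]

HX = H(px.values())                                     # ~0.918 bits
HY = H(py)                                              # ~1.325 bits
HY_given_X = sum(px[x] * H(py_given_x[x]) for x in px)  # ~1.167 bits
I = HY - HY_given_X                                     # ~0.158 bits
HX_given_Y = HX - I                                     # ~0.760 bits
print(HX, HY, HY_given_X, I, HX_given_Y)
```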
A Logarithmic Measure of Information
Example:
• A single unbiased die is tossed once. If the face of the die is 1, 2, 3, or 4, an unbiased coin is tossed twice. If the face of the die is 5 or 6, the coin is tossed twice. Find the information about the face of the die provided by the number of heads obtained.
• Answer ?????
Differential Entropy of a Continuous Random Variable
• Differential Entropy: For a continuous random variable (or vector) X with probability density function $p_X(x)$, the differential entropy h(X) is defined as
$h(X) = -\int_{-\infty}^{\infty} p_X(x) \log_2 p_X(x)\, dx$
• Differential entropy for a Gaussian random variable:
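The formula itself is not reproduced in the slide text; the standard result, obtained by substituting the Gaussian density of variance $\sigma^2$ into the definition above, is
$h(X) = \frac{1}{2}\log_2\!\left(2\pi e\,\sigma^{2}\right)$ bits,
which is independent of the mean and, in nats, equals $\frac{1}{2}\ln\!\left(2\pi e\,\sigma^{2}\right)$.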
Mutual Information for Continuous Random Variables
• For both X and Y continuous:
$I(X; Y) = \int\!\!\int p_{X,Y}(x, y)\, \log_2 \frac{p_{X,Y}(x, y)}{p_X(x)\, p_Y(y)}\, dx\, dy$
• Relationship between entropy and mutual information:
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y)
Example: Entropy
• What is the minimum value of H(p1, . . . , pn) = H(p) as p ranges over the set of n-dimensional probability vectors? Find all p's that achieve this minimum.
Example: A measure of information
• Let X be a random variable taking on a finite number of values. What is the (general) inequality relationship of H(X) and H(Y) if
(a) Y = 2^X ?
(b) Y = cos X ?
Example: A measure of information
• Not for exam? Try this problem: Drawing with and without replacement. An urn contains r red, w white, and b black balls. Which has higher entropy, drawing k ≥ 2 balls from the urn with replacement or without replacement? Set it up and show why.
Tutorial Problems for this part:
Unsolved problems 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.11 and 6.15.
John G. Proakis and Masoud Salehi, Digital Communications, 5th Edition [Chapter 6]
Lossless Coding Algorithms
• Data compression can be achieved by assigning short descriptions to the most frequent outcomes of the data source
• and necessarily longer descriptions to the less frequent outcomes: Variable-Length Source Coding
• For example, in Morse code, the most frequent symbol is represented by a single dot
SHANNON'S FIRST THEOREM (LOSSLESS SOURCE CODING THEOREM)
• Shannon's Source Coding Theorem: Let X denote a DMS with entropy H(X). There exists a lossless source code for this source at any rate R if R > H(X). There exists no lossless code for this source at rates less than H(X).
Source Coding
EXAMPLES OF CODES
• Definition: A source code C for a random variable X is a mapping from X, the range of X, to D*, the set of finite-length strings of symbols from a D-ary alphabet
• Let C(x) denote the codeword corresponding to x
• Let l(x) denote the length of C(x)
• For example, C(red) = 00, C(blue) = 11 is a source code for X = {red, blue} with alphabet D = {0, 1}
EXAMPLES OF CODES
• Definition: The expected length L(C) of a source code C(x) for a random variable X with probability mass function p(x) is given by
$L(C) = \sum_{x} p(x)\, l(x)$
• where l(x) is the length of the codeword associated with x.
• Without loss of generality, we can assume that the D-ary alphabet is D = {0, 1, . . . , D − 1}.
• NOTE: In the Proakis book the notation $\bar{R}$ is used instead of L
EXAMPLES
• Let X be a random variable with the following distribution and codeword assignment:
P(X = 1) = 1/2, codeword C(1) = 0
P(X = 2) = 1/4, codeword C(2) = 10
P(X = 3) = 1/8, codeword C(3) = 110
P(X = 4) = 1/8, codeword C(4) = 111
EXAMPLES
• The entropy H(X) of X is 1.75 bits
• The expected length L(C) = E[l(X)] of this code is also 1.75 bits
• Here we have a code that has the same average length as the entropy
• We note that any sequence of bits can be uniquely decoded into a sequence of symbols of X
EXAMPLES
• Consider another simple example of a code for a random variable:
P(X = 1) = 1/3, codeword C(1) = 0
P(X = 2) = 1/3, codeword C(2) = 10
P(X = 3) = 1/3, codeword C(3) = 11
• Just as in the previous example, the code is uniquely decodable
• However, in this case the entropy is log2 3 = 1.58 bits
• The average length of the encoding is 1.66 bits
• Here E[l(X)] > H(X) (both examples are checked in the sketch below)
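A small illustrative sketch (not from the slides) that computes H(X) and E[l(X)] for both example codes above.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_length(probs, codewords):
    return sum(p * len(c) for p, c in zip(probs, codewords))

# Example 1: dyadic probabilities -> expected length equals entropy.
p1, c1 = [1/2, 1/4, 1/8, 1/8], ["0", "10", "110", "111"]
print(entropy(p1), expected_length(p1, c1))   # 1.75, 1.75

# Example 2: uniform over 3 symbols -> expected length exceeds entropy.
p2, c2 = [1/3, 1/3, 1/3], ["0", "10", "11"]
print(entropy(p2), expected_length(p2, c2))   # ~1.585, ~1.667
```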
EXAMPLES of codes
• Definition: A code is said to be nonsingular if every element of the range of X maps into a different string in D*; that is,
$x \ne x' \;\Rightarrow\; C(x) \ne C(x')$
EXAMPLES of codes
• Nonsingularity suffices for an unambiguous description of a single value of X
• But we usually wish to send a sequence of values of X
• In such cases we can ensure decodability by adding a special symbol (a "comma") between any two codewords
• But this is an inefficient use of the special symbol
EXAMPLES of CODES
• Definition: The extension C* of a code C is the mapping from finite-length strings of X to finite-length strings of D, defined by
$C^*(x_1 x_2 \cdots x_n) = C(x_1)\, C(x_2) \cdots C(x_n)$
• where C(x1)C(x2) · · · C(xn) indicates concatenation of the corresponding codewords
• Example: If C(x1) = 00 and C(x2) = 11, then C(x1x2) = 0011.
• We also use the notation x^n to denote (x1, x2, . . . , xn)
EXAMPLES of CODES
• Definition: A code is called uniquely decodable if its extension is nonsingular.
• In other words, any encoded string in a uniquely decodable code has only one possible source string producing it
• However, one may have to look at the entire string to determine even the first symbol in the corresponding source string
EXAMPLES of CODES
• Definition: A code is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword.
EXAMPLES of CODES
• An instantaneous code can be decoded without reference to future codewords since the end of a codeword is immediately recognizable
• Hence, for an instantaneous code, the symbol xi can be decoded as soon as we come to the end of the codeword corresponding to it
• We need not wait to see the codewords that come later.
EXAMPLES
• Let X be a random variable with the following distribution and codeword assignment:
P(X = 1) = 1/2, codeword C(1) = 0
P(X = 2) = 1/4, codeword C(2) = 10
P(X = 3) = 1/8, codeword C(3) = 110
P(X = 4) = 1/8, codeword C(4) = 111
• The binary string 01011111010 produced by the code is parsed as 0, 10, 111, 110, 10 (see the decoding sketch below)
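Because the code is instantaneous, a decoder can emit each symbol as soon as a codeword is completed, with no lookahead; a minimal Python sketch (illustrative, not from the slides):

```python
# Prefix (instantaneous) decoding: scan left to right and emit a symbol as soon
# as the buffered bits match a codeword -- no lookahead is needed.
code = {"0": 1, "10": 2, "110": 3, "111": 4}   # codeword -> source symbol

def decode(bits):
    symbols, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in code:          # works because no codeword is a prefix of another
            symbols.append(code[buffer])
            buffer = ""
    return symbols

print(decode("01011111010"))        # [1, 2, 4, 3, 2]  i.e. 0, 10, 111, 110, 10
```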
Classes of codes
Kraft Inequality
• Theorem: For any instantaneous code (prefix code) over an alphabet of size D, the codeword lengths l1, l2, . . . , lm must satisfy the inequality
$\sum_{i=1}^{m} D^{-l_i} \le 1$
• Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these codeword lengths (a quick numerical check follows below)
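A quick sketch (not from the slides) checking the Kraft sum for a few sets of binary codeword lengths:

```python
def kraft_sum(lengths, D=2):
    """Sum of D^-l over the codeword lengths; <= 1 is required for a prefix code."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> a prefix code exists (e.g. 0, 10, 110, 111)
print(kraft_sum([1, 2, 2]))      # 1.0  -> exists (0, 10, 11)
print(kraft_sum([1, 1, 2]))      # 1.25 -> no prefix code with these lengths
```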
OPTIMAL CODES
• Consider the problem of finding the prefix code with the minimum expected length
• Problem formulation:
minimize $L = \sum_{i} p_i l_i$ subject to $\sum_{i} D^{-l_i} \le 1$
OPTIMAL CODES
• A simple analysis by calculus suggests the form of the minimizing $l_i^*$
• We neglect the integer constraint on the $l_i$
• We assume equality in the constraint of the problem formulation
OPTIMAL CODES
• Hence, we can write the constrained minimization using Lagrange multipliers as the minimization of
$J = \sum_{i} p_i l_i + \lambda \left( \sum_{i} D^{-l_i} \right)$
• Differentiating with respect to $l_i$, we obtain
$\frac{\partial J}{\partial l_i} = p_i - \lambda D^{-l_i} \log_e D$
• Setting the derivative to 0, we obtain
$D^{-l_i} = \frac{p_i}{\lambda \log_e D}$
OPTIMAL CODES
• Substituting this in the constraint to find λ,
• we find λ = 1 / log_e D
• Hence $p_i = D^{-l_i}$,
• yielding optimal code lengths $l_i^* = -\log_D p_i$
• This noninteger choice of codeword lengths yields expected codeword length
$L^* = \sum_{i} p_i l_i^* = -\sum_{i} p_i \log_D p_i = H_D(X)$
OPTIMAL CODES
• This noninteger choice of codeword lengths yields expected codeword length $L^* = H_D(X)$
• But since the $l_i$ must be integers,
• we will not always be able to set the codeword lengths to $l_i^* = -\log_D p_i$
• Instead, we should choose a set of codeword lengths $l_i$ "close" to the optimal set. Rather than demonstrate by calculus that $l_i^*$ is a global minimum,
• we verify optimality directly in the proof of the Shannon theorem
Data Compression
Theorem: The expected length L of any instantaneous D-ary code for a random variable X is greater than or equal to the entropy HD(X), that is,
$L \ge H_D(X)$
with equality if and only if $D^{-l_i} = p_i$
SOURCE CODING THEOREM FOR PREFIX CODES
• Bounds on the optimal code length: there exists a code that achieves an expected description length L within 1 bit of the lower bound; that is,
$H_D(X) \le L < H_D(X) + 1$
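One way to see the upper bound is the binary Shannon code, which uses integer lengths l(x) = ⌈−log2 p(x)⌉; the small sketch below (illustrative, not from the slides) checks the bound on an example distribution.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def shannon_code_lengths(probs):
    """Codeword lengths l_i = ceil(-log2 p_i); they satisfy the Kraft inequality."""
    return [math.ceil(-math.log2(p)) for p in probs]

p = [0.4, 0.2, 0.2, 0.1, 0.1]
lengths = shannon_code_lengths(p)                  # [2, 3, 3, 4, 4]
L = sum(pi * li for pi, li in zip(p, lengths))     # expected length, here 2.8 bits
print(entropy(p), L)                               # H(X) <= L < H(X) + 1
```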
Example of Lossless coding Algorithm
Huffman codes
• An optimal (shortest expected length) prefix code for a given distribution can be constructed by a simple algorithm discovered by Huffman
• Any other code for the same alphabet cannot have a lower expected length than the code constructed by the algorithm
• Huffman codes: Example
• Consider a random variable X taking values in the set X = {s0, s1, s2, s3, s4} with probabilities 0.4, 0.2, 0.2, 0.1, 0.1, respectively. Construct the optimal code.
Data Compression
Huffman codes: Example
• We expect the optimal binary code for X to have the longest codewords assigned to the symbols s3 and s4.
• These two lengths must be equal, since otherwise we can delete a bit from the longer codeword and still have a prefix code, but with a shorter expected length.
Data Compression
• In general, we can construct a code in which the two longest codewords differ only in the last bit.
• For this code, we can combine the symbols s3 and s4 into a single source symbol, with a probability assignment 0.20.
• Proceeding this way, we combine the two least likely symbols into one symbol until we are finally left with only one symbol
• Codewords are then assigned to the symbols
Huffman codes: Example
• Average codeword length L = 2.2 bits/symbol
• Entropy H(X) ≈ 2.122 bits (a Huffman construction sketch follows below)
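A compact Huffman construction in Python (an illustrative sketch, not necessarily the exact tree drawn in the lecture figures) that reproduces the average length of 2.2 bits for this source:

```python
import heapq, math

def huffman_code(probs):
    """Build a binary Huffman code; returns {symbol: codeword}."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least probable nodes
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"s0": 0.4, "s1": 0.2, "s2": 0.2, "s3": 0.1, "s4": 0.1}
code = huffman_code(probs)
L = sum(p * len(code[s]) for s, p in probs.items())
H = -sum(p * math.log2(p) for p in probs.values())
print(code, L, H)   # L = 2.2 bits/symbol, H ~ 2.122 bits
```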
Data Compression
• Consider a random variable X taking values in the set X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively. Construct the binary code using the Huffman algorithm and calculate the average codeword length.
Note: Book reading is recommended for the complete understanding of topics discussed in these slides.
• Request: Please mail me if you find any typos in these slides