
UNIT II

Part I Introduction to Information Theory; Discrete Memoryless Sources; Information Measures; Source Coding Theorem;

Source Coding Techniques; Channel Capacity; Channel Coding and Channel Capacity Theorems

References

• John G. Proakis and Masoud Salehi, Digital Communications, 5th Edition [Chapter 6]

Other References:

• Thomas Cover and Joy Thomas, Elements of Information Theory

• Simon Haykin, Communication Systems, 4th Edition [Chapter 9]


Introduction

• Information theory deals with the fundamental limits of communication: compression and transmission.

Compression: source coding

Transmission: communication channel modeling and channel capacity

Introduction

• The objective of source coding is to provide an encoding scheme that transmits the information with a reduced data size.

Question: What are the information sources?

Question: What is the information content of a source?

Question: What is the measure of information?

Source

• Communication systems are designed to transmit the information generated by a source to some destination

• Information sources may take a variety of different forms

• For example, in radio broadcasting, the source is generally an audio source (voice or music)

• In TV broadcasting, the information source is a video source whose output is a moving image

• Outputs of these sources are analog signals and, hence, the sources are called analog sources

• In contrast, computers and storage devices, such as magnetic or optical disks, produce discrete outputs (usually binary or ASCII characters), and hence are called discrete sources


Mathematical Models for Information Sources

• Any information source produces an output that is random; i.e., the source output is characterized in statistical terms.

• Otherwise, if the source output were known exactly, there would be no need to transmit it.

• Both discrete and analog information sources, as well as mathematical models for each type of source, will be discussed.

Mathematical Models for Information Sources

• The simplest type of a discrete source is one that emits a sequence of letters selected from a finite alphabet.

• For example, a binary source emits a binary sequence of the form 100101110 . . . , where the alphabet consists of the two letters {0, 1}


Mathematical Models for Information Sources

• To construct a mathematical model for a discrete source, we assume that each letter in the alphabet {x1, x2, . . . , xL} has a given probability pk of occurrence.

• Try to relate {x1, x2, . . . , xL} to the outputs of, e.g., a microphone, an image sensor, or a keyboard.

Mathematical Models for Information Sources

Discrete Memoryless Source (DMS):

• Assume that the output sequence from the source is statistically independent.

• That is, the current output letter is statistically independent of all past and future outputs.

• A source whose output satisfies the condition of statistical independence among output letters is said to be memoryless

• If the source is discrete, it is called a Discrete Memoryless Source (DMS)

• The mathematical model for a DMS is a sequence of i.i.d. random variables {Xi}

Mathematical Models for Information Sources

• If the output of the discrete source is statistically dependent, such as English text, we may construct a mathematical model based on statistical stationarity

• By definition, a discrete source is said to be stationary if the joint probabilities for any arbitrary length sequence of source outputs are invariant under a shift in the time origin.


Mathematical Models for Information Sources

• Analog Sources: An analog source has an output waveform x(t) that is a sample function of a stochastic process X(t).

• We assume that X(t) is a stationary stochastic process with defined autocorrelation function

• By applying the sampling theorem, we may convert the output of an analog source to an equivalent discrete-time source

• We note that the output samples from such stationary sources are generally continuous, and hence they cannot be represented in digital form without some loss of precision; quantization of the sample values is therefore required.

Measure of Information

Hartley's Measure (1928)

• The information provided by observing a discrete random variable X is

I(X) = \log_b n

where n is the number of possible values of X.

• Limitation of Hartley's measure: it ignores the probability of each individual value of X.

Measure of Information: Shannon's Measure

• If the i-th possible value of X has probability p_i, then the Hartley information \log(1/p_i) should be weighted by p_i, giving

-\sum_{i=1}^{n} p_i \log p_i

as the amount of information provided by X.

• This is Shannon's measure.

Entropy

• The uncertainty (entropy) of a discrete random variable X is given by

H(X) = -\sum_{x \in \mathrm{supp}(P_X)} P_X(x) \log_b P_X(x)

or, equivalently,

H(X) = E[-\log_b P_X(X)]

• Now relate information and uncertainty.

Entropy

• Example: Suppose that X has two possible values x1 and x2, with

P(x1) = p, P(x2) = 1 - p

• Then the uncertainty of X in bits, provided that 0 < p < 1, is

H(X) = -p \log_2 p - (1 - p) \log_2 (1 - p)

Entropy

• This function is known as the binary entropy function and is denoted by H(p)

Entropy

The message chosen by an information source will simply be a random variable X that takes on one of the values x1, x2, …, xn with probabilities p1, p2, …, pn, respectively.

Definition:

H(X) = -\sum_{i=1}^{n} p_i \log p_i

Entropy

• H(X): entropy of the random variable X and is a measure of uncertainty or ambiguity in X

• Since knowledge of X completely removes the uncertainty about it, H(X) is also a measure of the information that is acquired by knowledge of X, i.e., the information content of X per source output

• Unit for entropy is bits (or nats) per symbol, or per source output.


Entropy

• We use the binary logarithm here

• Entropy is then measured in units of “bits”

Example:

• The entropy of throwing a fair coin is exactly 1.0 bit.

• Throwing a two-headed coin gives 0.0 bits of entropy

• The entropy of throwing a six-sided unbiased die is approximately 2.58 bits
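These three values can be checked numerically; a minimal Python sketch (the helper name entropy_bits is an illustrative choice, not from the slides):

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(X) in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy_bits([1.0]))         # two-headed coin: 0.0 bits
print(entropy_bits([1/6] * 6))     # fair six-sided die: ~2.585 bits
```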


Entropy

Properties:

If the discrete random variable X has n possible values, then

0 ≤ H(X) ≤ \log_2 n

Conditional Entropy

• Let's say that we have two random variables X and Y

• Suppose we know that Y = y

• H(X|Y = y): conditional entropy of X given that Y = y

Definition:

H(X \mid Y = y) = -\sum_{i=1}^{n} P(X = x_i \mid Y = y) \log P(X = x_i \mid Y = y)

where P(X = x_i | Y = y) is the conditional probability that X = x_i given that Y = y

Conditional Entropy

• Conditional entropy of X given Y

Definition:

H(X \mid Y) = \sum_{j=1}^{m} P(Y = y_j) \, H(X \mid Y = y_j)

where P(Y = y_j) is the probability that Y = y_j, and Y is a discrete random variable that takes on one of the values y_1, y_2, …, y_m.

Joint Entropy

• Joint entropy
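In standard notation, the joint entropy of two discrete random variables X and Y, and its chain-rule decomposition, can be written as

H(X, Y) = -\sum_{x} \sum_{y} P_{X,Y}(x, y) \log_b P_{X,Y}(x, y)

H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)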


Logarithmic measure of Mutual Information

Assume that we have two random variables X and Y, whose values are both unknown.

If the value of Y is revealed to us, how much will that tell us about X?

We define the amount of information conveyed by doing so to be the corresponding decrease in entropy.

A Logarithmic measure of information

The information content provided by the occurrence of the event Y = y about the event X = x is defined as

I(x; y) = \log \frac{P(X = x \mid Y = y)}{P(X = x)}

I(x; y) is called the mutual information between x and y.

The mutual information between the random variables X and Y is defined as the average of I(x; y) over the joint distribution:

I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{X,Y}(x, y) \log \frac{P_{X,Y}(x, y)}{P_X(x) P_Y(y)}

Definition: The amount of information revealed about X when given Y is

I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)

A Logarithmic measure of information

• Some of the most important properties of the mutual information, the entropy, and the joint and conditional entropies are summarized below.

• Prove these properties as an exercise.
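The standard properties usually summarized at this point include the following:

I(X; Y) \ge 0, with equality if and only if X and Y are independent

I(X; Y) = I(Y; X)

I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)

I(X; X) = H(X)

0 \le H(X) \le \log n, for an alphabet of size n

H(X \mid Y) \le H(X) (conditioning reduces entropy)

H(X, Y) = H(X) + H(Y \mid X) \le H(X) + H(Y)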


Example: A measure of information. Let (X, Y) have the following joint distribution:

Calculate H(X), H(X|Y), H(X,Y), and I(X;Y)

Numerical Example Solution:

Then H(X) = H(1/8; 7/8) = 0.544 bit, H(X|Y = 1) = 0 bits, and H(X|Y = 2) = 1 bit

We calculate H(X|Y) = ¾ H(X|Y = 1) + ¼ H(X|Y = 2) = 0.25 bit.

Thus, the uncertainty in X is increased if Y = 2 is observed and decreased if Y = 1 is observed

But uncertainty decreases on the average
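A sketch that assumes a joint pmf consistent with the values quoted above (P(Y = 1) = 3/4; X is deterministic when Y = 1 and equally split between its two values when Y = 2); the entries of p[(x, y)] below are therefore an assumption, not taken from the slide:

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed joint pmf p[(x, y)], chosen only to match the numbers quoted above.
p = {(1, 1): 0.0, (2, 1): 3/4, (1, 2): 1/8, (2, 2): 1/8}

px = {x: sum(v for (xx, _), v in p.items() if xx == x) for x in (1, 2)}
py = {y: sum(v for (_, yy), v in p.items() if yy == y) for y in (1, 2)}

H_X = H(px.values())                 # ~0.544 bits
H_XY = H(p.values())                 # joint entropy H(X, Y), ~1.061 bits
H_X_given_Y = H_XY - H(py.values())  # H(X|Y) = H(X,Y) - H(Y) = 0.25 bits
I_XY = H_X - H_X_given_Y             # mutual information, ~0.294 bits

print(H_X, H_X_given_Y, H_XY, I_XY)
```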


Example: A measure of information. Let (X, Y) have the following joint distribution:

Calculate H(X|Y), H(Y|X), and H(X,Y)


A Logarithmic measure of information

Example:

A single unbiased die is tossed once. If the face of the die shows 1, 2, 3, or 4, an unbiased coin is tossed once. If the face shows 5 or 6, the coin is tossed twice. Find the information provided about the face of the die by the number of heads obtained.

Example: A measure of information

Let X be a random variable that denotes whether the face of the die is in {1, 2, 3, 4} or in {5, 6}.

Let Y be the random variable that denotes the number of heads.

X takes two values:

x1: when the face of the die is 1, 2, 3, or 4; x2: when the face of the die is 5 or 6

Y takes three values y0, y1, y2, depending on whether there are no heads, one head, or two heads.

Example: A measure of information

P(x1) = 2/3; P(x2) = 1/3

P(y0|x1) = P(y1|x1) = 1/2; P(y2|x1) = 0

P(y0|x2) = P(y2|x2) = 1/4; P(y1|x2) = 1/2

Thus we have

P(y0) = 5/12, P(y1) = 1/2, P(y2) = 1/12

Example: A measure of information

H(X) = H(2/3, 1/3) = 0.918 bits

H(Y) = 1.325 bits

H(Y|X) = 1.167 bits

H(X|Y) = H(X) - I(X;Y) = 0.760 bits

I(X;Y) = H(Y) - H(Y|X) = 0.158 bits
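These values follow directly from the probabilities on the previous slide; a minimal verification sketch (variable names are illustrative):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

px = {"x1": 2/3, "x2": 1/3}
py_given_x = {"x1": {"y0": 1/2, "y1": 1/2, "y2": 0.0},
              "x2": {"y0": 1/4, "y1": 1/2, "y2": 1/4}}

# Marginal P(y) and the conditional entropy H(Y|X)
py = {y: sum(px[x] * py_given_x[x][y] for x in px) for y in ("y0", "y1", "y2")}
H_Y = H(py.values())                                              # ~1.325 bits
H_Y_given_X = sum(px[x] * H(py_given_x[x].values()) for x in px)  # ~1.167 bits

I_XY = H_Y - H_Y_given_X      # ~0.158 bits
H_X = H(px.values())          # ~0.918 bits
H_X_given_Y = H_X - I_XY      # ~0.760 bits

print(py, H_Y, H_Y_given_X, I_XY, H_X, H_X_given_Y)
```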


A Logarithmic measure of information

Example:

• A single unbiased die is tossed once. If the face of the die shows 1, 2, 3, or 4, an unbiased coin is tossed twice. If the face shows 5 or 6, the coin is tossed twice. Find the information provided about the face of the die by the number of heads obtained.

• Answer?

Differential Entropy of a Continuous Random Variable

Differential Entropy: For a continuous random variable (or vector) X with probability density function p_X, the differential entropy h(X) is defined as

h(X) = -\int p_X(x) \log p_X(x) \, dx

Differential entropy for a Gaussian random variable
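For X ~ N(μ, σ²), the standard closed-form result is

h(X) = \frac{1}{2} \log_2 \left( 2 \pi e \sigma^2 \right) \ \text{bits}

which depends only on the variance σ², not on the mean.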


Mutual Information for Continuous Random Variables

• For both X and Y continuous
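In standard form, with joint pdf p_{X,Y} and marginal pdfs p_X and p_Y,

I(X; Y) = \int \int p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_X(x) \, p_Y(y)} \, dx \, dy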


Relationship between entropy and mutual information


Example: Entropy

What is the minimum value of H(p1, . . . , pn) = H(p) as p ranges over the set of n-dimensional probability vectors? Find all p's that achieve this minimum.

Example: A measure of information

Let X be a random variable taking on a finite number of values. What is the (general) inequality relationship between H(X) and H(Y) if

(a) Y = 2^X ?

(b) Y = cos(X) ?

Example: A measure of information

Not for exam? Try this

Problem: Drawing with and without replacement. An urn contains r red, w white, and b black balls. Which has higher entropy: drawing k ≥ 2 balls from the urn with replacement or without replacement? Set it up and show why.

Tutorial problems for this part:

Unsolved problems 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.11, and 6.15 from John G. Proakis and Masoud Salehi, Digital Communications, 5th Edition [Chapter 6].


Lossless Coding Algorithms

Data compression can be achieved by assigning short descriptions to the most frequent outcomes of the data source

and necessarily longer descriptions to the less frequent outcomes: Variable-Length Source Coding.

For example, in Morse code, the most frequent symbol is represented by a single dot.

SHANNON'S FIRST THEOREM (LOSSLESS SOURCE CODING THEOREM)

Let X denote a DMS with entropy H(X). There exists a lossless source code for this source at any rate R if R > H(X). There exists no lossless code for this source at rates less than H(X).


Source Coding

EXAMPLES OF CODES

Definition: A source code C for a random variable X is a mapping from 𝒳, the range of X, to D*, the set of finite-length strings of symbols from a D-ary alphabet.

• Let C(x) denote the codeword corresponding to x, and let l(x) denote the length of C(x).

• For example, C(red) = 00, C(blue) = 11 is a source code for 𝒳 = {red, blue} with alphabet D = {0, 1}.

EXAMPLES OF CODES

Definition: The expected length L(C) of a source code C(x) for a random variable X with probability mass function p(x) is given by

L(C) = \sum_{x} p(x) \, l(x)

• where l(x) is the length of the codeword associated with x.

• Without loss of generality, we can assume that the D-ary alphabet is D = {0, 1, . . . , D − 1}.

NOTE: In the Proakis book, the notation R is used instead of L.

EXAMPLES

• Let X be a random variable with the following distribution and codeword assignment:

P(X = 1) = 1/2, codeword C(1) = 0
P(X = 2) = 1/4, codeword C(2) = 10
P(X = 3) = 1/8, codeword C(3) = 110
P(X = 4) = 1/8, codeword C(4) = 111

EXAMPLES

The entropy H(X) of X is 1.75 bits.

The expected length L(C) = E[l(X)] of this code is also 1.75 bits.

Here we have a code that has the same average length as the entropy.

We note that any sequence of bits can be uniquely decoded into a sequence of symbols of X.
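Both numbers can be checked with a short sketch:

```python
import math

probs = {1: 1/2, 2: 1/4, 3: 1/8, 4: 1/8}
code  = {1: "0", 2: "10", 3: "110", 4: "111"}

H_X = -sum(p * math.log2(p) for p in probs.values())   # entropy: 1.75 bits
L_C = sum(probs[x] * len(code[x]) for x in probs)      # expected length: 1.75 bits

print(H_X, L_C)
```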


EXAMPLES

Consider another simple example of a code for a random variable:

P(X = 1) = 1/3, codeword C(1) = 0
P(X = 2) = 1/3, codeword C(2) = 10
P(X = 3) = 1/3, codeword C(3) = 11

Just as in the previous example, the code is uniquely decodable.

However, in this case the entropy is log 3 = 1.58 bits and the average length of the encoding is 1.66 bits.

Here E[l(X)] > H(X).

EXAMPLES of codes

Definition: A code is said to be nonsingular if every element of the range of X maps into a different string in D*; that is,

x \ne x' \;\Rightarrow\; C(x) \ne C(x')

EXAMPLES of codes

• Nonsingularity suffices for an unambiguous description of a single value of X

• But we usually wish to send a sequence of values of X

• In such cases we can ensure decodability by adding a special symbol (a “comma”) between any two codewords

But this is an inefficient use of the special symbol


EXAMPLES of CODES

Definition:

The extension C* of a code C is the mapping from finite-length strings of 𝒳 to finite-length strings of D, defined by

C(x_1 x_2 \cdots x_n) = C(x_1) C(x_2) \cdots C(x_n)

• where C(x1)C(x2) · · · C(xn) indicates concatenation of the corresponding codewords.

Example: If C(x1) = 00 and C(x2) = 11, then C(x1 x2) = 0011.

• We also use the notation x^n to denote (x1, x2, . . . , xn).

EXAMPLES of CODES

Definition:

A code is called uniquely decodable if its extension is nonsingular.

In other words, any encoded string in a uniquely decodable code has only one possible source string producing it.

However, one may have to look at the entire string to determine even the first symbol in the corresponding source string.


EXAMPLES of CODES

Definition: A code is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword.

EXAMPLES of CODES

• An instantaneous code can be decoded without reference to future codewords, since the end of a codeword is immediately recognizable

• Hence, for an instantaneous code, the symbol x_i can be decoded as soon as we come to the end of the codeword corresponding to it

• We need not wait to see the codewords that come later.


EXAMPLES

• Let X be a random variable with the following distribution and codeword assignment:

P(X = 1) = 1/2, codeword C(1) = 0
P(X = 2) = 1/4, codeword C(2) = 10
P(X = 3) = 1/8, codeword C(3) = 110
P(X = 4) = 1/8, codeword C(4) = 111

• The binary string 01011111010 produced by the code is parsed as 0, 10, 111, 110, 10
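Because no codeword is a prefix of another, the string can be parsed greedily from left to right; a minimal decoding sketch for this particular code:

```python
code = {"1": "0", "2": "10", "3": "110", "4": "111"}   # symbol -> codeword
decode_map = {cw: s for s, cw in code.items()}

def decode_prefix(bits):
    """Greedy left-to-right parsing; works because the code is a prefix code."""
    symbols, current = [], ""
    for b in bits:
        current += b
        if current in decode_map:          # end of a codeword reached
            symbols.append(decode_map[current])
            current = ""
    return symbols

print(decode_prefix("01011111010"))   # ['1', '2', '4', '3', '2'], i.e., 0, 10, 111, 110, 10
```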

Classes of codes

Kraft Inequality

Theorem: For any instantaneous code (prefix code) over an alphabet of size D, the codeword lengths l_1, l_2, . . . , l_m must satisfy the inequality

\sum_{i=1}^{m} D^{-l_i} \le 1

• Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.
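As a quick numeric check, a sketch applying the inequality to the binary example code used earlier (lengths 1, 2, 3, 3) and to a set of lengths that no binary prefix code can have:

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality: sum of D**(-l) over the codeword lengths."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> satisfied (with equality)
print(kraft_sum([1, 1, 2]))      # 1.25 -> violated: no binary prefix code exists
```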


OPTIMAL CODES

Consider the problem of finding the prefix code with the minimum expected length.

Problem formulation: minimize

L = \sum_{i=1}^{m} p_i l_i

subject to the Kraft inequality

\sum_{i=1}^{m} D^{-l_i} \le 1

• A simple analysis by calculus suggests the form of the minimizing l_i

• We neglect the integer constraint on l_i

• Assume equality in the constraint

OPTIMAL CODES

• Hence, we can write the constrained minimization using Lagrange multipliers as the minimization of a Lagrangian J

• Differentiating with respect to l_i and setting the derivative to 0, we obtain D^{-l_i} in terms of p_i and λ

• Substituting this into the constraint to find λ, we find λ = 1 / \log_e D

• Hence the optimal code lengths are l_i^* = -\log_D p_i

• This noninteger choice of codeword lengths yields expected codeword length L^* = \sum_i p_i l_i^* = H_D(X) (the derivation is written out below)
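One way to write out the steps referenced above (a sketch, taking the Kraft constraint with equality):

J = \sum_{i=1}^{m} p_i l_i + \lambda \left( \sum_{i=1}^{m} D^{-l_i} - 1 \right)

\frac{\partial J}{\partial l_i} = p_i - \lambda D^{-l_i} \log_e D

Setting \partial J / \partial l_i = 0 gives D^{-l_i} = p_i / (\lambda \log_e D). Substituting this into \sum_i D^{-l_i} = 1 gives \lambda = 1 / \log_e D, hence D^{-l_i} = p_i and

l_i^* = -\log_D p_i, \qquad L^* = \sum_i p_i l_i^* = -\sum_i p_i \log_D p_i = H_D(X)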

OPTIMAL CODES

• But since the l_i must be integers, we will not always be able to set the codeword lengths to l_i^* = -\log_D p_i

• Instead, we should choose a set of codeword lengths l_i that is “close” to the optimal set

• Rather than demonstrating by calculus that l_i^* is a global minimum, we verify optimality directly in the proof of the source coding theorem

Data Compression

Theorem: The expected length L of any instantaneous D-ary code for a random variable X is greater than or equal to the entropy H_D(X), that is,

L \ge H_D(X)

with equality if and only if D^{-l_i} = p_i for all i.


SOURCE CODING THEOREM FOR PREFIX CODES

Bounds on the optimal code length: there is a prefix code that achieves an expected description length L within 1 bit of the lower bound; that is,

H_D(X) \le L < H_D(X) + 1

Example of Lossless coding Algorithm

Huffman codes

• An optimal (shortest expected length) prefix code for a given distribution can be constructed by a simple algorithm discovered by Huffman

• Any other code for the same alphabet cannot have a lower expected length than the code constructed by the algorithm


Huffman codes: Example

• Consider a random variable X taking values in the set X = {s0, s1, s2, s3, s4}

with probabilities 0.4, 0.2, 0.2, 0.1, 0.1, respectively. Construct the optimal code.


Data Compression

Huffman codes: Example

• We expect the optimal binary code for X to have the longest codewords assigned to the symbols s3 and s4.

• These two lengths must be equal, since otherwise we can delete a bit from the longer codeword and still have a prefix code, but with a shorter expected length.


Data Compression

• In general, we can construct a code in which the two longest codewords differ only in the last bit.

• For this code, we can combine the symbols s3 and s4 into a single source symbol, with a probability assignment of 0.20.

• Proceeding in this way, we keep combining the two least likely symbols into one symbol until we are finally left with only one symbol,

• and then assign codewords to the symbols (a sketch of the procedure is given below).
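A compact sketch of this procedure in Python (heap-based; tie-breaking may yield a different but equally optimal codeword assignment):

```python
import heapq

def huffman(probs):
    """Binary Huffman code: repeatedly merge the two least likely symbols."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # least likely group
        p2, _, c2 = heapq.heappop(heap)   # second least likely group
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"s0": 0.4, "s1": 0.2, "s2": 0.2, "s3": 0.1, "s4": 0.1}
code = huffman(probs)
L = sum(probs[s] * len(w) for s, w in code.items())
print(code, L)   # average codeword length L = 2.2
```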


Huffman codes: Example

Average codeword length L = 2.2 bits

Entropy H(X) = 2.12193 bits

Data Compression

Consider a random variable X taking values in the set X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively. Construct the binary code using Huffman algorithm and calculate average codeword length.


Note: Reading the book is recommended for a complete understanding of the topics discussed in these slides.

Request: Please mail if you find any typos in these slides.
