CS626-460: Speech, NLP and the Web
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
Lectures 5 and 7: HMM and Viterbi
10th and 16th Jan, 2012
(Lecture 6 was on Computational Biomedicine research at Houston University by Prof. Ioannis)
HMM Definition

(Figure: the "NLP Trinity" — Languages (Hindi, Marathi, English, French) x Problems (Morph Analysis, POS Tagging, Parsing, Semantics) x Algorithms (HMM, CRF, MEMM))

Set of states: S, where |S| = N
Output alphabet: O, where |O| = K
Transition probabilities: A = {a_ij}
    a_ij = prob. of going from state S_i to state S_j
Emission probabilities: B = {b_pq}
    b_pq = prob. of outputting symbol O_q from state S_p
Initial state probabilities: π

λ = (A, B, π)
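The definition above can be written down concretely. A minimal Python sketch (not from the lecture; the 2-state, 3-symbol model and all numbers are made up for illustration):

```python
# A hypothetical HMM lambda = (A, B, pi): 2 states, 3 output symbols.
A = {                      # A[i][j]: prob. of going from state Si to state Sj
    "S1": {"S1": 0.7, "S2": 0.3},
    "S2": {"S1": 0.4, "S2": 0.6},
}
B = {                      # B[p][q]: prob. of outputting symbol Oq from state Sp
    "S1": {"a": 0.5, "b": 0.4, "c": 0.1},
    "S2": {"a": 0.1, "b": 0.3, "c": 0.6},
}
pi = {"S1": 0.6, "S2": 0.4}  # initial state probabilities

# Each row of A and B, and pi itself, must be a probability distribution.
for row in list(A.values()) + list(B.values()) + [pi]:
    assert abs(sum(row.values()) - 1.0) < 1e-9
```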
Markov Processes

Properties

Limited Horizon: given the previous k states, a state i is independent of the states before them:
    P(X_t = i | X_{t-1}, X_{t-2}, ..., X_0) = P(X_t = i | X_{t-1}, X_{t-2}, ..., X_{t-k})
This is an order-k Markov process.

Time invariance (shown for k = 1):
    P(X_t = i | X_{t-1} = j) = P(X_1 = i | X_0 = j) = ... = P(X_n = i | X_{n-1} = j)
Three basic problems (contd.)

Problem 1: Likelihood of a sequence
    Forward Procedure
    Backward Procedure
Problem 2: Best state sequence
    Viterbi Algorithm
Problem 3: Re-estimation
    Baum-Welch (Forward-Backward) Algorithm
Probabilistic Inference

O: Observation Sequence
S: State Sequence

Given O, find S* = argmax_S P(S|O); this is called probabilistic inference.

Infer "Hidden" from "Observed"

How is this inference different from logical inference based on propositional or predicate calculus?
Essentials of Hidden Markov Model

1. Markov + Naive Bayes
2. Uses both transition and observation probability:
    P(S_k --O_k--> S_{k+1}) = P(O_k | S_k) . P(S_{k+1} | S_k)
3. Effectively makes the Hidden Markov Model a Finite State Machine (FSM) with probability
Probability of Observation Sequence

P(O) = Σ_S P(O, S)
     = Σ_S P(S) . P(O|S)

Without any restriction, search space size = |S|^|O|

Continuing with the Urn example
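The |S|^|O| blow-up can be seen by brute-force enumeration. A sketch with a hypothetical 2-state, 2-symbol HMM (all numbers made up):

```python
# Brute-force P(O) = sum over ALL state sequences S of P(S) . P(O|S).
from itertools import product

states = ["S1", "S2"]
pi = {"S1": 0.6, "S2": 0.4}
A  = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
B  = {"S1": {"a": 0.8, "b": 0.2},  "S2": {"a": 0.3, "b": 0.7}}

O = ["a", "b", "a"]
total = 0.0
n_sequences = 0
for S in product(states, repeat=len(O)):   # |S|^|O| sequences
    p = pi[S[0]] * B[S[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[S[t - 1]][S[t]] * B[S[t]][O[t]]
    total += p
    n_sequences += 1

assert n_sequences == len(states) ** len(O)   # 2**3 = 8 sequences
assert 0.0 < total <= 1.0                     # total is P(O)
```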
Colored Ball choosing

            Urn 1   Urn 2   Urn 3
# of Red      30      10      60
# of Green    50      40      10
# of Blue     20      50      30
Example (contd.)

Given:

Transition Probability              Observation/output Probability
      U1    U2    U3                      R     G     B
U1    0.1   0.4   0.5               U1    0.3   0.5   0.2
U2    0.6   0.2   0.2               U2    0.1   0.4   0.5
U3    0.3   0.4   0.3               U3    0.6   0.1   0.3

Observation: RRGGBRGR

What is the corresponding state sequence?
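The two tables can be transcribed directly. A Python sketch (the initial urn probabilities 0.5, 0.3, 0.2 are not on this slide; they are taken, as an assumption, from the Viterbi tree slide later in the lecture):

```python
# Transition and observation tables from the urn example.
A = {  # A[Ui][Uj] = P(next urn = Uj | current urn = Ui)
    "U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
    "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
    "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3},
}
B = {  # B[Ui][c] = P(ball colour c | urn Ui)
    "U1": {"R": 0.3, "G": 0.5, "B": 0.2},
    "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
    "U3": {"R": 0.6, "G": 0.1, "B": 0.3},
}
pi = {"U1": 0.5, "U2": 0.3, "U3": 0.2}  # assumed initial urn probabilities

observation = "RRGGBRGR"

# Sanity check: every row is a probability distribution.
for row in list(A.values()) + list(B.values()) + [pi]:
    assert abs(sum(row.values()) - 1.0) < 1e-9
```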
Diagrammatic representation (1/2)

(Figure: state-transition diagram over U1, U2, U3 with the transition probabilities above on the arcs, e.g. U1 --0.4--> U2, and the emission probabilities attached to each state, e.g. R, 0.3 / G, 0.5 / B, 0.2 at U1.)
Diagrammatic representation (2/2)

(Figure: the same diagram with each arc labelled, per output colour, by the combined probability P(O_k|S_k) . P(S_{k+1}|S_k) — values such as R, 0.18; G, 0.03; B, 0.09.)
Observations and states

OBS:    O1  O2  O3  O4  O5  O6  O7  O8
        R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

S_i = U1/U2/U3; a particular state
S: State sequence
O: Observation sequence
S* = "best" possible state (urn) sequence
Goal: Maximize P(S|O) by choosing "best" S
Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S|O)

Bayes' Theorem

P(A|B) = P(A) . P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) . P(O|S)
State Transitions Probability

P(S) = P(S1-8)
     = P(S1) . P(S2|S1) . P(S3|S2,S1) . P(S4|S3,S2,S1) ... P(S8|S7,S6,...,S1)

By Markov Assumption (k=1):

P(S) = P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) ... P(S8|S7)
Observation Sequence Probability

P(O|S) = P(O1|S1-8) . P(O2|O1, S1-8) . P(O3|O1-2, S1-8) ... P(O8|O1-7, S1-8)

Assumption that ball drawn depends only on the Urn chosen:

P(O|S) = P(O1|S1) . P(O2|S2) . P(O3|S3) ... P(O8|S8)
P(S,O) = P(S) . P(O|S)
       = [P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) ... P(S8|S7)] .
         [P(O1|S1) . P(O2|S2) . P(O3|S3) ... P(O8|S8)]
Grouping terms

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

P(S).P(O|S)
= [P(O0|S0).P(S1|S0)] .
  [P(O1|S1).P(S2|S1)] .
  [P(O2|S2).P(S3|S2)] .
  [P(O3|S3).P(S4|S3)] .
  [P(O4|S4).P(S5|S4)] .
  [P(O5|S5).P(S6|S5)] .
  [P(O6|S6).P(S7|S6)] .
  [P(O7|S7).P(S8|S7)] .
  [P(O8|S8).P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8) = 1.
O0 is an ε-transition.
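The grouped product can be evaluated for any one candidate state sequence. A sketch using the urn example's tables (the candidate sequence below and the initial probabilities 0.5, 0.3, 0.2 are assumptions for illustration):

```python
# P(S).P(O|S) as a product of per-step factors [P(Ok|Sk).P(S_{k+1}|Sk)],
# with S0/S9 as artificial initial/final states and O0 the epsilon output.
A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
pi = {"U1": 0.5, "U2": 0.3, "U3": 0.2}   # P(S1|S0), an assumption

obs = "RRGGBRGR"
seq = ["U3", "U1", "U2", "U2", "U3", "U3", "U1", "U3"]  # hypothetical S1..S8

# k = 0 factor: P(O0|S0).P(S1|S0), with P(O0|S0) = 1 (epsilon)
p = pi[seq[0]]
# k = 1..7 factors: P(Ok|Sk).P(S_{k+1}|Sk)
for k in range(len(obs) - 1):
    p *= B[seq[k]][obs[k]] * A[seq[k]][seq[k + 1]]
# k = 8 factor: P(O8|S8).P(S9|S8), with P(S9|S8) = 1
p *= B[seq[-1]][obs[-1]]

assert 0.0 < p < 1.0   # a valid (tiny) joint probability
```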
Introducing useful notation

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

(Figure: the chain S0 --ε--> S1 --R--> S2 --R--> S3 --G--> S4 --G--> S5 --B--> S6 --R--> S7 --G--> S8 --R--> S9)

P(Ok|Sk) . P(Sk+1|Sk) = P(Sk --Ok--> Sk+1)
Viterbi Algorithm for the Urn problem (first two symbols)

(Figure: the Viterbi tree. S0 branches on ε to U1, U2, U3 with probabilities 0.5, 0.3, 0.2; each of these branches on R to U1, U2, U3 again, the arcs carrying the products P(R|Ui).P(Uj|Ui). The leaf probabilities as printed read: 0.015, 0.04, 0.075*, 0.018, 0.006, 0.006, 0.048*, 0.036.)

*: winner sequences
Markov process of order > 1 (say 2)

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

Same theory works.

P(S).P(O|S)
= P(O0|S0).P(S1|S0).
  [P(O1|S1).P(S2|S1S0)].
  [P(O2|S2).P(S3|S2S1)].
  [P(O3|S3).P(S4|S3S2)].
  [P(O4|S4).P(S5|S4S3)].
  [P(O5|S5).P(S6|S5S4)].
  [P(O6|S6).P(S7|S6S5)].
  [P(O7|S7).P(S8|S7S6)].
  [P(O8|S8).P(S9|S8S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8S7) = 1.
O0 is an ε-transition.
Adjustments

Transition probability table will have tuples on rows and states on columns.
Output probability table will remain the same.
In the Viterbi tree, the Markov process will take effect from the 3rd input symbol (εRR).
There will be 27 leaves, out of which only 9 will remain.
Sequences ending in the same tuples will be compared.
Instead of U1, U2 and U3:
U1U1, U1U2, U1U3, U2U1, U2U2, U2U3, U3U1, U3U2, U3U3
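The "tuples on rows" adjustment can be sketched by building a transition table indexed by the previous two states. The numbers below are hypothetical (the lecture gives no order-2 table for the urn example); only the shape matters:

```python
# Order-2 transition table: rows are state pairs, columns are states.
from itertools import product

states = ["U1", "U2", "U3"]

# A2[(Ui, Uj)][Uk] = P(next = Uk | previous two states = Ui, Uj)
# -- a made-up uniform table, just to show the structure.
A2 = {pair: {s: 1.0 / len(states) for s in states}
      for pair in product(states, repeat=2)}

# 9 rows (tuples) x 3 columns (states).  In the Viterbi tree this gives
# 27 leaves per step, of which only 9 survive, since sequences ending in
# the same tuple are compared.
assert len(A2) == 9
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in A2.values())
```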
Probabilistic FSM

(Figure: a two-state probabilistic FSM over S1 and S2, each arc labelled (symbol : probability). The arc probabilities, as used in the tree development that follows, are:
    P(S1 --a1--> S1) = 0.1    P(S1 --a1--> S2) = 0.3
    P(S2 --a1--> S1) = 0.2    P(S2 --a1--> S2) = 0.3
    P(S1 --a2--> S1) = 0.2    P(S1 --a2--> S2) = 0.4
    P(S2 --a2--> S1) = 0.3    P(S2 --a2--> S2) = 0.2)

The question here is:
"what is the most likely state sequence given the output sequence seen"
Developing the tree

Start: S1 = 1.0, S2 = 0.0

ε:  S1 = 1.0, S2 = 0.0

a1: to S1: 1.0*0.1 = 0.1 (from S1), 0.0*0.2 = 0.0 (from S2)
    to S2: 1.0*0.3 = 0.3 (from S1), 0.0*0.3 = 0.0 (from S2)

a2: to S1: 0.1*0.2 = 0.02 (from S1), 0.3*0.3 = 0.09 (from S2)
    to S2: 0.1*0.4 = 0.04 (from S1), 0.3*0.2 = 0.06 (from S2)

Choose the winning sequence per state per iteration.
Tree structure contd…

After a1-a2, the winners are: S1 = 0.09, S2 = 0.06

a1: to S1: 0.09*0.1 = 0.009 (from S1), 0.06*0.2 = 0.012 (from S2)
    to S2: 0.09*0.3 = 0.027 (from S1), 0.06*0.3 = 0.018 (from S2)

a2: to S1: 0.012*0.2 = 0.0024 (from S1), 0.027*0.3 = 0.0081 (from S2)
    to S2: 0.012*0.4 = 0.0048 (from S1), 0.027*0.2 = 0.0054 (from S2)
The problem being addressed by this tree is S* = argmax_s P(S | a1-a2-a1-a2, μ)

a1-a2-a1-a2 is the output sequence and μ the model or the machine.

Path found (working backward):
S1 --a1--> S2 --a2--> S1 --a1--> S2 --a2--> S1
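The tree development can be replayed mechanically. A Python sketch (the combined arc probabilities P(Si --a--> Sj) are those read off the tree slides above, e.g. P(S1 --a1--> S2) = 0.3):

```python
# Replaying the Viterbi tree for output a1 a2 a1 a2 on the two-state PFSM.
T = {  # T[s][a][s2] = P(s --a--> s2)
    "S1": {"a1": {"S1": 0.1, "S2": 0.3}, "a2": {"S1": 0.2, "S2": 0.4}},
    "S2": {"a1": {"S1": 0.2, "S2": 0.3}, "a2": {"S1": 0.3, "S2": 0.2}},
}

score = {"S1": 1.0, "S2": 0.0}  # after the epsilon start
back = []                        # per step: winning predecessor of each state

for a in ["a1", "a2", "a1", "a2"]:
    new_score, new_back = {}, {}
    for s2 in ["S1", "S2"]:
        cands = {s1: score[s1] * T[s1][a][s2] for s1 in ["S1", "S2"]}
        best_pred = max(cands, key=cands.get)
        new_score[s2] = cands[best_pred]   # winner per state per iteration
        new_back[s2] = best_pred
    score = new_score
    back.append(new_back)

# Recover the path by working backward from the final winner.
last = max(score, key=score.get)
path = [last]
for step in reversed(back):
    path.append(step[path[-1]])
path.reverse()

assert abs(score[last] - 0.0081) < 1e-12           # final winner probability
assert path == ["S1", "S2", "S1", "S2", "S1"]      # path found on the slide
```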
Problem statement: Find the best possible sequence

S* = argmax_S P(S | O, μ)

where   S → State Seq,   O → Output Seq,   μ → Model or Machine

Model or Machine = {S0, S, A, T}
    S0 : Start symbol
    S  : State collection
    A  : Alphabet set
    T  : Transitions, defined as P(Si --ak--> Sj) ∀ i, j, k

Tabular representation of the tree

                                    Latest symbol observed
Ending state    ε      a1                             a2              a1               a2
S1              1.0    (1.0*0.1, 0.0*0.2) = (0.1,0.0) (0.02, 0.09)    (0.009, 0.012)   (0.0024, 0.0081)
S2              0.0    (1.0*0.3, 0.0*0.3) = (0.3,0.0) (0.04, 0.06)    (0.027, 0.018)   (0.0048, 0.0054)

Note: Every cell records the winning probability ending in that state; the bold-faced value in each cell is the winning sequence probability ending in that state. The final winner is 0.0081 (row S1); its predecessor is S2, indicated by the winner being the 2nd element of its tuple. Going backward from the final winner, we recover the sequence.
Algorithm
(following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
    a. Start State: S1
    b. Alphabet: A = {a1, a2, … ap}
    c. Set of States: S = {S1, S2, … Sn}
    d. Transition probability P(Si --ak--> Sj) ∀ i, j, k, which is equal to P(Sj, ak | Si)
2. The output string a1 a2 … aT

To find:
The most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [P(C | a1, a2, … aT, μ)]
Algorithm contd…

Data Structures:
1. An N*T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
2. Another N*T array called BACKPTR to recover the path.

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

1. Initialization
SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For(i = 2 to N) do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration
For(t = 2 to T) do
    For(i = 1 to N) do
        SEQSCORE(i,t) = Max(j=1..N) [SEQSCORE(j, t-1) * P(Sj --ak--> Si)]
        BACKPTR(i,t) = index j that gives the MAX above

3. Sequence Identification
C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1*T
2. SEQSCORE can be N*2 (only the current and previous columns are needed)
Homework: Compare this with A* and Beam Search.

Reason for this comparison: both of them work for finding and recovering sequences.