HMM
An Example: Choosing Colored balls from 3 urns
• URN 1 • URN 2 • URN 3
Probability of transition from one urn to another (row = current urn, column = next urn)
        URN 1  URN 2  URN 3
URN 1    0.1    0.4    0.5
URN 2    0.6    0.2    0.2
URN 3    0.3    0.4    0.3
• URN 1: Red = 30, Green = 50, Blue = 20
• URN 2: Red = 10, Green = 40, Blue = 50
• URN 3: Red = 60, Green = 10, Blue = 30
Probability of drawing a ball (row = urn, column = ball color)
        R    G    B
URN 1  0.3  0.5  0.2
URN 2  0.1  0.4  0.5
URN 3  0.6  0.1  0.3
[Figure: state-transition diagram over U1, U2, U3 annotated with the transition probabilities and the per-urn ball-drawing probabilities (R, G, B); a companion diagram shows the combined transition-and-emission probability on each arc.]
Probability of drawing a ball (B)
        R    G    B
URN 1  0.3  0.5  0.2
URN 2  0.1  0.4  0.5
URN 3  0.6  0.1  0.3

Probability of transition from one urn to another (A)
        URN 1  URN 2  URN 3
URN 1    0.1    0.4    0.5
URN 2    0.6    0.2    0.2
URN 3    0.3    0.4    0.3
Observation and state sequence
Obs.:   o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8  =  R  R  G  G  B  R  G  R
States: q_1 q_2 q_3 q_4 q_5 q_6 q_7 q_8,  where each q_i ∈ {U1, U2, U3} (any particular state)
Objective: to find the best possible state sequence, i.e., the Q* that maximizes P(Q | O) over all choices of Q.
Goal
Obs.:   o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8  =  R  R  G  G  B  R  G  R
States: q_1 q_2 q_3 q_4 q_5 q_6 q_7 q_8

Bayes’ Theorem
Probability: Observation Sequence
Markov Assumption
• The Markov assumption states that the probability of the occurrence of word w_i at time t depends only on the occurrence of word w_{i-1} at time t-1.
  – Chain rule: P(w_1, …, w_n) = ∏_{i=2}^{n} P(w_i | w_1, …, w_{i-1})
  – Markov assumption: P(w_1, …, w_n) ≈ ∏_{i=2}^{n} P(w_i | w_{i-1})
The Trellis
Parameters of an HMM
• States: a set of states S = {s_1, …, s_N}
• Transition probabilities: A = {a_{i,j}}, where each a_{i,j} represents the probability of transitioning from state s_i to state s_j.
• Emission probabilities: a set B of functions of the form b_i(o_t), which is the probability of observation o_t being emitted by s_i.
• Initial state distribution: π_i is the probability that s_i is a start state.
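As a concrete illustration, the urn example above can be written down as an HMM parameter set. This is a minimal sketch only; the variable names (states, symbols, A, B, pi, obs) are my own choices, and the uniform initial distribution is an assumption, since the slides do not give one.

```python
import numpy as np

# The three-urn HMM from the example above, encoded as numpy arrays.
states = ["URN1", "URN2", "URN3"]
symbols = ["R", "G", "B"]

# A[i, j]: probability of moving from urn i to urn j (table A above)
A = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])

# B[i, k]: probability of drawing ball color k from urn i (table B above)
B = np.array([[0.3, 0.5, 0.2],
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])

# pi[i]: probability of starting in urn i (uniform here as an assumption;
# the slides do not specify an initial distribution)
pi = np.array([1/3, 1/3, 1/3])

# The observation sequence used later in the slides: R R G G B R G R
obs = [symbols.index(c) for c in "RRGGBRGR"]
```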
The Three Basic HMM Problems
• Problem 1 (Evaluation): Given the observation sequence O = o_1, …, o_T and an HMM model λ = (A, B, π), how do we compute P(O | λ), the probability of O given the model?
• Problem 2 (Decoding): Given the observation sequence O = o_1, …, o_T and an HMM model λ = (A, B, π), how do we find the state sequence that best explains the observations?
• Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
Problem 1: Probability of an Observation Sequence
• What is P(O | λ)?
• The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
• Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences.
• Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
• The solution to this and Problem 2 is to use dynamic programming.
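To make the cost concrete, here is a brute-force sketch of the naïve computation: enumerate all N^T state sequences and sum their joint probabilities. It is only feasible for tiny T, assumes the A, B, pi, obs arrays from the urn sketch above, and is meant only to motivate the forward algorithm.

```python
from itertools import product

def naive_evaluation(obs, A, B, pi):
    """P(O | lambda) by summing over all N**T state sequences (exponential cost)."""
    N, T = A.shape[0], len(obs)
    total = 0.0
    for path in product(range(N), repeat=T):      # all N**T state sequences
        p = pi[path[0]] * B[path[0], obs[0]]      # start state and first emission
        for t in range(1, T):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

# e.g. naive_evaluation(obs, A, B, pi) with the urn arrays defined earlier
```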
Forward Probabilities
• What is the probability that, given an HMM λ, at time t the state is i and the partial observation o_1 … o_t has been generated?
  α_t(i) = P(o_1 … o_t, q_t = s_i | λ)
Forward Probabilities
α_t(j) = [Σ_{i=1}^{N} α_{t-1}(i) a_{ij}] b_j(o_t)
α_t(i) = P(o_1 … o_t, q_t = s_i | λ)
Forward Algorithm
• Initialization: α_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N
• Induction: α_t(j) = [Σ_{i=1}^{N} α_{t-1}(i) a_{ij}] b_j(o_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N
• Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i)
Forward Algorithm Complexity
• In the naïve approach to solving Problem 1, it takes on the order of 2T·N^T computations.
• The forward algorithm takes on the order of N²·T computations.
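A minimal sketch of the forward recursion, written against the numpy arrays from the urn sketch above (the function name and the vectorized formulation are my own, not from the slides):

```python
import numpy as np

def forward(obs, A, B, pi):
    """alpha[t, i] = P(o_1..o_t, q_t = s_i | lambda); returns (alpha, P(O | lambda))."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization: alpha_1(i) = pi_i b_i(o_1)
    for t in range(1, T):                          # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                  # termination: sum_i alpha_T(i)

# e.g. alpha, likelihood = forward(obs, A, B, pi)  # O(N^2 T) work instead of O(2T N^T)
```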
Backward Probabilities
• Analogous to the forward probability, just in the other direction
• What is the probability that, given an HMM λ and given that the state at time t is i, the partial observation o_{t+1} … o_T is generated?
  β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ)
Backward Probabilities
β_t(i) = Σ_{j=1}^{N} a_{ij} b_j(o_{t+1}) β_{t+1}(j)
β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ)
Backward Algorithm
• Initialization: β_T(i) = 1,  1 ≤ i ≤ N
• Induction: β_t(i) = Σ_{j=1}^{N} a_{ij} b_j(o_{t+1}) β_{t+1}(j),  t = T-1, …, 1,  1 ≤ i ≤ N
• Termination: P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i)
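A matching sketch of the backward recursion, under the same assumptions as the forward sketch above:

```python
import numpy as np

def backward(obs, A, B, pi):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = s_i, lambda); returns (beta, P(O | lambda))."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                 # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                 # induction, t = T-1, ..., 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = np.sum(pi * B[:, obs[0]] * beta[0])   # termination
    return beta, likelihood

# forward(...) and backward(...) should return the same likelihood, a useful sanity check.
```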
Problem 2: Decoding
• The solution to Problem 1 (Evaluation) gives us the sum of the probabilities of all paths through an HMM efficiently.
• For Problem 2, we want to find the path with the highest probability.
• We want to find the state sequence Q = q_1 … q_T such that
  Q = argmax_{Q'} P(Q' | O, λ)
Viterbi Algorithm
• Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.
• Forward: α_t(j) = [Σ_{i=1}^{N} α_{t-1}(i) a_{ij}] b_j(o_t)
• Viterbi recursion: δ_t(j) = [max_{1≤i≤N} δ_{t-1}(i) a_{ij}] b_j(o_t)
Viterbi Algorithm
• Initialization: δ_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N;  ψ_1(i) = 0
• Induction: δ_t(j) = [max_{1≤i≤N} δ_{t-1}(i) a_{ij}] b_j(o_t);  ψ_t(j) = argmax_{1≤i≤N} δ_{t-1}(i) a_{ij},  2 ≤ t ≤ T, 1 ≤ j ≤ N
• Termination: p* = max_{1≤i≤N} δ_T(i);  q_T* = argmax_{1≤i≤N} δ_T(i)
• Read out path: q_t* = ψ_{t+1}(q_{t+1}*),  t = T-1, …, 1
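A minimal sketch of the Viterbi recursion with backpointers, again assuming the numpy arrays from the urn sketch (state indices 0..N-1 stand for URN 1..URN 3):

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Best state sequence argmax_Q P(Q | O, lambda) and its joint probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))                       # delta[t, j]: best score of a path ending in j
    psi = np.zeros((T, N), dtype=int)              # psi[t, j]: best predecessor of j at time t
    delta[0] = pi * B[:, obs[0]]                   # initialization (psi[0] stays 0)
    for t in range(1, T):                          # induction
        scores = delta[t - 1][:, None] * A         # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = np.zeros(T, dtype=int)                  # termination and path read-out
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()

# e.g. best_path, best_prob = viterbi(obs, A, B, pi)
```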
Problem 3: Learning
• Up to now we’ve assumed that we know the underlying model λ = (A, B, π).
• Often these parameters are estimated on annotated training data, which has two drawbacks:
  – Annotation is difficult and/or expensive
  – Training data is different from the current data
• We want to maximize the parameters with respect to the current data, i.e., we’re looking for a model λ' such that λ' = argmax_λ P(O | λ)
Problem 3: Learning
• Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ' such that λ' = argmax_λ P(O | λ).
• But it is possible to find a local maximum.
• Given an initial model λ, we can always find a model λ' such that P(O | λ') ≥ P(O | λ).
Parameter Re-estimation
• Use the forward-backward (or Baum-Welch) algorithm.
• Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters.
Parameter Re-estimation
• Three parameters need to be re-estimated:
  – Initial state distribution: π_i
  – Transition probabilities: a_{i,j}
  – Emission probabilities: b_i(o_t)

Re-estimating Transition Probabilities
• What’s the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?
  ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
Re-estimating Transition Probabilities
ξ_t(i, j) = [α_t(i) a_{i,j} b_j(o_{t+1}) β_{t+1}(j)] / [Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_{i,j} b_j(o_{t+1}) β_{t+1}(j)]
ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
Re-estimating Transition Probabilities
• The intuition behind the re-estimation equation for transition probabilities is
  â_{i,j} = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i)
• Formally:
  â_{i,j} = [Σ_{t=1}^{T-1} ξ_t(i, j)] / [Σ_{t=1}^{T-1} Σ_{j'=1}^{N} ξ_t(i, j')]
Re-estimating Transition Probabilities
• Defining
  γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)
  as the probability of being in state s_i, given the complete observation O,
• we can say:
  â_{i,j} = [Σ_{t=1}^{T-1} ξ_t(i, j)] / [Σ_{t=1}^{T-1} γ_t(i)]

Review of Probabilities
• Forward probability α_t(i): the probability of being in state s_i, given the partial observation o_1, …, o_t
• Backward probability β_t(i): the probability of being in state s_i, given the partial observation o_{t+1}, …, o_T
• Transition probability ξ_t(i, j): the probability of going from state s_i to state s_j, given the complete observation o_1, …, o_T
• State probability γ_t(i): the probability of being in state s_i, given the complete observation o_1, …, o_T
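The four quantities above can be computed together, as in the following sketch; it assumes the illustrative forward/backward functions from the earlier sketches and uses the fact that the double sum in the denominator of ξ_t(i, j) equals P(O | λ):

```python
import numpy as np

def xi_gamma(obs, A, B, alpha, beta):
    """xi[t, i, j] = P(q_t = s_i, q_{t+1} = s_j | O, lambda); gamma[t, i] = sum_j xi[t, i, j]."""
    N, T = A.shape[0], len(obs)
    likelihood = alpha[-1].sum()                   # P(O | lambda), equal to the double sum
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        emit = B[:, obs[t + 1]]                    # b_j(o_{t+1}) for each state j
        xi[t] = alpha[t][:, None] * A * emit[None, :] * beta[t + 1][None, :]
        xi[t] /= likelihood
    gamma = xi.sum(axis=2)                         # gamma_t(i) for t = 1, ..., T-1
    return xi, gamma
```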
Re-estimating Initial State Probabilities
• Initial state distribution: π_i is the probability that s_i is a start state.
• Re-estimation is easy:
  π̂_i = expected number of times in state s_i at time 1
• Formally: π̂_i = γ_1(i)
Re-estimation of Emission Probabilities
• Emission probabilities are re-estimated as
  b̂_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i)
• Formally:
  b̂_i(k) = [Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i)] / [Σ_{t=1}^{T} γ_t(i)]
  where δ(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise.
• Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm!
The Updated Model
• Coming from λ = (A, B, π), we get to λ' = (Â, B̂, π̂) by the following update rules:
  π̂_i = γ_1(i)
  â_{i,j} = [Σ_{t=1}^{T-1} ξ_t(i, j)] / [Σ_{t=1}^{T-1} γ_t(i)]
  b̂_i(k) = [Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i)] / [Σ_{t=1}^{T} γ_t(i)]
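Putting the three update rules together, a single re-estimation step could look like the following sketch; it reuses the illustrative forward, backward, and xi_gamma helpers from the earlier sketches (all of which are my own constructions, not part of the slides):

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One forward-backward (Baum-Welch) re-estimation of (A, B, pi)."""
    alpha, likelihood = forward(obs, A, B, pi)     # E step: forward probabilities
    beta, _ = backward(obs, A, B, pi)              #         backward probabilities
    xi, gamma = xi_gamma(obs, A, B, alpha, beta)

    # M step: the update rules above
    new_pi = gamma[0]                                           # pi_hat_i = gamma_1(i)
    new_A = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]         # a_hat_ij = sum_t xi / sum_t gamma
    full_gamma = alpha * beta / likelihood                      # gamma_t(i) for all t, incl. t = T
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array(obs) == k                               # delta(o_t, v_k)
        new_B[:, k] = full_gamma[mask].sum(axis=0) / full_gamma.sum(axis=0)
    return new_A, new_B, new_pi

# Iterating baum_welch_step raises (or keeps) P(O | lambda) at every step.
```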
Expectation Maximization
• The forward-backward algorithm is an instance of the more general EM algorithm.
  – The E step: compute the forward and backward probabilities for a given model.
  – The M step: re-estimate the model parameters.