HMM
An Example: Choosing Colored balls from 3 urns
• URN 1 • URN 2 • URN 3
Probability of transition from one urn to another (row = current urn, column = next urn)
        URN 1  URN 2  URN 3
URN 1    0.1    0.4    0.5
URN 2    0.6    0.2    0.2
URN 3    0.3    0.4    0.3
• URN 1: Red = 30, Green = 50, Blue = 20
• URN 2: Red = 10, Green = 40, Blue = 50
• URN 3: Red = 60, Green = 10, Blue = 30
Probability of drawing a ball (row = urn, column = ball color)
        R    G    B
URN 1  0.3  0.5  0.2
URN 2  0.1  0.4  0.5
URN 3  0.6  0.1  0.3
[Figure: state-transition diagram over U1, U2, U3 annotated with the transition probabilities and the per-urn ball-drawing probabilities (R, G, B); a companion diagram shows the combined transition-and-emission probability on each arc.]
Probability of drawing a ball (B)
        R    G    B
URN 1  0.3  0.5  0.2
URN 2  0.1  0.4  0.5
URN 3  0.6  0.1  0.3

Probability of transition from one urn to another (A)
        URN 1  URN 2  URN 3
URN 1    0.1    0.4    0.5
URN 2    0.6    0.2    0.2
URN 3    0.3    0.4    0.3
Observation and state sequence
Obs.:   o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8  =  R  R  G  G  B  R  G  R
States: q_1 q_2 q_3 q_4 q_5 q_6 q_7 q_8,  where each q_i ∈ {U1, U2, U3} (any particular state)
Objective: to find the best possible state sequence, i.e., the Q* that maximizes P(Q | O) over all choices of Q.
Goal
Obs.:   o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8  =  R  R  G  G  B  R  G  R
States: q_1 q_2 q_3 q_4 q_5 q_6 q_7 q_8

Bayes’ Theorem
Probability: Observation Sequence
Markov Assumption
• The Markov assumption states that the probability of the occurrence of word w_i at time t depends only on the occurrence of word w_{i-1} at time t-1.
  – Chain rule: P(w_1, …, w_n) = ∏_{i=2}^{n} P(w_i | w_1, …, w_{i-1})
  – Markov assumption: P(w_1, …, w_n) ≈ ∏_{i=2}^{n} P(w_i | w_{i-1})
The Trellis
Parameters of an HMM
• States: a set of states S = {s_1, …, s_N}
• Transition probabilities: A = {a_{i,j}}, where each a_{i,j} represents the probability of transitioning from state s_i to state s_j.
• Emission probabilities: a set B of functions of the form b_i(o_t), which is the probability of observation o_t being emitted by s_i.
• Initial state distribution: π_i is the probability that s_i is a start state.
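As a concrete illustration, the urn example above can be written down as an HMM parameter set. This is a minimal sketch only; the variable names (states, symbols, A, B, pi, obs) are my own choices, and the uniform initial distribution is an assumption, since the slides do not give one.

```python
import numpy as np

# The three-urn HMM from the example above, encoded as numpy arrays.
states = ["URN1", "URN2", "URN3"]
symbols = ["R", "G", "B"]

# A[i, j]: probability of moving from urn i to urn j (table A above)
A = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])

# B[i, k]: probability of drawing ball color k from urn i (table B above)
B = np.array([[0.3, 0.5, 0.2],
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])

# pi[i]: probability of starting in urn i (uniform here as an assumption;
# the slides do not specify an initial distribution)
pi = np.array([1/3, 1/3, 1/3])

# The observation sequence used later in the slides: R R G G B R G R
obs = [symbols.index(c) for c in "RRGGBRGR"]
```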
The Three Basic HMM Problems
• Problem 1 (Evaluation): Given the observation sequence O = o_1, …, o_T and an HMM model λ = (A, B, π), how do we compute P(O | λ), the probability of O given the model?
• Problem 2 (Decoding): Given the observation sequence O = o_1, …, o_T and an HMM model λ = (A, B, π), how do we find the state sequence that best explains the observations?
• Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
Problem 1: Probability of an Observation Sequence
• What is P(O | λ)?
• The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
• Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences.
• Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
• The solution to this and Problem 2 is to use dynamic programming.
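To make the cost concrete, here is a brute-force sketch of the naïve computation: enumerate all N^T state sequences and sum their joint probabilities. It is only feasible for tiny T, assumes the A, B, pi, obs arrays from the urn sketch above, and is meant only to motivate the forward algorithm.

```python
from itertools import product

def naive_evaluation(obs, A, B, pi):
    """P(O | lambda) by summing over all N**T state sequences (exponential cost)."""
    N, T = A.shape[0], len(obs)
    total = 0.0
    for path in product(range(N), repeat=T):      # all N**T state sequences
        p = pi[path[0]] * B[path[0], obs[0]]      # start state and first emission
        for t in range(1, T):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

# e.g. naive_evaluation(obs, A, B, pi) with the urn arrays defined earlier
```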
Forward Probabilities
• What is the probability that, given an HMM λ, at time t the state is i and the partial observation o_1 … o_t has been generated?
  α_t(i) = P(o_1 … o_t, q_t = s_i | λ)
Forward Probabilities
α_t(j) = [Σ_{i=1}^{N} α_{t-1}(i) a_{ij}] b_j(o_t)
α_t(i) = P(o_1 … o_t, q_t = s_i | λ)
Forward Algorithm
• Initialization: α_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N
• Induction: α_t(j) = [Σ_{i=1}^{N} α_{t-1}(i) a_{ij}] b_j(o_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N
• Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i)
Forward Algorithm Complexity
• In the naïve approach to solving Problem 1, it takes on the order of 2T·N^T computations.
• The forward algorithm takes on the order of N²·T computations.
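A minimal sketch of the forward recursion, written against the numpy arrays from the urn sketch above (the function name and the vectorized formulation are my own, not from the slides):

```python
import numpy as np

def forward(obs, A, B, pi):
    """alpha[t, i] = P(o_1..o_t, q_t = s_i | lambda); returns (alpha, P(O | lambda))."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization: alpha_1(i) = pi_i b_i(o_1)
    for t in range(1, T):                          # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                  # termination: sum_i alpha_T(i)

# e.g. alpha, likelihood = forward(obs, A, B, pi)  # O(N^2 T) work instead of O(2T N^T)
```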
Backward Probabilities
• Analogous to the forward probability, just in the other direction
• What is the probability that, given an HMM λ and given that the state at time t is i, the partial observation o_{t+1} … o_T is generated?
  β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ)
Backward Probabilities
β_t(i) = Σ_{j=1}^{N} a_{ij} b_j(o_{t+1}) β_{t+1}(j)
β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ)
Backward Algorithm
• Initialization: β_T(i) = 1,  1 ≤ i ≤ N
• Induction: β_t(i) = Σ_{j=1}^{N} a_{ij} b_j(o_{t+1}) β_{t+1}(j),  t = T-1, …, 1,  1 ≤ i ≤ N
• Termination: P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i)
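A matching sketch of the backward recursion, under the same assumptions as the forward sketch above:

```python
import numpy as np

def backward(obs, A, B, pi):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = s_i, lambda); returns (beta, P(O | lambda))."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                 # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                 # induction, t = T-1, ..., 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = np.sum(pi * B[:, obs[0]] * beta[0])   # termination
    return beta, likelihood

# forward(...) and backward(...) should return the same likelihood, a useful sanity check.
```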
Problem 2: Decoding
• The solution to Problem 1 (Evaluation) gives us the sum of the probabilities of all paths through an HMM efficiently.
• For Problem 2, we want to find the path with the highest probability.
• We want to find the state sequence Q = q_1 … q_T such that
  Q = argmax_{Q'} P(Q' | O, λ)
Viterbi Algorithm
• Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.
• Forward: α_t(j) = [Σ_{i=1}^{N} α_{t-1}(i) a_{ij}] b_j(o_t)
• Viterbi recursion: δ_t(j) = [max_{1≤i≤N} δ_{t-1}(i) a_{ij}] b_j(o_t)
Viterbi Algorithm
• Initialization: δ_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N;  ψ_1(i) = 0
• Induction: δ_t(j) = [max_{1≤i≤N} δ_{t-1}(i) a_{ij}] b_j(o_t);  ψ_t(j) = argmax_{1≤i≤N} δ_{t-1}(i) a_{ij},  2 ≤ t ≤ T, 1 ≤ j ≤ N
• Termination: p* = max_{1≤i≤N} δ_T(i);  q_T* = argmax_{1≤i≤N} δ_T(i)
• Read out path: q_t* = ψ_{t+1}(q_{t+1}*),  t = T-1, …, 1
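A minimal sketch of the Viterbi recursion with backpointers, again assuming the numpy arrays from the urn sketch (state indices 0..N-1 stand for URN 1..URN 3):

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Best state sequence argmax_Q P(Q | O, lambda) and its joint probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))                       # delta[t, j]: best score of a path ending in j
    psi = np.zeros((T, N), dtype=int)              # psi[t, j]: best predecessor of j at time t
    delta[0] = pi * B[:, obs[0]]                   # initialization (psi[0] stays 0)
    for t in range(1, T):                          # induction
        scores = delta[t - 1][:, None] * A         # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = np.zeros(T, dtype=int)                  # termination and path read-out
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()

# e.g. best_path, best_prob = viterbi(obs, A, B, pi)
```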
Problem 3: Learning
• Up to now we’ve assumed that we know the underlying model λ = (A, B, π).
• Often these parameters are estimated on annotated training data, which has two drawbacks:
  – Annotation is difficult and/or expensive
  – Training data is different from the current data
• We want to maximize the parameters with respect to the current data, i.e., we’re looking for a model λ' such that λ' = argmax_λ P(O | λ)
Problem 3: Learning
• Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ' such that λ' = argmax_λ P(O | λ).
• But it is possible to find a local maximum.
• Given an initial model λ, we can always find a model λ' such that P(O | λ') ≥ P(O | λ).
Parameter Re-estimation
• Use the forward-backward (or Baum-Welch) algorithm.
• Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters.
Parameter Re-estimation
• Three parameters need to be re-estimated:
  – Initial state distribution: π_i
  – Transition probabilities: a_{i,j}
  – Emission probabilities: b_i(o_t)

Re-estimating Transition Probabilities
• What’s the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?
  ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
Re-estimating Transition Probabilities
ξ_t(i, j) = [α_t(i) a_{i,j} b_j(o_{t+1}) β_{t+1}(j)] / [Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_{i,j} b_j(o_{t+1}) β_{t+1}(j)]
ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
Re-estimating Transition Probabilities
• The intuition behind the re-estimation equation for transition probabilities is
  â_{i,j} = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i)
• Formally:
  â_{i,j} = [Σ_{t=1}^{T-1} ξ_t(i, j)] / [Σ_{t=1}^{T-1} Σ_{j'=1}^{N} ξ_t(i, j')]
Re-estimating Transition Probabilities
• Defining
  γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)
  as the probability of being in state s_i, given the complete observation O,
• we can say:
  â_{i,j} = [Σ_{t=1}^{T-1} ξ_t(i, j)] / [Σ_{t=1}^{T-1} γ_t(i)]

Review of Probabilities
• Forward probability α_t(i): the probability of being in state s_i, given the partial observation o_1, …, o_t
• Backward probability β_t(i): the probability of being in state s_i, given the partial observation o_{t+1}, …, o_T
• Transition probability ξ_t(i, j): the probability of going from state s_i to state s_j, given the complete observation o_1, …, o_T
• State probability γ_t(i): the probability of being in state s_i, given the complete observation o_1, …, o_T
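The four quantities above can be computed together, as in the following sketch; it assumes the illustrative forward/backward functions from the earlier sketches and uses the fact that the double sum in the denominator of ξ_t(i, j) equals P(O | λ):

```python
import numpy as np

def xi_gamma(obs, A, B, alpha, beta):
    """xi[t, i, j] = P(q_t = s_i, q_{t+1} = s_j | O, lambda); gamma[t, i] = sum_j xi[t, i, j]."""
    N, T = A.shape[0], len(obs)
    likelihood = alpha[-1].sum()                   # P(O | lambda), equal to the double sum
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        emit = B[:, obs[t + 1]]                    # b_j(o_{t+1}) for each state j
        xi[t] = alpha[t][:, None] * A * emit[None, :] * beta[t + 1][None, :]
        xi[t] /= likelihood
    gamma = xi.sum(axis=2)                         # gamma_t(i) for t = 1, ..., T-1
    return xi, gamma
```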
Re-estimating Initial State Probabilities
• Initial state distribution: π_i is the probability that s_i is a start state.
• Re-estimation is easy:
  π̂_i = expected number of times in state s_i at time 1
• Formally: π̂_i = γ_1(i)
Re-estimation of Emission Probabilities
• Emission probabilities are re-estimated as
  b̂_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i)
• Formally:
  b̂_i(k) = [Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i)] / [Σ_{t=1}^{T} γ_t(i)]
  where δ(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise.
• Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm!
The Updated Model
• Coming from λ = (A, B, π), we get to λ' = (Â, B̂, π̂) by the following update rules:
  π̂_i = γ_1(i)
  â_{i,j} = [Σ_{t=1}^{T-1} ξ_t(i, j)] / [Σ_{t=1}^{T-1} γ_t(i)]
  b̂_i(k) = [Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i)] / [Σ_{t=1}^{T} γ_t(i)]
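Putting the three update rules together, a single re-estimation step could look like the following sketch; it reuses the illustrative forward, backward, and xi_gamma helpers from the earlier sketches (all of which are my own constructions, not part of the slides):

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One forward-backward (Baum-Welch) re-estimation of (A, B, pi)."""
    alpha, likelihood = forward(obs, A, B, pi)     # E step: forward probabilities
    beta, _ = backward(obs, A, B, pi)              #         backward probabilities
    xi, gamma = xi_gamma(obs, A, B, alpha, beta)

    # M step: the update rules above
    new_pi = gamma[0]                                           # pi_hat_i = gamma_1(i)
    new_A = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]         # a_hat_ij = sum_t xi / sum_t gamma
    full_gamma = alpha * beta / likelihood                      # gamma_t(i) for all t, incl. t = T
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array(obs) == k                               # delta(o_t, v_k)
        new_B[:, k] = full_gamma[mask].sum(axis=0) / full_gamma.sum(axis=0)
    return new_A, new_B, new_pi

# Iterating baum_welch_step raises (or keeps) P(O | lambda) at every step.
```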
Expectation Maximization
• The forward-backward algorithm is an instance of the more general EM algorithm.
  – The E step: compute the forward and backward probabilities for a given model.
  – The M step: re-estimate the model parameters.