CS626-460: Speech, NLP and the Web
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
Lectures 5 and 7: HMM and Viterbi
10th and 16th Jan, 2012
(Lecture 6 was on Computational Biomedicine research at Houston University by Prof. Ioannis)
HMM Definition

(Figure: the "NLP Trinity" — Languages (Hindi, Marathi, English, French) x Problems (Morph Analysis, POS Tagging, Parsing, Semantics) x Algorithms (HMM, CRF, MEMM))

Set of states: S, where |S| = N
Output alphabet: O, where |O| = K
Transition probabilities: A = {a_ij}
    a_ij = prob. of going from state S_i to state S_j
Emission probabilities: B = {b_pq}
    b_pq = prob. of outputting symbol O_q from state S_p
Initial state probabilities: π

λ = (A, B, π)
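The definition above can be written down concretely. A minimal Python sketch (not from the lecture; the 2-state, 3-symbol model and all numbers are made up for illustration):

```python
# A hypothetical HMM lambda = (A, B, pi): 2 states, 3 output symbols.
A = {                      # A[i][j]: prob. of going from state Si to state Sj
    "S1": {"S1": 0.7, "S2": 0.3},
    "S2": {"S1": 0.4, "S2": 0.6},
}
B = {                      # B[p][q]: prob. of outputting symbol Oq from state Sp
    "S1": {"a": 0.5, "b": 0.4, "c": 0.1},
    "S2": {"a": 0.1, "b": 0.3, "c": 0.6},
}
pi = {"S1": 0.6, "S2": 0.4}  # initial state probabilities

# Each row of A and B, and pi itself, must be a probability distribution.
for row in list(A.values()) + list(B.values()) + [pi]:
    assert abs(sum(row.values()) - 1.0) < 1e-9
```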
Markov Processes

Properties

Limited Horizon: given the previous k states, a state i is independent of the states before them:
    P(X_t = i | X_{t-1}, X_{t-2}, ..., X_0) = P(X_t = i | X_{t-1}, X_{t-2}, ..., X_{t-k})
This is an order-k Markov process.

Time invariance (shown for k = 1):
    P(X_t = i | X_{t-1} = j) = P(X_1 = i | X_0 = j) = ... = P(X_n = i | X_{n-1} = j)
Three basic problems (contd.)

Problem 1: Likelihood of a sequence
    Forward Procedure
    Backward Procedure
Problem 2: Best state sequence
    Viterbi Algorithm
Problem 3: Re-estimation
    Baum-Welch (Forward-Backward) Algorithm
Probabilistic Inference

O: Observation Sequence
S: State Sequence

Given O, find S* = argmax_S P(S|O); this is called probabilistic inference.

Infer "Hidden" from "Observed"

How is this inference different from logical inference based on propositional or predicate calculus?
Essentials of Hidden Markov Model

1. Markov + Naive Bayes
2. Uses both transition and observation probability:
    P(S_k --O_k--> S_{k+1}) = P(O_k | S_k) . P(S_{k+1} | S_k)
3. Effectively makes the Hidden Markov Model a Finite State Machine (FSM) with probability
Probability of Observation Sequence

P(O) = Σ_S P(O, S)
     = Σ_S P(S) . P(O|S)

Without any restriction, search space size = |S|^|O|

Continuing with the Urn example
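The |S|^|O| blow-up can be seen by brute-force enumeration. A sketch with a hypothetical 2-state, 2-symbol HMM (all numbers made up):

```python
# Brute-force P(O) = sum over ALL state sequences S of P(S) . P(O|S).
from itertools import product

states = ["S1", "S2"]
pi = {"S1": 0.6, "S2": 0.4}
A  = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
B  = {"S1": {"a": 0.8, "b": 0.2},  "S2": {"a": 0.3, "b": 0.7}}

O = ["a", "b", "a"]
total = 0.0
n_sequences = 0
for S in product(states, repeat=len(O)):   # |S|^|O| sequences
    p = pi[S[0]] * B[S[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[S[t - 1]][S[t]] * B[S[t]][O[t]]
    total += p
    n_sequences += 1

assert n_sequences == len(states) ** len(O)   # 2**3 = 8 sequences
assert 0.0 < total <= 1.0                     # total is P(O)
```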
Colored Ball choosing

            Urn 1   Urn 2   Urn 3
# of Red      30      10      60
# of Green    50      40      10
# of Blue     20      50      30
Example (contd.)

Given:

Transition Probability              Observation/output Probability
      U1    U2    U3                      R     G     B
U1    0.1   0.4   0.5               U1    0.3   0.5   0.2
U2    0.6   0.2   0.2               U2    0.1   0.4   0.5
U3    0.3   0.4   0.3               U3    0.6   0.1   0.3

Observation: RRGGBRGR

What is the corresponding state sequence?
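The two tables can be transcribed directly. A Python sketch (the initial urn probabilities 0.5, 0.3, 0.2 are not on this slide; they are taken, as an assumption, from the Viterbi tree slide later in the lecture):

```python
# Transition and observation tables from the urn example.
A = {  # A[Ui][Uj] = P(next urn = Uj | current urn = Ui)
    "U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
    "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
    "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3},
}
B = {  # B[Ui][c] = P(ball colour c | urn Ui)
    "U1": {"R": 0.3, "G": 0.5, "B": 0.2},
    "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
    "U3": {"R": 0.6, "G": 0.1, "B": 0.3},
}
pi = {"U1": 0.5, "U2": 0.3, "U3": 0.2}  # assumed initial urn probabilities

observation = "RRGGBRGR"

# Sanity check: every row is a probability distribution.
for row in list(A.values()) + list(B.values()) + [pi]:
    assert abs(sum(row.values()) - 1.0) < 1e-9
```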
Diagrammatic representation (1/2)

(Figure: state-transition diagram over U1, U2, U3 with the transition probabilities above on the arcs, e.g. U1 --0.4--> U2, and the emission probabilities attached to each state, e.g. R, 0.3 / G, 0.5 / B, 0.2 at U1.)
Diagrammatic representation (2/2)

(Figure: the same diagram with each arc labelled, per output colour, by the combined probability P(O_k|S_k) . P(S_{k+1}|S_k) — values such as R, 0.18; G, 0.03; B, 0.09.)
Observations and states

OBS:    O1  O2  O3  O4  O5  O6  O7  O8
        R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

S_i = U1/U2/U3; a particular state
S: State sequence
O: Observation sequence
S* = "best" possible state (urn) sequence
Goal: Maximize P(S|O) by choosing "best" S
Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S|O)

Bayes' Theorem

P(A|B) = P(A) . P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) . P(O|S)
State Transitions Probability

P(S) = P(S1-8)
     = P(S1) . P(S2|S1) . P(S3|S2,S1) . P(S4|S3,S2,S1) ... P(S8|S7,S6,...,S1)

By Markov Assumption (k=1):

P(S) = P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) ... P(S8|S7)
Observation Sequence Probability

P(O|S) = P(O1|S1-8) . P(O2|O1, S1-8) . P(O3|O1-2, S1-8) ... P(O8|O1-7, S1-8)

Assumption that ball drawn depends only on the Urn chosen:

P(O|S) = P(O1|S1) . P(O2|S2) . P(O3|S3) ... P(O8|S8)
P(S,O) = P(S) . P(O|S)
       = [P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) ... P(S8|S7)] .
         [P(O1|S1) . P(O2|S2) . P(O3|S3) ... P(O8|S8)]
Grouping terms

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

P(S).P(O|S)
= [P(O0|S0).P(S1|S0)] .
  [P(O1|S1).P(S2|S1)] .
  [P(O2|S2).P(S3|S2)] .
  [P(O3|S3).P(S4|S3)] .
  [P(O4|S4).P(S5|S4)] .
  [P(O5|S5).P(S6|S5)] .
  [P(O6|S6).P(S7|S6)] .
  [P(O7|S7).P(S8|S7)] .
  [P(O8|S8).P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8) = 1.
O0 is an ε-transition.
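The grouped product can be evaluated for any one candidate state sequence. A sketch using the urn example's tables (the candidate sequence below and the initial probabilities 0.5, 0.3, 0.2 are assumptions for illustration):

```python
# P(S).P(O|S) as a product of per-step factors [P(Ok|Sk).P(S_{k+1}|Sk)],
# with S0/S9 as artificial initial/final states and O0 the epsilon output.
A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
pi = {"U1": 0.5, "U2": 0.3, "U3": 0.2}   # P(S1|S0), an assumption

obs = "RRGGBRGR"
seq = ["U3", "U1", "U2", "U2", "U3", "U3", "U1", "U3"]  # hypothetical S1..S8

# k = 0 factor: P(O0|S0).P(S1|S0), with P(O0|S0) = 1 (epsilon)
p = pi[seq[0]]
# k = 1..7 factors: P(Ok|Sk).P(S_{k+1}|Sk)
for k in range(len(obs) - 1):
    p *= B[seq[k]][obs[k]] * A[seq[k]][seq[k + 1]]
# k = 8 factor: P(O8|S8).P(S9|S8), with P(S9|S8) = 1
p *= B[seq[-1]][obs[-1]]

assert 0.0 < p < 1.0   # a valid (tiny) joint probability
```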
Introducing useful notation

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

(Figure: the chain S0 --ε--> S1 --R--> S2 --R--> S3 --G--> S4 --G--> S5 --B--> S6 --R--> S7 --G--> S8 --R--> S9)

P(Ok|Sk) . P(Sk+1|Sk) = P(Sk --Ok--> Sk+1)
Viterbi Algorithm for the Urn problem (first two symbols)

(Figure: the Viterbi tree. S0 branches on ε to U1, U2, U3 with probabilities 0.5, 0.3, 0.2; each of these branches on R to U1, U2, U3 again, the arcs carrying the products P(R|Ui).P(Uj|Ui). The leaf probabilities as printed read: 0.015, 0.04, 0.075*, 0.018, 0.006, 0.006, 0.048*, 0.036.)

*: winner sequences
Markov process of order > 1 (say 2)

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

Same theory works.

P(S).P(O|S)
= P(O0|S0).P(S1|S0).
  [P(O1|S1).P(S2|S1S0)].
  [P(O2|S2).P(S3|S2S1)].
  [P(O3|S3).P(S4|S3S2)].
  [P(O4|S4).P(S5|S4S3)].
  [P(O5|S5).P(S6|S5S4)].
  [P(O6|S6).P(S7|S6S5)].
  [P(O7|S7).P(S8|S7S6)].
  [P(O8|S8).P(S9|S8S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8S7) = 1.
O0 is an ε-transition.
Adjustments

Transition probability table will have tuples on rows and states on columns.
Output probability table will remain the same.
In the Viterbi tree, the Markov process will take effect from the 3rd input symbol (εRR).
There will be 27 leaves, out of which only 9 will remain.
Sequences ending in the same tuples will be compared.
Instead of U1, U2 and U3:
U1U1, U1U2, U1U3, U2U1, U2U2, U2U3, U3U1, U3U2, U3U3
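The "tuples on rows" adjustment can be sketched by building a transition table indexed by the previous two states. The numbers below are hypothetical (the lecture gives no order-2 table for the urn example); only the shape matters:

```python
# Order-2 transition table: rows are state pairs, columns are states.
from itertools import product

states = ["U1", "U2", "U3"]

# A2[(Ui, Uj)][Uk] = P(next = Uk | previous two states = Ui, Uj)
# -- a made-up uniform table, just to show the structure.
A2 = {pair: {s: 1.0 / len(states) for s in states}
      for pair in product(states, repeat=2)}

# 9 rows (tuples) x 3 columns (states).  In the Viterbi tree this gives
# 27 leaves per step, of which only 9 survive, since sequences ending in
# the same tuple are compared.
assert len(A2) == 9
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in A2.values())
```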
Probabilistic FSM

(Figure: a two-state probabilistic FSM over S1 and S2, each arc labelled (symbol : probability). The arc probabilities, as used in the tree development that follows, are:
    P(S1 --a1--> S1) = 0.1    P(S1 --a1--> S2) = 0.3
    P(S2 --a1--> S1) = 0.2    P(S2 --a1--> S2) = 0.3
    P(S1 --a2--> S1) = 0.2    P(S1 --a2--> S2) = 0.4
    P(S2 --a2--> S1) = 0.3    P(S2 --a2--> S2) = 0.2)

The question here is:
"what is the most likely state sequence given the output sequence seen"
Developing the tree

Start: S1 = 1.0, S2 = 0.0

ε:  S1 = 1.0, S2 = 0.0

a1: to S1: 1.0*0.1 = 0.1 (from S1), 0.0*0.2 = 0.0 (from S2)
    to S2: 1.0*0.3 = 0.3 (from S1), 0.0*0.3 = 0.0 (from S2)

a2: to S1: 0.1*0.2 = 0.02 (from S1), 0.3*0.3 = 0.09 (from S2)
    to S2: 0.1*0.4 = 0.04 (from S1), 0.3*0.2 = 0.06 (from S2)

Choose the winning sequence per state per iteration.
Tree structure contd…

After a1-a2, the winners are: S1 = 0.09, S2 = 0.06

a1: to S1: 0.09*0.1 = 0.009 (from S1), 0.06*0.2 = 0.012 (from S2)
    to S2: 0.09*0.3 = 0.027 (from S1), 0.06*0.3 = 0.018 (from S2)

a2: to S1: 0.012*0.2 = 0.0024 (from S1), 0.027*0.3 = 0.0081 (from S2)
    to S2: 0.012*0.4 = 0.0048 (from S1), 0.027*0.2 = 0.0054 (from S2)
The problem being addressed by this tree is S* = argmax_s P(S | a1-a2-a1-a2, μ)

a1-a2-a1-a2 is the output sequence and μ the model or the machine.

Path found (working backward):
S1 --a1--> S2 --a2--> S1 --a1--> S2 --a2--> S1
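The tree development can be replayed mechanically. A Python sketch (the combined arc probabilities P(Si --a--> Sj) are those read off the tree slides above, e.g. P(S1 --a1--> S2) = 0.3):

```python
# Replaying the Viterbi tree for output a1 a2 a1 a2 on the two-state PFSM.
T = {  # T[s][a][s2] = P(s --a--> s2)
    "S1": {"a1": {"S1": 0.1, "S2": 0.3}, "a2": {"S1": 0.2, "S2": 0.4}},
    "S2": {"a1": {"S1": 0.2, "S2": 0.3}, "a2": {"S1": 0.3, "S2": 0.2}},
}

score = {"S1": 1.0, "S2": 0.0}  # after the epsilon start
back = []                        # per step: winning predecessor of each state

for a in ["a1", "a2", "a1", "a2"]:
    new_score, new_back = {}, {}
    for s2 in ["S1", "S2"]:
        cands = {s1: score[s1] * T[s1][a][s2] for s1 in ["S1", "S2"]}
        best_pred = max(cands, key=cands.get)
        new_score[s2] = cands[best_pred]   # winner per state per iteration
        new_back[s2] = best_pred
    score = new_score
    back.append(new_back)

# Recover the path by working backward from the final winner.
last = max(score, key=score.get)
path = [last]
for step in reversed(back):
    path.append(step[path[-1]])
path.reverse()

assert abs(score[last] - 0.0081) < 1e-12           # final winner probability
assert path == ["S1", "S2", "S1", "S2", "S1"]      # path found on the slide
```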
Problem statement: Find the best possible sequence

S* = argmax_S P(S | O, μ)

where   S → State Seq,   O → Output Seq,   μ → Model or Machine

Model or Machine = {S0, S, A, T}
    S0 : Start symbol
    S  : State collection
    A  : Alphabet set
    T  : Transitions, defined as P(Si --ak--> Sj) ∀ i, j, k

Tabular representation of the tree

                                    Latest symbol observed
Ending state    ε      a1                             a2              a1               a2
S1              1.0    (1.0*0.1, 0.0*0.2) = (0.1,0.0) (0.02, 0.09)    (0.009, 0.012)   (0.0024, 0.0081)
S2              0.0    (1.0*0.3, 0.0*0.3) = (0.3,0.0) (0.04, 0.06)    (0.027, 0.018)   (0.0048, 0.0054)

Note: Every cell records the winning probability ending in that state; the bold-faced value in each cell is the winning sequence probability ending in that state. The final winner is 0.0081 (row S1); its predecessor is S2, indicated by the winner being the 2nd element of its tuple. Going backward from the final winner, we recover the sequence.
Algorithm
(following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
    a. Start State: S1
    b. Alphabet: A = {a1, a2, … ap}
    c. Set of States: S = {S1, S2, … Sn}
    d. Transition probability P(Si --ak--> Sj) ∀ i, j, k, which is equal to P(Sj, ak | Si)
2. The output string a1 a2 … aT

To find:
The most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [P(C | a1, a2, … aT, μ)]
Algorithm contd…

Data Structures:
1. An N*T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
2. Another N*T array called BACKPTR to recover the path.

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

1. Initialization
SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For(i = 2 to N) do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration
For(t = 2 to T) do
    For(i = 1 to N) do
        SEQSCORE(i,t) = Max(j=1..N) [SEQSCORE(j, t-1) * P(Sj --ak--> Si)]
        BACKPTR(i,t) = index j that gives the MAX above

3. Sequence Identification
C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1*T
2. SEQSCORE can be N*2 (only the current and previous columns are needed)
Homework: Compare this with A* and Beam Search.

Reason for this comparison: both of them work for finding and recovering sequences.