Speech, NLP and the Web
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lecture 38: Unsupervised learning in HMM, PCFG; Baum Welch
(lecture 37 was on cognitive NLP by Abhijit Mishra)
NLP Trinity (figure): a cube with three axes - Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Language (Hindi, Marathi, English, French), and Algorithm (HMM, CRF, MEMM).
Classic problems with respect to HMM
1. Given the observation sequence, find the possible state sequences - Viterbi
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm
Probabilistic FSM (figure): two states S1 and S2 with probabilistic transitions labelled (a1:0.3), (a2:0.4), (a1:0.2), (a2:0.3), (a1:0.1), (a2:0.2), (a1:0.3), (a2:0.2).
The question here is:
"what is the most likely state sequence given the output sequence seen"

Developing the tree (figure): starting from Start with initial probabilities 1.0 for S1 and 0.0 for S2, the tree branches into S1/S2 at every level as the symbols a1, a2 are read. First-level products: 1*0.1=0.1, 0.3, 0.0, 0.0; second-level products: 0.1*0.2=0.02, 0.1*0.4=0.04, 0.3*0.3=0.09, 0.3*0.2=0.06; and so on.
Choose the winning sequence per state per iteration.
Tree structure contd. (figure): the surviving paths carry probabilities 0.09 and 0.06; extending them gives 0.09*0.1=0.009 and 0.018, and at the next level 0.0081, 0.0054, 0.0048 and 0.0024 on the S1/S2 branches under outputs a1, a2.

The problem being addressed by this tree is
S* = argmax_S P(S | a1 a2 a1 a2, μ)
where a1-a2-a1-a2 is the output sequence and μ the model or the machine.

Path found (working backward): S1 → S2 → S1 → S2 → S1 for the output sequence a1, a2, a1, a2.
Problem statement: Find the best possible sequence
S* = argmax_S P(S | O, μ)
where O = output sequence, S = state sequence, μ = machine or model
Machine or model μ = {S_0, S, A, T}: start state, state collection, alphabet set, transitions
T is defined as P(S_i --a_k--> S_j) for all i, j, k.
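A minimal sketch of this arg-max search (the "winning sequence per state" idea of the tree above), in Python; the two-state machine and its probability table here are illustrative assumptions, not the exact figures of the earlier FSM slide.

```python
# Viterbi-style search for S* = argmax_S P(S | O, mu).
# T holds P(Si --a--> Sj); the numbers below are illustrative only.
T = {
    ('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.2,
    ('S1', 'a2', 'S1'): 0.4, ('S1', 'a2', 'S2'): 0.3,
    ('S2', 'a1', 'S1'): 0.3, ('S2', 'a1', 'S2'): 0.1,
    ('S2', 'a2', 'S1'): 0.2, ('S2', 'a2', 'S2'): 0.4,
}
states = ['S1', 'S2']

def viterbi(obs, start='S1'):
    # best[s] = (probability, path) of the best sequence ending in state s
    best = {s: ((1.0 if s == start else 0.0), [s]) for s in states}
    for o in obs:
        new_best = {}
        for sj in states:
            # keep only the winning predecessor per state per iteration
            cands = [(p * T.get((si, o, sj), 0.0), path + [sj])
                     for si, (p, path) in best.items()]
            new_best[sj] = max(cands, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])

prob, path = viterbi(['a1', 'a2', 'a1', 'a2'])
print(path, prob)
```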
How to compute P(o_0 o_1 o_2 o_3 … o_m)
Marginalization: P(O) = Σ_S P(O, S)
Consider the observation sequence O_0 O_1 O_2 … O_m emitted along the state sequence S_0 S_1 S_2 … S_m S_{m+1}, where S represents the state sequence.
Computing P(o_0 o_1 o_2 o_3 … o_m)
P(O, S) = P(O | S) · P(S)
= P(o_0 o_1 … o_m | S_0 S_1 … S_{m+1}) · P(S_0 S_1 … S_{m+1})
= P(o_0 | S_0) · P(o_1 | S_1) … P(o_m | S_m) · P(S_1 | S_0) · P(S_2 | S_1) … P(S_{m+1} | S_m)
= [P(o_0 | S_0) · P(S_1 | S_0)] · [P(o_1 | S_1) · P(S_2 | S_1)] … [P(o_m | S_m) · P(S_{m+1} | S_m)]
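This marginalization can be coded directly by enumerating every state sequence; a minimal sketch, using as an illustration the two-state "initial guess" machine that appears later in the lecture (so the transition-cum-emission table T below is an assumed reading of that figure). The cost is exponential in the sequence length, which is exactly what the forward/backward recursions below avoid.

```python
from itertools import product

# Brute-force P(O) = sum_S P(O, S): enumerate every state sequence.
# T holds P(Si --o--> Sj); this is the "initial guess" q/r machine used later.
T = {('q', 'a', 'q'): 0.48, ('q', 'a', 'r'): 0.04,
     ('q', 'b', 'q'): 0.48, ('r', 'b', 'q'): 1.0}
states = ['q', 'r']

def prob_obs(obs, start='q'):
    total = 0.0
    for seq in product(states, repeat=len(obs)):   # all 2^m state sequences
        p, prev = 1.0, start
        for o, cur in zip(obs, seq):
            p *= T.get((prev, o, cur), 0.0)
            prev = cur
        total += p                                 # P(O, S) for this sequence
    return total

print(prob_obs(list('ababb')))                     # roughly 0.035
```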
Forward and Backward Probability Calculation

Forward probability F(k,i)
Define F(k,i) = probability of being in state S_i having seen o_0 o_1 o_2 … o_k
F(k,i) = P(o_0 o_1 o_2 … o_k, S_i)
With m as the length of the observed sequence and N states:
P(observed sequence) = P(o_0 o_1 o_2 … o_m)
= Σ_{p=0..N} P(o_0 o_1 o_2 … o_m, S_p)
= Σ_{p=0..N} F(m, p)
Forward probability (contd.)
F(k, q)
= P(o_0 o_1 o_2 … o_k, S_q)
= P(o_0 o_1 o_2 … o_{k-1}, o_k, S_q)
= Σ_{p=0..N} P(o_0 o_1 o_2 … o_{k-1}, S_p, o_k, S_q)
= Σ_{p=0..N} P(o_0 o_1 o_2 … o_{k-1}, S_p) · P(o_k, S_q | o_0 o_1 o_2 … o_{k-1}, S_p)
= Σ_{p=0..N} F(k-1, p) · P(o_k, S_q | S_p)
= Σ_{p=0..N} F(k-1, p) · P(S_p --o_k--> S_q)

(figure: observations O_0 O_1 O_2 O_3 … O_k O_{k+1} … O_{m-1} O_m along states S_0 S_1 S_2 S_3 … S_p S_q … S_m S_final)
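A minimal sketch of this forward recursion in Python, on the same illustrative q/r machine (the probability table is an assumption, not the lecture's exact figure); F[k+1] in the code plays the role of F(k, ·) above, with F[0] encoding the start state.

```python
# Forward probabilities: F(k, q) = P(o_0 .. o_k, S_q),
# computed left to right via F(k, q) = sum_p F(k-1, p) * P(S_p --o_k--> S_q).
T = {('q', 'a', 'q'): 0.48, ('q', 'a', 'r'): 0.04,
     ('q', 'b', 'q'): 0.48, ('r', 'b', 'q'): 1.0}
states = ['q', 'r']

def forward(obs, start='q'):
    F = [{s: (1.0 if s == start else 0.0) for s in states}]   # before any symbol
    for o in obs:
        prev = F[-1]
        F.append({sq: sum(prev[sp] * T.get((sp, o, sq), 0.0) for sp in states)
                  for sq in states})
    return F

F = forward(list('ababb'))
print(sum(F[-1].values()))   # P(observed sequence) = sum_p F(m, p), roughly 0.035
```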
Backward probability B(k,i)
Define B(k,i) = probability of seeing o_k o_{k+1} o_{k+2} … o_m given that the state was S_i
B(k,i) = P(o_k o_{k+1} o_{k+2} … o_m | S_i)
With m as the length of the whole observed sequence:
P(observed sequence) = P(o_0 o_1 o_2 … o_m)
= P(o_0 o_1 o_2 … o_m | S_0)
= B(0,0)
Backward probability (contd.)
B(k, p)
= P(o_k o_{k+1} o_{k+2} … o_m | S_p)
= P(o_{k+1} o_{k+2} … o_m, o_k | S_p)
= Σ_{q=0..N} P(o_{k+1} o_{k+2} … o_m, o_k, S_q | S_p)
= Σ_{q=0..N} P(o_k, S_q | S_p) · P(o_{k+1} o_{k+2} … o_m | o_k, S_q, S_p)
= Σ_{q=0..N} P(o_{k+1} o_{k+2} … o_m | S_q) · P(o_k, S_q | S_p)
= Σ_{q=0..N} B(k+1, q) · P(S_p --o_k--> S_q)

(figure: observations O_0 O_1 … O_k O_{k+1} … O_{m-1} O_m along states S_0 S_1 … S_p S_q … S_m S_final)
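And the mirror-image backward recursion, under the same assumptions about the toy machine; B[k] in the code plays the role of B(k, ·), with B(m+1, ·) = 1 as the base case.

```python
# Backward probabilities: B(k, p) = P(o_k .. o_m | S_p),
# computed right to left via B(k, p) = sum_q P(S_p --o_k--> S_q) * B(k+1, q).
T = {('q', 'a', 'q'): 0.48, ('q', 'a', 'r'): 0.04,
     ('q', 'b', 'q'): 0.48, ('r', 'b', 'q'): 1.0}
states = ['q', 'r']

def backward(obs):
    B = [{s: 1.0 for s in states}]            # B(m+1, .): nothing left to emit
    for o in reversed(obs):
        nxt = B[0]
        B.insert(0, {sp: sum(T.get((sp, o, sq), 0.0) * nxt[sq] for sq in states)
                     for sp in states})
    return B

B = backward(list('ababb'))
print(B[0]['q'])   # P(observed sequence) = P(o_0 .. o_m | S_0) = B(0, 0), roughly 0.035
```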
HMM Training
Baum Welch or Forward Backward Algorithm
Key Intuition
Given: training sequence
Initialization: probability values
Compute: Pr(state seq | training seq), get expected count of transitions, compute rule probabilities
Approach: initialize the probabilities and recompute them…
(figure: a two-state machine with states q and r, every arc labelled a or b)
Baum-Welch algorithm: counts
String = abb aaa bbb aaa
Sequence of states with respect to input symbols
(figure: the two-state machine q, r with a, b on all arcs; beneath it, the output sequence a b b a a a b b b a a a is aligned with the state sequence along which it was generated, from which the transition counts of the next slide are read off)
Calculating probabilities from table
Table of counts (T = #states, A = #alphabet symbols):

Src  Dest  O/P  Count
q    r     a    5
q    q     b    3
r    q     a    3
r    q     b    2

P(q --b--> q) = 3/8        P(q --a--> r) = 5/8

Now if we have non-deterministic transitions then multiple state sequences are possible for the given output sequence (ref. to previous slide's figure). Our aim is to find the expected counts, and from them the probabilities:

P(s_i --w_k--> s_j) = c(s_i --w_k--> s_j) / Σ_{l=1..T} Σ_{m=1..A} c(s_i --w_m--> s_l)
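The normalization behind these numbers (e.g. P(q --b--> q) = 3/8 because q has 5 + 3 = 8 outgoing transition counts in total) is a one-liner; a minimal sketch:

```python
from collections import defaultdict

# Transition counts from the table above: (src, output, dest) -> count.
counts = {('q', 'a', 'r'): 5, ('q', 'b', 'q'): 3,
          ('r', 'a', 'q'): 3, ('r', 'b', 'q'): 2}

# P(s_i --w_k--> s_j) = c(s_i --w_k--> s_j) / total count of arcs leaving s_i.
out_of = defaultdict(float)
for (src, _, _), c in counts.items():
    out_of[src] += c

P = {arc: c / out_of[arc[0]] for arc, c in counts.items()}
print(P[('q', 'b', 'q')])   # 3/8 = 0.375
print(P[('q', 'a', 'r')])   # 5/8 = 0.625
```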
Interplay Between Two Equations

P(s_i --w_k--> s_j) = c(s_i --w_k--> s_j) / Σ_{l=1..T} Σ_{m=1..A} c(s_i --w_m--> s_l)

c(s_i --w_k--> s_j) = Σ_{S_{0,n}} P(S_{0,n} | W_{0,n}) · n(s_i --w_k--> s_j, S_{0,n}, W_{0,n})

where n(s_i --w_k--> s_j, S_{0,n}, W_{0,n}) is the number of times the transition s_i --w_k--> s_j occurs in the string.
Illustration

Actual (desired) HMM (figure): states q, r with q --a:0.67--> r, q --a:0.16--> q, q --b:0.17--> q, r --b:1.0--> q.
Initial guess (figure): q --a:0.04--> r, q --a:0.48--> q, q --b:0.48--> q, r --b:1.0--> q.

One run of Baum-Welch algorithm: string ababb*

State sequence   P(path)    q--a-->r   r--b-->q   q--a-->q   q--b-->q
q r q r q q      0.00077    0.00154    0.00154    0          0.00077
q r q q q q      0.00442    0.00442    0.00442    0.00442    0.00884
q q q r q q      0.00442    0.00442    0.00442    0.00442    0.00884
q q q q q q      0.02548    0.0        0.0        0.05096    0.07644
Rounded Total    0.035      0.01       0.01       0.06       0.095

New probabilities, normalizing the column totals over the arcs leaving each state:
P(q --a--> r) = 0.01 / (0.01 + 0.06 + 0.095) = 0.06
P(r --b--> q) = 1.0
P(q --a--> q) = 0.36
P(q --b--> q) = 0.581

* ε is considered as the starting and ending symbol of the input sequence string.
Through multiple iterations the probability values will converge.
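The whole table can be reproduced by brute force: enumerate every state sequence for ababb, weight each transition count by P(path), and renormalize per source state. A minimal sketch, assuming the reading of the "initial guess" figure given above (the ε start/end transitions are omitted, as the path probabilities in the table suggest):

```python
from itertools import product
from collections import defaultdict

# One Baum-Welch iteration by path enumeration on the initial-guess machine.
T = {('q', 'a', 'q'): 0.48, ('q', 'a', 'r'): 0.04,
     ('q', 'b', 'q'): 0.48, ('r', 'b', 'q'): 1.0}
states, obs, start = ['q', 'r'], list('ababb'), 'q'

expected = defaultdict(float)              # P(path)-weighted count of each arc
total = 0.0                                # P(observed string)
for seq in product(states, repeat=len(obs)):
    p, arcs, prev = 1.0, [], start
    for o, cur in zip(obs, seq):
        arcs.append((prev, o, cur))
        p *= T.get((prev, o, cur), 0.0)
        prev = cur
    total += p
    for arc in arcs:
        expected[arc] += p

# Re-estimate: normalize the expected counts over the arcs leaving each state
# (dividing by P(O) first would cancel out in this normalization anyway).
out_of = defaultdict(float)
for (src, _, _), c in expected.items():
    out_of[src] += c
new_T = {arc: c / out_of[arc[0]] for arc, c in expected.items() if c > 0}
print(round(total, 3))                               # roughly 0.035
print({arc: round(p, 2) for arc, p in new_T.items()})
# close to the "new probabilities" row above:
#   q--a-->r about 0.06, q--a-->q about 0.36, q--b-->q about 0.57, r--b-->q = 1.0
```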
Related example: Word alignment

English: (1) three rabbits           a b
         (2) rabbits of Grenoble     b c d
French:  (1) trois lapins            w x
         (2) lapins de Grenoble      x y z

Initial probabilities (each cell denotes t(a → w), t(a → x), etc.):

     a     b     c     d
w    1/4   1/4   1/4   1/4
x    1/4   1/4   1/4   1/4
y    1/4   1/4   1/4   1/4
z    1/4   1/4   1/4   1/4

"counts" from pair (2), b c d ↔ x y z:

     a     b     c     d
w    0     0     0     0
x    0     1/3   1/3   1/3
y    0     1/3   1/3   1/3
z    0     1/3   1/3   1/3

"counts" from pair (1), a b ↔ w x:

     a     b     c     d
w    1/2   1/2   0     0
x    1/2   1/2   0     0
y    0     0     0     0
z    0     0     0     0

Revised probabilities table:

     a     b     c     d
w    1/2   1/4   0     0
x    1/2   5/12  1/3   1/3
y    0     1/6   1/3   1/3
z    0     1/6   1/3   1/3

"revised counts" from pair (2), b c d ↔ x y z:

     a     b     c     d
w    0     0     0     0
x    0     5/9   1/3   1/3
y    0     2/9   1/3   1/3
z    0     2/9   1/3   1/3

"revised counts" from pair (1), a b ↔ w x:

     a     b     c     d
w    1/2   3/8   0     0
x    1/2   5/8   0     0
y    0     0     0     0
z    0     0     0     0

Re-revised probabilities table:

     a     b       c     d
w    1/2   3/16    0     0
x    1/2   85/144  1/3   1/3
y    0     1/9     1/3   1/3
z    0     1/9     1/3   1/3

Continue until convergence; notice that the (b, x) binding gets progressively stronger;
b = rabbits, x = lapins.
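A minimal EM sketch that reproduces these tables in Python: the E-step distributes one unit of count per English word over the French words of its own sentence, in proportion to the current t values, and the M-step renormalizes per English word. The variable names are mine; the data is the toy example above.

```python
from collections import defaultdict

# Toy EM for word alignment: English words a..d, French words w..z.
pairs = [(['a', 'b'], ['w', 'x']),            # three rabbits <-> trois lapins
         (['b', 'c', 'd'], ['x', 'y', 'z'])]  # rabbits of Grenoble <-> lapins de Grenoble
eng, fr = ['a', 'b', 'c', 'd'], ['w', 'x', 'y', 'z']

t = {(f, e): 0.25 for e in eng for f in fr}   # initial probabilities: all 1/4

for _ in range(2):                            # two rounds, as on the slides
    # E-step ("counts"): each English word spreads its count over the
    # French words of its own sentence, proportionally to t.
    c = defaultdict(float)
    for es, fs in pairs:
        for e in es:
            z = sum(t[(f, e)] for f in fs)
            for f in fs:
                c[(f, e)] += t[(f, e)] / z
    # M-step ("revised probabilities"): renormalize per English word.
    for e in eng:
        tot = sum(c[(f, e)] for f in fr)
        for f in fr:
            t[(f, e)] = c[(f, e)] / tot if tot else 0.0

print(round(t[('x', 'b')], 3))   # 85/144, about 0.59: the (b, x) binding keeps strengthening
```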
Computational part (1/2)

C(s_i --w_k--> s_j)
= Σ_{S_{0,n}} [P(S_{0,n} | W_{0,n}) · n(s_i --w_k--> s_j, S_{0,n}, W_{0,n})]
= (1 / P(W_{0,n})) · Σ_{S_{0,n}} [P(S_{0,n}, W_{0,n}) · n(s_i --w_k--> s_j, S_{0,n}, W_{0,n})]
= (1 / P(W_{0,n})) · Σ_{t=0..n-1} Σ_{S_{0,n}} [P(S_{0,n}, W_{0,n}, S_t = s_i, w_t = w_k, S_{t+1} = s_j)]
= (1 / P(W_{0,n})) · Σ_{t=0..n-1} P(S_t = s_i, w_t = w_k, S_{t+1} = s_j, W_{0,n})
Computational part (2/2)

P(S_t = s_i, w_t = w_k, S_{t+1} = s_j, W_{0,n})
= P(W_{0,t-1}, S_t = s_i, w_t = w_k, S_{t+1} = s_j, W_{t+1,n})
= P(W_{0,t-1}, S_t = s_i) · P(w_t = w_k, S_{t+1} = s_j, W_{t+1,n} | W_{0,t-1}, S_t = s_i)
= P(W_{0,t-1}, S_t = s_i) · P(w_t = w_k, S_{t+1} = s_j | S_t = s_i) · P(W_{t+1,n} | S_{t+1} = s_j)
= F(t-1, i) · P(s_i --w_k--> s_j) · B(t+1, j)

(figure: words w_0 w_1 w_2 … w_k … w_{n-1} w_n along states S_0 S_1 … S_i S_j … S_{n-1} S_n S_{n+1})
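Putting the two computational parts together: a minimal sketch that computes the expected counts C(s_i --w--> s_j) with F and B rather than by enumerating paths, again on the illustrative q/r machine (the table T and the start state are assumptions). F[t] below stands for F(t-1, ·) and B[t] for B(t, ·) in the lecture's indexing.

```python
from collections import defaultdict

# Expected counts via forward/backward:
# C(s_i --w_t--> s_j) = (1/P(W)) * sum_t F(t-1, i) * P(s_i --w_t--> s_j) * B(t+1, j)
T = {('q', 'a', 'q'): 0.48, ('q', 'a', 'r'): 0.04,
     ('q', 'b', 'q'): 0.48, ('r', 'b', 'q'): 1.0}
states, start = ['q', 'r'], 'q'

def forward(obs):
    F = [{s: (1.0 if s == start else 0.0) for s in states}]
    for o in obs:
        F.append({sq: sum(F[-1][sp] * T.get((sp, o, sq), 0.0) for sp in states)
                  for sq in states})
    return F                                    # F[t] == F(t-1, .) of the lecture

def backward(obs):
    B = [{s: 1.0 for s in states}]
    for o in reversed(obs):
        B.insert(0, {sp: sum(T.get((sp, o, sq), 0.0) * B[0][sq] for sq in states)
                     for sp in states})
    return B                                    # B[t] == B(t, .) of the lecture

obs = list('ababb')
F, B = forward(obs), backward(obs)
p_obs = sum(F[-1].values())                     # P(W_{0,n})

C = defaultdict(float)
for t, o in enumerate(obs):
    for si in states:
        for sj in states:
            C[(si, o, sj)] += F[t][si] * T.get((si, o, sj), 0.0) * B[t + 1][sj] / p_obs

# M-step: renormalize over the arcs leaving each state.
out_of = defaultdict(float)
for (si, _, _), c in C.items():
    out_of[si] += c
print({arc: round(c / out_of[arc[0]], 2) for arc, c in C.items() if c > 0})
```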
Discussions
1. Symmetry breaking: if the probabilities are initialized symmetrically, re-estimation cannot break the symmetry and the initial values never change (see the figure below).
2. Getting stuck in local maxima.
3. Label bias problem: probabilities have to sum to 1.

(figure: a "desired" three-state machine with arcs b:1.0, b:0.5, a:0.5, a:1.0, next to a symmetrically "initialized" machine with arcs a:0.5, b:0.5, a:0.25, a:0.5, b:0.5, a:0.25, b:0.25, b:0.5)
HMM ↔ PCFG
O observed sequence ↔ w_{1m} sentence
X state sequence ↔ t parse tree
μ model ↔ G grammar
Three fundamental questions
HMM ↔ PCFG
How likely is a certain observation given the model? ↔ How likely is a sentence given the grammar?
P(O | μ) ↔ P(w_{1m} | G)
How to choose a state sequence which best explains the observations? ↔ How to choose a parse which best supports the sentence?
argmax_X P(X | O, μ) ↔ argmax_t P(t | w_{1m}, G)
HMM ↔ PCFG
How to choose the model parameters that best explain the observed data? ↔ How to choose rule probabilities which maximize the probabilities of the observed sentences?
argmax_μ P(O | μ) ↔ argmax_G P(w_{1m} | G)
Interesting Probabilities
The gunman sprayed the building with bullets
    1      2       3      4      5      6      7

β_NP(4,5): what is the probability of having an NP at this position such that it will derive "the building"? - the Inside probability.
α_NP(4,5): what is the probability of starting from N^1 and deriving "The gunman sprayed", an NP and "with bullets"? - the Outside probability.
Interesting Probabilities
Random variables to be considered
The non-terminal being expanded.
E.g., NP
The word-span covered by the non-terminal.
E.g., (4,5) refers to words “the building”
While calculating probabilities, consider:
The rule to be used for expansion: e.g., NP → DT NN
The probabilities associated with the RHS non-terminals: e.g., the DT subtree's inside/outside probabilities and the NN subtree's inside/outside probabilities
Outside Probability
α_j(p,q): the probability of beginning with N^1 and generating the non-terminal N^j_{pq} and all the words outside w_p … w_q
α_j(p,q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} | G)

(figure: the sentence w_1 … w_{p-1} w_p … w_q w_{q+1} … w_m under the root N^1, with N^j dominating w_p … w_q)
Inside Probabilities
β_j(p,q): the probability of generating the words w_p … w_q starting with the non-terminal N^j_{pq}
β_j(p,q) = P(w_{pq} | N^j_{pq}, G)

(figure: the sentence w_1 … w_{p-1} w_p … w_q w_{q+1} … w_m under the root N^1, with N^j dominating w_p … w_q)
Outside & Inside Probabilities: example
The gunman sprayed the building with bullets

α_NP(4,5) for "the building" = P(The gunman sprayed, NP_{4,5}, with bullets | G)
β_NP(4,5) for "the building" = P(the building | NP_{4,5}, G)
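For completeness, the inside probability is normally filled in bottom-up, CKY style; a minimal sketch on a tiny hypothetical CNF grammar (the rules, probabilities and 0-based span indices below are illustrative assumptions, not the lecture's grammar).

```python
from collections import defaultdict

# Inside probabilities beta[(A, p, q)] = P(w_p .. w_q | A_pq, G),
# computed bottom-up for a tiny hypothetical PCFG in Chomsky normal form.
lexical = {('DT', 'the'): 1.0, ('NN', 'building'): 0.5, ('NN', 'gunman'): 0.5}
binary = {('NP', 'DT', 'NN'): 1.0}               # NP -> DT NN with probability 1.0

def inside(words):
    beta = defaultdict(float)
    n = len(words)
    for p, w in enumerate(words):                # length-1 spans: lexical rules
        for (A, word), prob in lexical.items():
            if word == w:
                beta[(A, p, p)] = prob
    for length in range(2, n + 1):               # longer spans: binary rules
        for p in range(n - length + 1):
            q = p + length - 1
            for (A, B, C), prob in binary.items():
                for d in range(p, q):            # all split points
                    beta[(A, p, q)] += prob * beta[(B, p, d)] * beta[(C, d + 1, q)]
    return beta

beta = inside(['the', 'building'])
print(beta[('NP', 0, 1)])                        # P(the building | NP, G) = 0.5
```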
PCFG Training