Automatic Speech Recognition (CS753)
Lecture 15: Search & Decoding (Part I)
Instructor: Preethi Jyothi
Sep 21, 2017
The Big Picture
Putting it all together:
How do we recognise an utterance?
• A: speech utterance
• O_A: acoustic features corresponding to the utterance A
• Return the word sequence W that assigns the highest joint probability Pr(O_A|W) Pr(W) to O_A
• How do we estimate Pr(O_A|W) and Pr(W)?
• How do we decode?

$$W^* = \arg\max_{W} \Pr(O_A \mid W)\, \Pr(W)$$
Acoustic Model

$$\Pr(O_A \mid W) = \sum_{Q} \Pr(O_A, Q \mid W)$$

$$= \sum_{q_1^T, w_1^N} \prod_{t=1}^{T} \Pr(O_t \mid O_1^{t-1}, q_1^t, w_1^N)\, \Pr(q_t \mid q_1^{t-1}, w_1^N)$$

$$\approx \sum_{q_1^T, w_1^N} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\, \Pr(q_t \mid q_{t-1}, w_1^N) \qquad \text{(first-order HMM assumptions)}$$

$$\approx \max_{q_1^T, w_1^N} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\, \Pr(q_t \mid q_{t-1}, w_1^N) \qquad \text{(Viterbi approximation)}$$
Acoustic Model
$$\Pr(O_A \mid W) = \max_{q_1^T, w_1^N} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\, \Pr(q_t \mid q_{t-1}, w_1^N)$$

Emission probabilities

$$\Pr(O \mid q; w_1^N) = \sum_{\ell=1}^{L_q} c_{q\ell}\, \mathcal{N}(O \mid \mu_{q\ell}, \Sigma_{q\ell}; w_1^N)$$
Modeled using a mixture of Gaussians
Transition probabilities
• All the free parameters (means, covariances, mixture weights, transition probabilities) are learned using the EM (Baum-Welch) algorithm
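As a concrete illustration, the per-state emission score is a weighted sum of Gaussian densities, usually computed in the log domain. Below is a minimal numpy sketch for a diagonal-covariance GMM; the function name and array layout are assumptions for this sketch, not code from the course.

```python
import numpy as np

def gmm_log_likelihood(o, weights, means, variances):
    """Log emission probability log Pr(o | q) under a diagonal-covariance GMM.

    o:         (d,) observation vector (e.g., one frame of acoustic features)
    weights:   (L,) mixture weights c_{q,l}, summing to 1
    means:     (L, d) component means mu_{q,l}
    variances: (L, d) diagonal covariances sigma^2_{q,l}
    """
    d = o.shape[0]
    diff = o - means                                   # (L, d)
    log_det = np.sum(np.log(variances), axis=1)        # (L,) log |Sigma_{q,l}|
    mahal = np.sum(diff ** 2 / variances, axis=1)      # (L,) Mahalanobis terms
    log_comp = -0.5 * (d * np.log(2 * np.pi) + log_det + mahal)
    # Log-sum-exp over the mixture components, weighted by c_{q,l}.
    return np.logaddexp.reduce(np.log(weights) + log_comp)
```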
Language Model
$$W^* = \arg\max_{W} \Pr(O_A \mid W)\, \Pr(W)$$
m-gram language model
• Further optimized using smoothing and interpolation with lower-order Ngram models
$$\Pr(W) = \Pr(w_1, w_2, \ldots, w_N) = \Pr(w_1)\,\Pr(w_2 \mid w_1) \cdots \Pr(w_N \mid w_{N-m+1}^{N-1})$$
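As a small illustration of this chain-rule factorisation, the sketch below scores a word sequence with an m-gram model; `cond_prob` is a hypothetical lookup into an already trained (and smoothed) model, not something defined in this lecture.

```python
import math

def lm_log_prob(words, cond_prob, m=3):
    """Chain-rule log-probability of a word sequence under an m-gram LM.

    cond_prob(word, history) -> Pr(word | history), where history holds at
    most the m-1 previous words; it stands in for a trained, smoothed model.
    """
    total = 0.0
    for n, w in enumerate(words):
        history = tuple(words[max(0, n - m + 1):n])  # last m-1 words
        total += math.log(cond_prob(w, history))
    return total
```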
Decoding: Search
$$W^* = \arg\max_{W} \Pr(O_A \mid W)\, \Pr(W)$$

$$W^* = \arg\max_{w_1^N,\, N} \left\{ \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \left[ \sum_{q_1^T, w_1^N} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\, \Pr(q_t \mid q_{t-1}, w_1^N) \right] \right\}$$

$$\approx \arg\max_{w_1^N,\, N} \left\{ \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \left[ \max_{q_1^T, w_1^N} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\, \Pr(q_t \mid q_{t-1}, w_1^N) \right] \right\} \qquad \text{(Viterbi approximation)}$$
• The Viterbi approximation divides the above optimisation problem into sub-problems, which allows the efficient application of dynamic programming
• The search space is still very large for LVCSR tasks! Use approximate decoding techniques (beam-width decoding, etc.) to visit only promising parts of the search space
Recall Viterbi search
• Viterbi search finds the most probable path through a trellis with time on the X-axis and states on the Y-axis
• Viterbi algorithm: Only needs to maintain information about the most probable path at each state
[Figure: Viterbi trellis for the ice-cream example HMM (states H and C, observation sequence 3 1 3), showing the Viterbi values v_1(2) = .32, v_1(1) = .02, v_2(2) = max(.32 × .12, .02 × .08) = .038, v_2(1) = max(.32 × .15, .02 × .25) = .048, and the backpointers along each path.]
Figure 9.12: The Viterbi backtrace. As we extend each path to a new state to account for the next observation, we keep a backpointer (shown with broken lines) to the best path that led us to this state.
Finally, we can give a formal definition of the Viterbi recursion as follows:
1. Initialization:
$$v_1(j) = a_{0j}\, b_j(o_1), \qquad 1 \le j \le N$$
$$bt_1(j) = 0$$
2. Recursion (recall that states 0 and $q_F$ are non-emitting):
$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \qquad 1 \le j \le N,\; 1 < t \le T$$
$$bt_t(j) = \operatorname*{arg\,max}_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \qquad 1 \le j \le N,\; 1 < t \le T$$
3. Termination:
The best score: $$P^* = v_T(q_F) = \max_{i=1}^{N} v_T(i)\, a_{iF}$$
The start of backtrace: $$q_{T}^{*} = bt_T(q_F) = \operatorname*{arg\,max}_{i=1}^{N} v_T(i)\, a_{iF}$$
Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9
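The sketch below is a minimal log-space rendering of the recursion above; unlike the textbook formulation, it folds the non-emitting start state into an initial-probability vector and simply backtraces from the best-scoring final state (the array layouts are assumptions).

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most probable state sequence through the time-state trellis.

    obs:    length-T sequence of observation indices
    log_A:  (N, N) log transition probabilities log a_{ij}
    log_B:  (N, V) log emission probabilities log b_j(o)
    log_pi: (N,) log initial-state probabilities
    Returns (best_state_sequence, best_log_score).
    """
    T, N = len(obs), log_A.shape[0]
    v = np.full((T, N), -np.inf)       # Viterbi scores v_t(j)
    bp = np.zeros((T, N), dtype=int)   # backpointers bt_t(j)
    v[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = v[t - 1] + log_A[:, j] + log_B[j, obs[t]]
            bp[t, j] = np.argmax(scores)
            v[t, j] = scores[bp[t, j]]
    # Backtrace from the best final state.
    path = [int(np.argmax(v[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(bp[t, path[-1]]))
    return path[::-1], float(np.max(v[T - 1]))
```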
ASR Search Network
[Figure: the decoder searches a network of words (e.g. "the (birds are | boy is) walking"), expanded into a network of phones and then into a network of HMM states; decoding unrolls this network into a time-state trellis, with HMM states on the vertical axis and time t on the horizontal axis, spanning word1, word2, word3.]
Viterbi search over the large trellis
• Exact search is infeasible for large vocabulary tasks
• Unknown word boundaries
• Ngram language models greatly increase the search space
• Solutions
• Compactly represent the search space using WFST-based optimisations
• Beam search: Prune away parts of the search space that aren’t promising
Two main WFST Optimizations
Recall that not all weighted transducers are determinizable
To ensure determinizability of L ○ G, introduce disambiguation symbols in L to deal with homophones in the lexicon
read : r eh d #0
red : r eh d #1
• Use determinization to reduce/eliminate redundancy
Propagate the disambiguation symbols as self-loops back to C and H. Resulting machines are H̃ , C̃ , L̃
• Use minimization to reduce space requirements
Two main WFST Optimizations
Minimization ensures that the final composed machine has the minimum number of states
Final optimization cascade:
N = π_ε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))
where π_ε replaces the disambiguation symbols in the input alphabet of H̃ with ε
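The sketch below shows how this cascade might be applied in practice. The five FST operations are passed in as callables so that any WFST toolkit can be plugged in; they are placeholders for the corresponding library operations, not a specific API.

```python
def build_decoding_graph(H, C, L, G, compose, determinize, minimize, remove_disambig):
    """Build the static decoding graph following the cascade
    N = pi_eps(min(det(H~ o det(C~ o det(L~ o G))))).

    H, C, L are assumed to already carry the disambiguation-symbol
    self-loops (i.e. they are H~, C~, L~).
    """
    LG = determinize(compose(L, G))         # det(L~ o G)
    CLG = determinize(compose(C, LG))       # det(C~ o det(L~ o G))
    HCLG = determinize(compose(H, CLG))     # det(H~ o ...)
    return remove_disambig(minimize(HCLG))  # pi_eps(min(...))
```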
Example G
[Figure: a toy grammar G with states 0 → 1 → 2; arcs 0→1 labelled bob:bob, bond:bond, rob:rob and arcs 1→2 labelled slept:slept, read:read, ate:ate.]
Compact language models (G)
• Use Backoff Ngram language models for G
[Figure: a backoff Ngram LM as a WFST. From history state (a,b): an arc c / Pr(c|a,b) to state (b,c), and a backoff arc ε / α(a,b) to the lower-order state (b). From state (b): an arc c / Pr(c|b) to (b,c), and a backoff arc ε / α(b) to the unigram state ε, which has an arc c / Pr(c); state (b,c) likewise backs off with ε / α(b,c).]
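The same backoff structure can be written as a recursive table lookup, as in the sketch below; `logprob` and `backoff` are hypothetical stand-ins for an ARPA-style trained model.

```python
def backoff_log_prob(word, history, logprob, backoff):
    """Backoff Ngram log-probability, mirroring the WFST above.

    logprob[(history, word)]: log Pr(word | history) for seen Ngrams
    backoff[history]:         log backoff weight alpha(history)
    history is a tuple of the previous (up to m-1) words.
    """
    if (history, word) in logprob:
        return logprob[(history, word)]
    if not history:
        return float("-inf")  # word missing even from the unigram table
    # Back off: pay alpha(history) and drop the oldest history word.
    return backoff.get(history, 0.0) + backoff_log_prob(word, history[1:], logprob, backoff)
```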
Example L̃: Lexicon with disambiguation symbols
[Figure: the lexicon transducer L̃ maps phone sequences to words, e.g. b aa b #0 → bob, b aa n d → bond, r aa b → rob, s l eh p t → slept, r eh d → read, ey t → ate, with ε-transitions back to the start state to allow word sequences.]
L̃ ○ G
[Figure: the composition L̃ ○ G accepts the phone sequences of the word sequences allowed by G (bob | bond | rob followed by slept | read | ate); the machine is still redundant, e.g. the start state has separate b-arcs for bob and bond.]
det(L̃ ○ G)
[Figure: after determinization the shared prefixes are merged; the start state has a single b:ε arc, and the word outputs bob and bond are delayed until their pronunciations diverge.]
min(det(L̃ ○ G))
[Figure: minimization additionally merges equivalent suffix states, shrinking the machine from 21 states to 16 in this example.]
1st-pass recognition networks (40K-word vocabulary, NAB Eval '95)

    Transducer              × real-time
    C ○ L ○ G                  12.5
    C ○ det(L ○ G)              1.2
    det(H ○ C ○ L ○ G)          1.0
    push(min(F))                0.7

Recognition speed of the first-pass networks in the NAB 40,000-word vocabulary task, all at 83% word accuracy.
(Table from Mehryar Mohri, Speech Recognition course, Courant Institute, NYU.)
Viterbi search over the large trellis
• Exact search is infeasible for large vocabulary tasks
• Unknown word boundaries
• Ngram language models greatly increase the search space
• Solutions
• Compactly represent the search space using WFST-based optimisations
• Beam search: Prune away parts of the search space that aren’t promising
Beam pruning
• At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path
• Given active nodes from the last time-step:
• Examine nodes in the current time-step …
• … that are reachable from active nodes in the previous time-step
• Get active nodes for the current time-step by only retaining nodes with hypotheses that score close to the score of the best hypothesis
Beam search
• Beam search at each node keeps only hypotheses with scores that fall within a threshold of the current best hypothesis
• Hypotheses with Q(t, s) < δ · max_{s′} Q(t, s′) are pruned; here δ controls the beam width
• Search errors could occur if the most probable hypothesis gets pruned
• Trade-off between avoiding search errors and speeding up decoding
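A minimal sketch of one pruning step is shown below; it keeps scores in the linear domain to match the δ · max formulation above, whereas real decoders work with log-probabilities and an additive threshold.

```python
def prune_beam(scores, delta):
    """One step of beam pruning over the active trellis nodes.

    scores: dict mapping state -> accumulated path score Q(t, s)
    delta:  beam width in (0, 1]; states with Q(t, s) < delta * best are dropped
    Returns the set of states kept active for the next time-step.
    """
    best = max(scores.values())
    return {s for s, q in scores.items() if q >= delta * best}
```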
Static and dynamic networks
• What we’ve seen so far: Static decoding graph
• H ○ C ○ L ○ G
• Determinize/minimize to make this graph more compact
• Another approach: Dynamic graph expansion
• Dynamically build the graph with active states on the fly
• Do on-the-fly composition with the language model G
• (H ○ C ○ L) ○ G
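A highly simplified sketch of on-the-fly composition is given below: a composed state (s1, s2) is expanded only when the search visits it, so G never has to be composed in full. The arc accessors and epsilon label are assumptions, and the sketch ignores the epsilon-filter needed for correctness in general.

```python
EPSILON = 0  # conventional label id for epsilon (assumption)

def lazy_compose_arcs(hcl_arcs, g_arcs, state_pair):
    """Expand the outgoing arcs of a composed state of (H o C o L) o G on demand.

    hcl_arcs(s) / g_arcs(s) return iterables of (ilabel, olabel, weight,
    next_state) tuples for the (H o C o L) machine and for G respectively.
    """
    s1, s2 = state_pair
    for (i1, o1, w1, n1) in hcl_arcs(s1):
        if o1 == EPSILON:                      # no word emitted: G stays put
            yield (i1, EPSILON, w1, (n1, s2))
        else:
            for (i2, o2, w2, n2) in g_arcs(s2):
                if i2 == o1:                   # word labels match: both machines advance
                    yield (i1, o2, w1 * w2, (n1, n2))
```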