Automatic Speech Recognition (CS753)
Lecture 18: Search & Decoding (Part I)
Instructor: Preethi Jyothi
Mar 23, 2017
Recall ASR Decoding
$$W^{*} = \arg\max_{W}\ \Pr(O_A \mid W)\,\Pr(W)$$

$$W^{*} = \arg\max_{w_1^N,\,N} \left\{ \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \left[ \sum_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) \right] \right\}$$

$$\approx \arg\max_{w_1^N,\,N} \left\{ \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \underbrace{\left[ \max_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) \right]}_{\text{Viterbi approximation}} \right\}$$
• The Viterbi approximation divides the optimisation problem into sub-problems, allowing the efficient application of dynamic programming
• Even so, exact Viterbi search over the full network is infeasible for large-vocabulary tasks!
Recall Viterbi search
• Viterbi search finds the most probable path through a trellis with time on the X-axis and states on the Y-axis
• Viterbi algorithm: only needs to maintain information about the most probable path at each state
[Figure: Viterbi trellis for the ice-cream HMM (states H and C, observations 3 1 3), with transition/emission products on the arcs, e.g. P(H|start)·P(3|H) = .8 × .4, and Viterbi values v1(2) = .32, v1(1) = .02, v2(2) = max(.32 × .12, .02 × .08) = .038, v2(1) = max(.32 × .15, .02 × .25) = .048.]
Figure 9.12 The Viterbi backtrace. As we extend each path to a new state to account for the next observation, we keep a backpointer (shown with broken lines) to the best path that led us to this state.
Finally, we can give a formal definition of the Viterbi recursion as follows:
1. Initialization:
$$v_1(j) = a_{0j}\,b_j(o_1) \qquad 1 \le j \le N \qquad (9.20)$$
$$bt_1(j) = 0 \qquad (9.21)$$
2. Recursion (recall that states 0 and $q_F$ are non-emitting):
$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\,a_{ij}\,b_j(o_t); \qquad 1 \le j \le N,\ 1 < t \le T \qquad (9.22)$$
$$bt_t(j) = \operatorname*{argmax}_{i=1}^{N} v_{t-1}(i)\,a_{ij}\,b_j(o_t); \qquad 1 \le j \le N,\ 1 < t \le T \qquad (9.23)$$
3. Termination:
The best score: $$P^{*} = v_T(q_F) = \max_{i=1}^{N} v_T(i)\,a_{iF} \qquad (9.24)$$
The start of backtrace: $$q_{T*} = bt_T(q_F) = \operatorname*{argmax}_{i=1}^{N} v_T(i)\,a_{iF} \qquad (9.25)$$
Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9
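The recursion above translates almost line-for-line into code. Here is a minimal sketch in plain Python; the data structures are illustrative, and the exit probabilities a_iF in the usage example are invented for the demo (the figure doesn't specify them):

```python
def viterbi(obs, states, trans, emit):
    # v[t][j]: probability of the best path ending in state j after t+1
    # observations; bt[t][j]: backpointer to the best predecessor (Eqs. 9.20-9.23)
    v = [{j: trans[0][j] * emit[j][obs[0]] for j in states}]   # initialization
    bt = [{j: 0 for j in states}]
    for t in range(1, len(obs)):                               # recursion
        v.append({}); bt.append({})
        for j in states:
            best = max(states, key=lambda i: v[t - 1][i] * trans[i][j])
            v[t][j] = v[t - 1][best] * trans[best][j] * emit[j][obs[t]]
            bt[t][j] = best
    # termination: fold in the transition to the non-emitting final state F
    T = len(obs) - 1
    last = max(states, key=lambda i: v[T][i] * trans[i]["F"])
    best_score = v[T][last] * trans[last]["F"]
    path = [last]                                              # backtrace
    for t in range(T, 0, -1):
        path.append(bt[t][path[-1]])
    return list(reversed(path)), best_score

# Ice-cream example from Fig. 9.12; the "F" exit probabilities are made up.
states = ["H", "C"]
trans = {0: {"H": 0.8, "C": 0.2},
         "H": {"H": 0.6, "C": 0.3, "F": 0.1},
         "C": {"H": 0.4, "C": 0.5, "F": 0.1}}
emit = {"H": {1: 0.2, 3: 0.4}, "C": {1: 0.5, 3: 0.1}}
print(viterbi([3, 1, 3], states, trans, emit))  # -> (['H', 'H', 'H'], ...)
```

Running this reproduces the trellis values in the figure (v1(H) = .32, v2(H) = .038, v2(C) = .048) before the final-state transition is applied.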
ASR Search Network
[Figure: the ASR search network, built by expanding a network of words (e.g. "the boy is walking" / "the birds are walking") into a network of phones (e.g. d ax, b oy) and then into a network of HMM states, which is searched as a time-state trellis over word1, word2, word3 with time t on the horizontal axis.]
Viterbi search over the large trellis
• Exact search is infeasible for large vocabulary tasks
• Unknown word boundaries
• Ngram language models greatly increase the search space
• Solutions
• Compactly represent the search space using WFST-based optimisations
• Beam search: Prune away parts of the search space that aren’t promising
Two main WFST Optimizations
• Use determinization to reduce/eliminate redundancy
• Use minimization to reduce space requirements
Recall that not all weighted transducers are determinizable. To ensure determinizability of L ○ G, introduce disambiguation symbols in L to deal with homophones in the lexicon:
read : r eh d #0
red : r eh d #1
Propagate the disambiguation symbols as self-loops back to C and H. The resulting machines are H̃, C̃, L̃.
Two main WFST Optimizations
Minimization ensures that the final composed machine has the minimum number of states
Final optimization cascade:
N = πε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))
πε replaces the disambiguation symbols in the input alphabet of H̃ with ε (a sketch of this cascade follows below)
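As a rough illustration (not the lecture's reference implementation), the cascade can be written with OpenFst's Python wrapper pynini. H, C, L, G are assumed to be prebuilt transducers with disambiguation symbols already added, and disambig_ids is assumed to list the input-label ids of the #k symbols; all names are assumptions:

```python
import pynini

def build_decoding_graph(H, C, L, G, disambig_ids):
    LG = pynini.determinize(pynini.compose(L, G))     # det(L~ o G)
    CLG = pynini.determinize(pynini.compose(C, LG))   # det(C~ o det(L~ o G))
    HCLG = pynini.determinize(pynini.compose(H, CLG)).minimize()
    # pi_eps: relabel the auxiliary #k symbols on the input side to epsilon (label 0)
    return HCLG.relabel_pairs(ipairs=[(i, 0) for i in disambig_ids])
```

Determinizing at every level keeps each intermediate composition compact before the next composition blows it up again.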
Example G
[Figure: grammar FST G with states 0 → 1 → 2; arcs bob:bob, bond:bond, rob:rob from state 0 to 1 and slept:slept, read:read, ate:ate from state 1 to 2.]
Compact language models (G)
• Use Backoff Ngram language models for G
[Figure: a backoff Ngram LM represented as a WFST. States correspond to histories such as (a,b), b, and the empty history ε; word arcs such as c / Pr(c|a,b) move to the next history state (b,c), while backoff arcs ε / α(a,b), ε / α(b) drop to a lower-order history state.]
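The figure's topology is straightforward to lay out in code. Below is a minimal sketch (plain Python, illustrative names only, bigram case) that emits the arcs of such a transducer: one state per history, word arcs weighted by −log Pr(w|h), and ε backoff arcs weighted by −log α(h):

```python
def backoff_bigram_arcs(unigrams, bigrams, backoff):
    """Emit (src, dst, label, weight) arcs for a backoff bigram LM WFST.

    unigrams: {w: log P(w)}; bigrams: {(h, w): log P(w|h)};
    backoff: {h: log alpha(h)}. States are named by their history;
    "<eps>" is the empty-history (unigram) state. Weights are -log probs.
    """
    arcs = []
    for (h, w), logp in bigrams.items():
        arcs.append((h, w, w, -logp))                  # h --w / Pr(w|h)--> w
    for h, logalpha in backoff.items():
        arcs.append((h, "<eps>", "<eps>", -logalpha))  # h --eps / alpha(h)--> unigram state
    for w, logp in unigrams.items():
        arcs.append(("<eps>", w, w, -logp))            # unigram state --w / Pr(w)--> w
    return arcs
```

This stays compact because unseen bigrams are never stored explicitly; they are reached via the backoff state.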
Example L̃: Lexicon with disambiguation symbols
[Figure: lexicon FST L̃ mapping phone sequences to words: bob = b aa b #0, bond = b aa n d, rob = r aa b, slept = s l eh p t, read = r eh d, ate = ey t; the output word appears on the first arc of each pronunciation and the remaining arcs output ε.]
L̃ ○ G
[Figure: the composition L̃ ○ G expands each word sequence allowed by G into its phone sequence; note the redundancy, e.g. the common prefix b aa of bob and bond is duplicated along separate paths.]
det(L̃ ○ G)
[Figure: after determinization, the common prefix b aa is shared and output words are delayed until the pronunciations diverge, e.g. a single b:- arc followed by b:bob or n:bond.]
min(det(L̃ ○ G))
[Figure: after minimization, common suffixes are merged as well, yielding an equivalent machine with fewer states than det(L̃ ○ G).]
Beam pruning
• At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path
• Given the active nodes from the last time-step:
• Examine nodes in the current time-step that are reachable from active nodes in the previous time-step
• Get the active nodes for the current time-step by retaining only nodes whose hypotheses score close to the score of the best hypothesis
Beam search
• Beam search at each time-step keeps only hypotheses whose scores fall within a threshold of the current best hypothesis
• Hypotheses with Q(t, s) < δ · max_{s′} Q(t, s′) are pruned; here δ controls the beam width
• Search errors can occur if the most probable hypothesis gets pruned
• Trade-off between avoiding search errors and speeding up decoding (see the sketch below)
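A minimal sketch of one beam-pruned trellis step, assuming scores are probabilities (higher is better) and a relative beam 0 < δ ≤ 1 matching the Q(t, s) < δ · max Q(t, s′) rule above; all names are illustrative:

```python
def prune(active, delta):
    """active: {state: Q(t, s)}. Keep states within the relative beam."""
    best = max(active.values())
    return {s: q for s, q in active.items() if q >= delta * best}

def beam_step(active, states, trans, emit, obs_t, delta):
    """One Viterbi time-step that only expands currently active states."""
    new = {}
    for i, qi in active.items():          # extend each surviving hypothesis
        for j in states:
            score = qi * trans[i][j] * emit[j][obs_t]
            if score > new.get(j, 0.0):   # keep the best path into j
                new[j] = score
    return prune(new, delta)              # discard unpromising states
```

In practice scores are kept in the log domain, where the pruning test becomes log Q(t, s) < max log Q(t, s′) − beam for an additive beam.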
Static and dynamic networks
• What we’ve seen so far: Static decoding graph
• H ○ C ○ L ○ G
• Determinize/minimize to make this graph more compact
• Another approach: Dynamic graph expansion
• Dynamically build the graph with active states on the fly
• Do on-the-fly composition with the language model G
• (H ○ C ○ L) ○ G
Multi-pass search
• Some models are too expensive to use in first-pass decoding (e.g. RNN-based LMs)
• First-pass decoding: Use a simpler model (e.g. Ngram LMs)
• to find the most probable word sequences
• and represent them as a word lattice or an N-best list
• Rescore the first-pass hypotheses using the complex model to find the best word sequence (a sketch follows below)
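A hedged sketch of second-pass N-best rescoring: swap the first-pass LM score for a stronger LM's score and re-rank. lm_logprob stands in for the second-pass model (e.g. a trigram or RNN LM), and lm_weight plays the role of an LM scale factor; all names are assumptions:

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=15.0):
    """nbest: [(words, am_logprob, old_lm_logprob), ...] from the first pass.

    Returns word sequences re-ranked by AM score plus the new, scaled
    LM score; the old LM score is simply discarded.
    """
    rescored = []
    for words, am_lp, _old_lm_lp in nbest:
        total = am_lp + lm_weight * lm_logprob(words)  # swap in the new LM score
        rescored.append((total, words))
    rescored.sort(key=lambda x: x[0], reverse=True)    # re-rank
    return [words for _, words in rescored]
```

The default lm_weight of 15 mirrors the LM scale factor used in the 10-best list shown later (Fig. 10.2), but it is a tunable hyperparameter.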
Multi-pass decoding with N-best lists
Language models more complex than bigrams (e.g. trigrams, where a word's probability depends on the two previous words) violate the dynamic programming assumptions of Viterbi decoding; advanced probabilistic LMs like SCFGs violate them as well.
There are two solutions to these problems with Viterbi decoding. The most common is to modify the Viterbi decoder to return multiple potential utterances, instead of just the single best, and then use other high-level language-model or pronunciation-modeling algorithms to re-rank these multiple outputs (Schwartz and Austin, 1991; Soong and Huang, 1990; Murveit et al., 1993).
The second solution is to employ a completely different decoding algorithm, such as the stack decoder, or A∗ decoder (Jelinek, 1969; Jelinek et al., 1975). We begin in this section with multiple-pass decoding, and return to stack decoding in the next section.
In multiple-pass decoding we break up the decoding process into two stages. In the first stage we use fast, efficient knowledge sources or algorithms to perform a non-optimal search. So for example we might use an unsophisticated but time-and-space efficient language model like a bigram, or use simplified acoustic models. In the second decoding pass we can apply more sophisticated but slower decoding algorithms on a reduced search space. The interface between these passes is an N-best list or word lattice.
The simplest algorithm for multipass decoding is to modify the Viterbi algorithm to return the N-best sentences (word sequences) for a given speech input. Suppose for example a bigram grammar is used with such an N-best-Viterbi algorithm to return the 1000 most highly-probable sentences, each with their AM likelihood and LM prior score. This 1000-best list can now be passed to a more sophisticated language model like a trigram grammar. This new LM is used to replace the bigram LM score of each hypothesized sentence with a new trigram LM probability. These priors can be combined with the acoustic likelihood of each sentence to generate a new posterior probability for each sentence. Sentences are thus rescored and re-ranked using this more sophisticated probability. Fig. 10.1 shows an intuition for this algorithm.
[Figure: block diagram of two-stage decoding: speech input → N-Best Decoder (simple knowledge source) → N-best list → Rescoring (smarter knowledge source) → 1-best utterance, e.g. "If music be the food of love...".]
Figure 10.1 The use of N-best decoding as part of a two-stage decoding model. Efficient but unsophisticated knowledge sources are used to return the N-best utterances. This significantly reduces the search space for the second pass models, which are thus free to be very sophisticated but slow.
There are a number of algorithms for augmenting the Viterbi algorithm to generate N-best hypotheses.
• Simple algorithm: Modify the Viterbi algorithm to return the N-best word sequences for a given speech input
Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10
Multi-pass decoding with N-best lists
• Simple algorithm: Modify the Viterbi algorithm to return the N-best word sequences for a given speech input
Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10
• N-best lists aren't as diverse as we'd like, and they don't contain enough information to effectively use other knowledge sources
It turns out that there is no polynomial-time admissible algorithm for finding the N most likely hypotheses. There are, however, a number of approximate (non-admissible) algorithms; we will introduce just one of them, the "Exact N-best" algorithm of Schwartz and Chow (1990). In Exact N-best, instead of each state maintaining a single path/backtrace, we maintain up to N different paths for each state.
But we'd like to ensure that these paths correspond to different word paths; we don't want to waste our N paths on different state sequences that map to the same words. To do this, we keep for each path the word history, the entire sequence of words up to the current word/state. If two paths with the same word history come to a state at the same time, we merge the paths and sum the path probabilities. To keep the N best word sequences, the resulting algorithm requires O(N) times the normal Viterbi time.
Rank  Path                                                  AM logprob  LM logprob
 1.   it's an area that's naturally sort of mysterious        -7193.53      -20.25
 2.   that's an area that's naturally sort of mysterious      -7192.28      -21.11
 3.   it's an area that's not really sort of mysterious       -7221.68      -18.91
 4.   that scenario that's naturally sort of mysterious       -7189.19      -22.08
 5.   there's an area that's naturally sort of mysterious     -7198.35      -21.34
 6.   that's an area that's not really sort of mysterious     -7220.44      -19.77
 7.   the scenario that's naturally sort of mysterious        -7205.42      -21.50
 8.   so it's an area that's naturally sort of mysterious     -7195.92      -21.71
 9.   that scenario that's not really sort of mysterious      -7217.34      -20.70
10.   there's an area that's not really sort of mysterious    -7226.51      -20.01
Figure 10.2 An example 10-best list from the Broadcast News corpus, produced by the CU-HTK BN system (thanks to Phil Woodland). Logprobs use log10; the language model scale factor (LMSF) is 15.
The result of any of these algorithms is an N-best list like the one shown in Fig. 10.2.
In Fig. 10.2 the correct hypothesis happens to be the first one, but of course the reason to use N-best lists is that this isn't always the case. Each sentence in an N-best list is also annotated with an acoustic model probability and a language model probability. This allows a second-stage knowledge source to replace one of those two probabilities with an improved estimate.
One problem with an N-best list is that when N is large, listing all the sentences is extremely inefficient. Another problem is that N-best lists don’t give quite as much information as we might want for a second-pass decoder. For example, we might want distinct acoustic model information for each word hypothesis so that we can reapply a new acoustic model for the word. Or we might want to have available different start and end times of each word so that we can apply a new duration model.
For this reason, the output of a first-pass decoder is usually a more sophisticated representation called a word lattice (Murveit et al., 1993; Aubert and Ney, 1995). A word lattice is a directed graph that efficiently represents much more information about possible word sequences.¹ In some systems, nodes in the graph are words and arcs are transitions between words.
¹ Actually an ASR lattice is not the kind of lattice that may be familiar to you from mathematics, since it is not required to have the properties of a true lattice (i.e., be a partially ordered set with particular properties, such as a unique join for each pair of elements). Really it's just a graph, but it is conventional to call it a lattice.
Multi-pass decoding with lattices
• ASR lattice: A weighted automaton/directed graph representing alternative word hypotheses from an ASR system
[Figure: word lattice for "it's an area that's naturally sort of mysterious", with competing hypotheses such as so, it's / that's / there's / that scenario / the, and not really competing with naturally.]
Multi-pass decoding with lattices
• Confusion networks/sausages: Lattices that show competing/confusable words and can be used to compute posterior probabilities at the word level (a simplified sketch follows below)
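Proper confusion networks are built by aligning and clustering a full lattice. As a much simpler, hedged stand-in, the sketch below estimates word-level posteriors from an N-best list by normalizing total path scores (natural-log scores assumed; all names are illustrative):

```python
import math

def word_posteriors(nbest):
    """nbest: [(words, total_logprob), ...] -> {word: posterior estimate}."""
    m = max(lp for _, lp in nbest)                     # shift for numerical stability
    weights = [(words, math.exp(lp - m)) for words, lp in nbest]
    z = sum(w for _, w in weights)                     # normalizer over all hypotheses
    post = {}
    for words, w in weights:
        for word in set(words):                        # count each word once per hypothesis
            post[word] = post.get(word, 0.0) + w / z
    return post
```

A word that appears in most high-scoring hypotheses (e.g. "mysterious" in Fig. 10.2) gets a posterior near 1, while competing words like "it's" / "that's" / "there's" split the probability mass between them.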