(1)

Automatic Speech Recognition (CS753)

Lecture 18: Search & Decoding (Part I)

Instructor: Preethi Jyothi
Mar 23, 2017

(2)

Recall ASR Decoding

\[
W^{*} = \arg\max_{W} \; \Pr(O_A \mid W)\,\Pr(W)
\]

\[
W^{*} = \arg\max_{w_1^N,\,N} \left\{ \left[\, \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{\,n-1}) \right] \left[\, \sum_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\, \Pr(q_t \mid q_{t-1}, w_1^N) \right] \right\}
\]

\[
\approx \arg\max_{w_1^N,\,N} \left\{ \left[\, \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{\,n-1}) \right] \left[\, \max_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\, \Pr(q_t \mid q_{t-1}, w_1^N) \right] \right\} \quad \text{(Viterbi approximation)}
\]

The Viterbi approximation divides the above optimisation problem into sub-problems that allow the efficient application of dynamic programming

An exact search using Viterbi is infeasible for large vocabulary tasks!

(3)

Recall Viterbi search

Viterbi search finds the most probable path through a trellis with time on the X-axis and HMM states on the Y-axis

Viterbi algorithm: only needs to maintain information about the most probable path reaching each state

[Figure: a two-state (H/C) trellis over the observation sequence 3 1 3, with start and end states, showing transition/emission products such as P(H|start)·P(3|H) = .8·.4 and P(C|start)·P(3|C) = .2·.1, the Viterbi values v1(1) = .02, v1(2) = .32, v2(1) = max(.32·.15, .02·.25) = .048, v2(2) = max(.32·.12, .02·.08) = .038, and backpointers along each path.]

Figure 9.12: The Viterbi backtrace. As we extend each path to a new state to account for the next observation, we keep a backpointer (shown with broken lines) to the best path that led us to this state.

Finally, we can give a formal definition of the Viterbi recursion as follows:

1. Initialization:

\[ v_1(j) = a_{0j}\, b_j(o_1), \qquad 1 \le j \le N \qquad (9.20) \]
\[ bt_1(j) = 0 \qquad (9.21) \]

2. Recursion (recall that states 0 and \(q_F\) are non-emitting):

\[ v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \qquad 1 \le j \le N,\; 1 < t \le T \qquad (9.22) \]
\[ bt_t(j) = \arg\max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \qquad 1 \le j \le N,\; 1 < t \le T \qquad (9.23) \]

3. Termination:

The best score: \( P^{*} = v_T(q_F) = \max_{i=1}^{N} v_T(i)\, a_{iF} \) (9.24)

The start of backtrace: \( q_{T*} = bt_T(q_F) = \arg\max_{i=1}^{N} v_T(i)\, a_{iF} \) (9.25)


Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9
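To make the recursion concrete, here is a minimal NumPy sketch of Viterbi decoding over a small HMM. This is not from the lecture: the function and variable names are illustrative, the final-state probabilities a_{iF} are omitted, and the toy numbers follow the ice-cream figure above (the emission probabilities for observation 2, not shown in the figure, are set to zero).

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Return the most probable state path and its score for an observation sequence.

    obs : list of observation indices o_1..o_T
    pi  : (N,) initial probabilities a_{0j}
    A   : (N, N) transition probabilities a_{ij}
    B   : (N, M) emission probabilities b_j(o)
    """
    T, N = len(obs), len(pi)
    v = np.zeros((T, N))                   # v_t(j): best path score ending in state j at time t
    bt = np.zeros((T, N), dtype=int)       # backpointers bt_t(j)

    v[0] = pi * B[:, obs[0]]               # initialization: v_1(j) = a_{0j} b_j(o_1)
    for t in range(1, T):                  # recursion: max over predecessor states i
        cand = v[t - 1][:, None] * A       # cand[i, j] = v_{t-1}(i) * a_{ij}
        bt[t] = cand.argmax(axis=0)
        v[t] = cand.max(axis=0) * B[:, obs[t]]

    best = int(v[-1].argmax())             # termination (final-state probs a_{iF} omitted here)
    path = [best]
    for t in range(T - 1, 0, -1):          # backtrace via the stored backpointers
        path.append(int(bt[t, path[-1]]))
    return path[::-1], float(v[-1].max())

# Toy run on the ice-cream HMM of Fig. 9.12: states C=0, H=1; observations "3 1 3"
pi = np.array([0.2, 0.8])                          # P(C|start), P(H|start)
A  = np.array([[0.5, 0.4], [0.3, 0.6]])            # rows: from C, from H
B  = np.array([[0.5, 0.0, 0.1], [0.2, 0.0, 0.4]])  # columns: observations 1, 2, 3
print(viterbi([2, 0, 2], pi, A, B))                # v_1 comes out as [.02, .32], as in the figure
```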

(4)

ASR Search Network

[Figure: the ASR search network at three levels: a network of words (e.g. "the birds are walking" / "the boy is walking"), expanded into a network of phones (e.g. d ax for "the", b oy for "boy"), and expanded in turn into a network of HMM states.]

(5)

Time-state trellis

[Figure: a time-state trellis, with time t on the X-axis and the HMM states of word1, word2, word3 stacked on the Y-axis.]

(6)

Viterbi search over the large trellis

Exact search is infeasible for large vocabulary tasks

Unknown word boundaries

Ngram language models greatly increase the search space

Solutions

Compactly represent the search space using WFST-based optimisations

Beam search: Prune away parts of the search space that aren’t promising


(8)

Two main WFST Optimizations

Recall not all weighted transducers are determinizable

To ensure determinizability of L ○ G, introduce disambiguation symbols in L to deal with homophones in the lexicon:

read : r eh d #0
 red : r eh d #1

Use determinization to reduce/eliminate redundancy

Propagate the disambiguation symbols as self-loops back to C and H. The resulting machines are H̃, C̃, L̃
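As a minimal illustration of the homophone problem, here is a sketch that appends #0, #1, … to colliding entries of a toy lexicon. This is not the full Kaldi/OpenFst recipe (which also disambiguates pronunciations that are prefixes of other pronunciations); the function name and data are illustrative only.

```python
from collections import defaultdict

def add_disambig_symbols(lexicon):
    """lexicon: dict word -> list of phones. Returns a copy in which homophones
    get distinct #k symbols appended, as in the read/red example above."""
    by_pron = defaultdict(list)
    for word, phones in lexicon.items():
        by_pron[tuple(phones)].append(word)
    out = {}
    for phones, words in by_pron.items():
        if len(words) == 1:
            out[words[0]] = list(phones)                  # unique pronunciation: unchanged
        else:
            for k, word in enumerate(sorted(words)):      # homophones: append #k
                out[word] = list(phones) + [f"#{k}"]
    return out

lexicon = {"read": ["r", "eh", "d"], "red": ["r", "eh", "d"], "rob": ["r", "aa", "b"]}
print(add_disambig_symbols(lexicon))
# -> read: r eh d #0, red: r eh d #1, rob: r aa b (unchanged)
```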

(9)

Use determinization to reduce/eliminate redundancy

Use minimization to reduce space requirements

Two main WFST Optimizations

Minimization ensures that the final composed machine has the minimum number of states

Final optimization cascade:

N = πε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))

πε replaces the disambiguation symbols in the input alphabet of H̃ with ε (see the sketch below)
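A rough sketch of this cascade using pynini (Python bindings for OpenFst) is shown below. It assumes H̃, C̃, L̃ and G have already been built as pynini.Fst objects, and it delegates the πε step to a caller-supplied relabel_disambig_to_eps helper (a hypothetical function, not a pynini API). Real recipes additionally handle epsilon removal, weight pushing and HMM self-loops, which are omitted here; this only shows the determinize/minimize skeleton named on the slide.

```python
import pynini

def build_decoding_graph(H_tilde, C_tilde, L_tilde, G, relabel_disambig_to_eps):
    """Sketch of N = pi_eps(min(det(H~ o det(C~ o det(L~ o G))))).

    H_tilde, C_tilde, L_tilde, G: pynini.Fst objects built elsewhere.
    relabel_disambig_to_eps: caller-supplied function mapping every #k input
    label of the final machine to epsilon (the pi_eps step on the slide).
    """
    LG = pynini.determinize(pynini.compose(L_tilde, G))      # det(L~ o G)
    CLG = pynini.determinize(pynini.compose(C_tilde, LG))    # det(C~ o det(L~ o G))
    HCLG = pynini.determinize(pynini.compose(H_tilde, CLG))  # det(H~ o ...)
    HCLG.minimize()                                          # min(...), in place
    return relabel_disambig_to_eps(HCLG)                     # pi_eps(...)
```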

(10)

Example G

[Figure: example G, a three-state acceptor: state 0 goes to state 1 on bob, bond, or rob, and state 1 goes to final state 2 on slept, read, or ate.]

(11)

Compact language models (G)

Use Backoff Ngram language models for G

[Figure: fragment of a backoff Ngram LM represented as a WFSA. The history state (a,b) has an arc c / Pr(c|a,b) to state (b,c) and a backoff arc ε / α(a,b) to the lower-order state (b); state (b) has an arc c / Pr(c|b) to (b,c) and a backoff arc ε / α(b) to the unigram state, which has an arc c / Pr(c); state (b,c) has its own backoff arc ε / α(b,c).]
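The arcs in this figure implement the usual backoff recursion: use Pr(c|a,b) if the trigram was seen, otherwise back off with weight α(a,b) to the bigram estimate, and so on down to the unigram. A small dictionary-based sketch of that lookup follows (the function and data-structure names are illustrative, not tied to any toolkit):

```python
def backoff_prob(word, history, ngram_probs, backoff_weights):
    """Backoff Ngram lookup mirroring the WFSA arcs above.

    ngram_probs:     dict (history..., word) -> Pr(word | history) for seen Ngrams
    backoff_weights: dict (history...) -> alpha(history)
    """
    key = tuple(history) + (word,)
    if key in ngram_probs:                                 # explicit arc c / Pr(c|a,b)
        return ngram_probs[key]
    if not history:                                        # unseen unigram
        return ngram_probs.get((word,), 0.0)
    alpha = backoff_weights.get(tuple(history), 1.0)       # backoff arc eps / alpha(a,b)
    return alpha * backoff_prob(word, list(history)[1:], ngram_probs, backoff_weights)

# Toy usage: trigram (a, b, c) unseen, so Pr(c|a,b) = alpha(a,b) * Pr(c|b) = 0.4 * 0.2
probs  = {("b", "c"): 0.2, ("c",): 0.05}
alphas = {("a", "b"): 0.4, ("b",): 0.7}
print(backoff_prob("c", ("a", "b"), probs, alphas))        # 0.08
```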

(12)

Example G

[Figure: example G repeated: state 0 goes to state 1 on bob, bond, or rob, and state 1 goes to final state 2 on slept, read, or ate.]

(13)

Example L̃: Lexicon with disambiguation symbols

[Figure: L̃, the lexicon transducer with disambiguation symbols, mapping phone sequences to words: b aa b (followed by #0) → bob, b aa n d → bond, r aa b → rob, s l eh p t → slept, r eh d → read, ey t → ate, with ε-arcs returning to the start state after each word.]

(14)

L̃ ○ G

[Figure: the composition L̃ ○ G, a transducer from phone sequences to the word sequences allowed by G, with separate branches for b aa b #0 → bob, b aa n d → bond, and r aa b → rob, each leading into branches for s l eh p t → slept, r eh d → read, and ey t → ate.]

det(L̃ ○ G)

[Figure: det(L̃ ○ G), the determinized composition. The common prefix b aa of bob and bond is now shared, with the word output delayed until the paths diverge (b:bob vs. n:bond), eliminating the redundancy present in L̃ ○ G.]

(15)

min(det(L̃ ○ G))

[Figure: min(det(L̃ ○ G)), the minimized determinized composition. Equivalent suffixes, such as the shared word endings and the ε-arcs back to the start state, are merged, so the machine has fewer states than det(L̃ ○ G).]

det(L̃ ○ G)

[Figure: det(L̃ ○ G), repeated from the previous slide for comparison.]

(16)

Viterbi search over the large trellis

Exact search is infeasible for large vocabulary tasks

Unknown word boundaries

Ngram language models greatly increase the search space

Solutions

Compactly represent the search space using WFST-based optimisations

Beam search: Prune away parts of the search space that aren’t promising

(17)

Beam pruning

At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path

Given active nodes from the last time-step:

Examine nodes in the current time-step that are reachable from active nodes in the previous time-step

Get active nodes for the current time-step by retaining only those nodes whose hypotheses score close to the score of the best hypothesis

(18)

Beam search

At each time-step, beam search keeps only hypotheses with scores that fall within a threshold of the current best hypothesis

Hypotheses with Q(t, s) < δ ⋅ max_{s′} Q(t, s′) are pruned; here, δ controls the beam width

Search errors could occur if the most probable hypothesis gets pruned

Trade-off between avoiding search errors and speeding up decoding (see the sketch below)
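Below is a minimal sketch of one beam-pruned Viterbi time-step over an explicit trellis. Scores are kept in the probability domain to match the slide's Q(t, s) < δ · max criterion; in practice scores are kept in the log domain and the beam becomes additive. All names and data structures are illustrative.

```python
def beam_prune(scores, delta):
    """Keep only nodes with Q(t, s) >= delta * max over s' of Q(t, s')."""
    if not scores:
        return {}
    best = max(scores.values())
    return {s: q for s, q in scores.items() if q >= delta * best}

def beam_viterbi_step(active, transitions, emit, delta):
    """One time-step of beam search over the time-state trellis.

    active:      dict state -> score of the best partial path reaching it (previous step)
    transitions: dict state -> list of (next_state, transition_prob)
    emit:        dict state -> emission probability of the current observation
    delta:       beam width in (0, 1]; smaller values keep more hypotheses
    """
    new_scores = {}
    for s, score in active.items():                  # expand only the active nodes
        for s_next, p in transitions.get(s, []):
            cand = score * p * emit.get(s_next, 0.0)
            if cand > new_scores.get(s_next, 0.0):   # Viterbi max over incoming paths
                new_scores[s_next] = cand
    return beam_prune(new_scores, delta)             # retain only promising hypotheses
```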

(19)

Static and dynamic networks

What we’ve seen so far: Static decoding graph

H ○ C ○ L ○ G

Determinize/minimize to make this graph more compact

Another approach: Dynamic graph expansion

Dynamically build the graph with active states on the fly

Do on-the-fly composition with the language model G

(H ○ C ○ L) ○ G

(20)

Multi-pass search

Some models are too expensive to implement in first-pass decoding (e.g. RNN-based LMs)

First-pass decoding: Use a simpler model (e.g. Ngram LMs) to find the most probable word sequences and represent them as a word lattice or an N-best list

Rescore the first-pass hypotheses using the more complex model to find the best word sequence (see the sketch below)
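A sketch of the rescoring pass on an N-best list follows. Here new_lm_logprob stands in for whatever stronger model is used (e.g. an RNN LM), and the language-model scale factor is an assumed knob; none of these names come from the lecture.

```python
def rescore_nbest(nbest, new_lm_logprob, lm_weight=15.0):
    """Re-rank first-pass hypotheses with a second-pass language model.

    nbest:          list of (words, am_logprob, first_pass_lm_logprob) tuples
    new_lm_logprob: callable mapping a word sequence to its log-probability
                    under the stronger LM
    lm_weight:      language-model scale factor
    """
    rescored = []
    for words, am_logprob, _old_lm_logprob in nbest:
        lm_logprob = new_lm_logprob(words)             # replace the first-pass LM score
        total = am_logprob + lm_weight * lm_logprob    # recombine with the acoustic score
        rescored.append((total, words, am_logprob, lm_logprob))
    return sorted(rescored, key=lambda x: x[0], reverse=True)
```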

(21)

Multi-pass decoding with N-best lists


[…] to wy didn’t include wz (i.e., P(wy|wq,wz) was low for all q). Advanced probabilistic LMs like SCFGs also violate the same dynamic programming assumptions.

There are two solutions to these problems with Viterbi decoding. The most common is to modify the Viterbi decoder to return multiple potential utterances, instead of just the single best, and then use other high-level language model or pronunciation-modeling algorithms to re-rank these multiple outputs (Schwartz and Austin, 1991; Soong and Huang, 1990; Murveit et al., 1993).

The second solution is to employ a completely different decoding algorithm, such as the stack decoder, or A* decoder (Jelinek, 1969; Jelinek et al., 1975). We begin in this section with multiple-pass decoding, and return to stack decoding in the next section.

In multiple-pass decoding we break up the decoding process into two stages. In the first stage we use fast, efficient knowledge sources or algorithms to perform a non- optimal search. So for example we might use an unsophisticated but time-and-space efficient language model like a bigram, or use simplified acoustic models. In the second decoding pass we can apply more sophisticated but slower decoding algorithms on a reduced search space. The interface between these passes is an N-best list or word lattice.

The simplest algorithm for multipass decoding is to modify the Viterbi algorithm to return the N-best sentences (word sequences) for a given speech input. Suppose for example a bigram grammar is used with such an N-best-Viterbi algorithm to return the 1000 most highly-probable sentences, each with their AM likelihood and LM prior score. This 1000-best list can now be passed to a more sophisticated language model like a trigram grammar. This new LM is used to replace the bigram LM score of each hypothesized sentence with a new trigram LM probability. These priors can be combined with the acoustic likelihood of each sentence to generate a new posterior probability for each sentence. Sentences are thus rescored and re-ranked using this more sophisticated probability. Fig. 10.1 shows an intuition for this algorithm.

[Figure: block diagram of two-stage decoding: speech input goes to an N-best decoder using a simple knowledge source, which returns an N-best list ("?Every happy family...", "?In a hole in the ground...", "?If music be the food of love...", "?If music be the foot of dove...", "?Alice was beginning to get..."); rescoring with a smarter knowledge source then yields the 1-best utterance "If music be the food of love...".]

Figure 10.1: The use of N-best decoding as part of a two-stage decoding model. Efficient but unsophisticated knowledge sources are used to return the N-best utterances. This significantly reduces the search space for the second pass models, which are thus free to be very sophisticated but slow.

There are a number of algorithms for augmenting the Viterbi algorithm to generate N-best hypotheses. It turns out that there is no polynomial-time admissible algorithm

Simple algorithm: Modify the Viterbi algorithm to return the N- best word sequences for a given speech input

Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10

(22)

Multi-pass decoding with N-best lists

Simple algorithm: Modify the Viterbi algorithm to return the N- best word sequences for a given speech input

Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10

N-best lists aren’t as diverse as we’d like. And, there is not enough information in N-best lists to effectively use other knowledge sources


for finding the N most likely hypotheses (?). There are, however, a number of approximate (non-admissible) algorithms; we will introduce just one of them, the “Exact N-best” algorithm of Schwartz and Chow (1990). In Exact N-best, instead of each state maintaining a single path/backtrace, we maintain up to N different paths for each state.

But we’d like to insure that these paths correspond to different word paths; we don’t want to waste our N paths on different state sequences that map to the same words. To do this, we keep for each path the word history, the entire sequence of words up to the current word/state. If two paths with the same word history come to a state at the same time, we merge the paths and sum the path probabilities. To keep the N best word sequences, the resulting algorithm requires O(N) times the normal Viterbi time.

Rank  Path                                                  AM logprob   LM logprob
 1.   it’s an area that’s naturally sort of mysterious        -7193.53       -20.25
 2.   that’s an area that’s naturally sort of mysterious      -7192.28       -21.11
 3.   it’s an area that’s not really sort of mysterious       -7221.68       -18.91
 4.   that scenario that’s naturally sort of mysterious       -7189.19       -22.08
 5.   there’s an area that’s naturally sort of mysterious     -7198.35       -21.34
 6.   that’s an area that’s not really sort of mysterious     -7220.44       -19.77
 7.   the scenario that’s naturally sort of mysterious        -7205.42       -21.50
 8.   so it’s an area that’s naturally sort of mysterious     -7195.92       -21.71
 9.   that scenario that’s not really sort of mysterious      -7217.34       -20.70
10.   there’s an area that’s not really sort of mysterious    -7226.51       -20.01

Figure 10.2: An example 10-best list from the Broadcast News corpus, produced by the CU-HTK BN system (thanks to Phil Woodland). Logprobs use log10; the language model scale factor (LMSF) is 15.

The result of any of these algorithms is an N-best list like the one shown in Fig. 10.2.

In Fig. 10.2 the correct hypothesis happens to be the first one, but of course the reason to use N-best lists is that isn’t always the case. Each sentence in an N-best list is also annotated with an acoustic model probability and a language model probability. This allows a second-stage knowledge source to replace one of those two probabilities with an improved estimate.

One problem with an N-best list is that when N is large, listing all the sentences is extremely inefficient. Another problem is that N-best lists don’t give quite as much information as we might want for a second-pass decoder. For example, we might want distinct acoustic model information for each word hypothesis so that we can reapply a new acoustic model for the word. Or we might want to have available different start and end times of each word so that we can apply a new duration model.

For this reason, the output of a first-pass decoder is usually a more sophisticated representation called a word lattice (Murveit et al., 1993; Aubert and Ney, 1995). A word lattice is a directed graph that efficiently represents much more information about possible word sequences.¹ In some systems, nodes in the graph are words and arcs are […]

¹ Actually an ASR lattice is not the kind of lattice that may be familiar to you from mathematics, since it is not required to have the properties of a true lattice (i.e., be a partially ordered set with particular properties, such as a unique join for each pair of elements). Really it’s just a graph, but it is conventional to call it a lattice.

(23)

Multi-pass decoding with lattices

ASR lattice: a weighted automaton/directed graph representing alternative word hypotheses from an ASR system

[Figure: an example ASR word lattice for the utterance above, with competing initial words (so, it’s / it’s / there’s / that’s / that / the), alternative paths through “scenario” vs. “an area that’s”, “naturally” vs. “not really”, and a shared suffix “sort of mysterious”.]

(24)

Multi-pass decoding with lattices

Confusion networks/sausages: Lattices that show competing/confusable words and can be used to compute posterior probabilities at the word level

[Figure: the confusion network derived from the lattice on the previous slide, with slots of confusable words such as {it’s, there’s, that’s, that}, {scenario, an area}, {the}, {naturally, not really}, ending in “sort of mysterious”.]
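Since each slot of a confusion network carries word posteriors, a consensus hypothesis can be read off by taking the highest-posterior word in each slot. A tiny sketch follows; the toy network and its posterior values are invented for illustration, loosely following the example lattice above.

```python
def consensus_hypothesis(confusion_network):
    """confusion_network: list of slots, each a dict word -> posterior
    (posteriors within a slot sum to 1; '<eps>' marks an optional/deleted word)."""
    hyp = []
    for slot in confusion_network:
        word = max(slot, key=slot.get)       # word-level posterior decision per slot
        if word != "<eps>":
            hyp.append((word, slot[word]))
    return hyp

cn = [{"it's": 0.5, "that's": 0.3, "there's": 0.2},
      {"an": 0.8, "the": 0.2},
      {"area": 0.7, "scenario": 0.3},
      {"that's": 0.9, "<eps>": 0.1},
      {"naturally": 0.6, "not really": 0.4},
      {"sort of mysterious": 1.0}]
print(consensus_hypothesis(cn))
```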
