(1)

Automatic Speech Recognition (CS753)

Lecture 15: Search & Decoding (Part I)

Instructor: Preethi Jyothi
Sep 21, 2017

(2)

The Big Picture

(3)

Putting it all together:

How do we recognise an utterance?

A: speech utterance

O_A: acoustic features corresponding to the utterance A

Return the word sequence that jointly assigns the highest probability to O_A:

$$\hat{W} = \arg\max_{W} \Pr(O_A \mid W)\,\Pr(W)$$

How do we estimate Pr(O_A | W) and Pr(W)?

How do we decode?

(4)

Acoustic Model

$$\hat{W} = \arg\max_{W} \Pr(O_A \mid W)\,\Pr(W)$$

$$
\begin{aligned}
\Pr(O_A \mid W) &= \sum_{Q} \Pr(O_A, Q \mid W) \\
&= \sum_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid O_1^{t-1}, q_1^t, w_1^N)\,\Pr(q_t \mid q_1^{t-1}, w_1^N) \\
&\approx \sum_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) && \text{(first-order HMM assumptions)} \\
&\approx \max_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) && \text{(Viterbi approximation)}
\end{aligned}
$$
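To make the effect of the Viterbi approximation concrete, here is a minimal illustrative sketch (not from the lecture) that, for a toy two-state HMM with hypothetical parameters, compares the exact sum over state sequences with the max over a single best state sequence:

```python
import itertools

# Toy 2-state HMM with hypothetical parameters (illustration only).
states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}                      # Pr(q_1)
trans = {("s1", "s1"): 0.7, ("s1", "s2"): 0.3,      # Pr(q_t | q_{t-1})
         ("s2", "s1"): 0.4, ("s2", "s2"): 0.6}
emit = {("s1", "a"): 0.9, ("s1", "b"): 0.1,         # Pr(O_t | q_t)
        ("s2", "a"): 0.2, ("s2", "b"): 0.8}

obs = ["a", "b", "a"]

def path_prob(path):
    """Joint probability of one state sequence and the observations."""
    p = start[path[0]] * emit[(path[0], obs[0])]
    for t in range(1, len(obs)):
        p *= trans[(path[t - 1], path[t])] * emit[(path[t], obs[t])]
    return p

all_paths = list(itertools.product(states, repeat=len(obs)))
exact = sum(path_prob(p) for p in all_paths)     # Pr(O_A | W): sum over all Q
viterbi = max(path_prob(p) for p in all_paths)   # Viterbi approximation: best single Q

print(f"sum over paths = {exact:.4f}, best single path = {viterbi:.4f}")
```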

(5)

Acoustic Model

$$\Pr(O_A \mid W) = \max_{q_1^T} \prod_{t=1}^{T} \underbrace{\Pr(O_t \mid q_t, w_1^N)}_{\text{emission probabilities}}\;\underbrace{\Pr(q_t \mid q_{t-1}, w_1^N)}_{\text{transition probabilities}}$$

Emission probabilities are modeled using a mixture of Gaussians:

$$\Pr(O \mid q; w_1^N) = \sum_{\ell=1}^{L_q} c_{q\ell}\,\mathcal{N}(O \mid \mu_{q\ell}, \Sigma_{q\ell}; w_1^N)$$

All the free parameters (means, covariances, mixture weights, transition probabilities) are learned using the EM (Baum-Welch) algorithm
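As a quick illustration (not part of the lecture), here is a minimal sketch of evaluating such a GMM emission density for one HMM state; the mixture parameters are hypothetical and diagonal covariances are assumed for simplicity:

```python
import numpy as np

def log_gmm_density(o, weights, means, variances):
    """log of sum_l c_l * N(o | mu_l, diag(var_l)) for one HMM state."""
    o = np.asarray(o, dtype=float)
    log_components = []
    for c, mu, var in zip(weights, means, variances):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        # log density of a diagonal-covariance Gaussian
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * var))
        log_exp = -0.5 * np.sum((o - mu) ** 2 / var)
        log_components.append(np.log(c) + log_norm + log_exp)
    return np.logaddexp.reduce(log_components)   # log-sum-exp over mixture components

# Hypothetical 2-component mixture over 3-dimensional acoustic features.
weights = [0.6, 0.4]
means = [[0.0, 1.0, -0.5], [1.5, -1.0, 0.0]]
variances = [[1.0, 0.5, 2.0], [0.8, 1.2, 1.0]]
print(log_gmm_density([0.2, 0.8, -0.3], weights, means, variances))
```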

(6)

Language Model

$$\hat{W} = \arg\max_{W} \Pr(O_A \mid W)\,\Pr(W)$$

Pr(W) is modeled with an m-gram language model:

$$\Pr(W) = \Pr(w_1, w_2, \dots, w_N) = \Pr(w_1) \cdots \Pr(w_N \mid w_{N-m+1}^{N-1})$$

Further optimized using smoothing and interpolation with lower-order Ngram models
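For concreteness, a minimal sketch (not from the slides) of scoring a word sequence with a trigram (m = 3) model stored as a plain dictionary; the probabilities are made up for illustration:

```python
import math

# Hypothetical trigram probabilities Pr(w_n | w_{n-2}, w_{n-1}); "<s>" pads the left context.
trigram = {
    ("<s>", "<s>", "the"): 0.2,
    ("<s>", "the", "boy"): 0.1,
    ("the", "boy", "is"): 0.3,
    ("boy", "is", "walking"): 0.25,
}

def log_prob(words, m=3):
    """log Pr(W) under an m-gram model; returns -inf for unseen m-grams (no smoothing)."""
    padded = ["<s>"] * (m - 1) + words
    total = 0.0
    for n in range(m - 1, len(padded)):
        context = tuple(padded[n - m + 1 : n])
        p = trigram.get(context + (padded[n],), 0.0)
        total += math.log(p) if p > 0 else float("-inf")
    return total

print(log_prob(["the", "boy", "is", "walking"]))
```

In practice, the smoothing and interpolation mentioned above replace the hard -inf for unseen m-grams with a backed-off estimate.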

(7)

Decoding: Search

$$\hat{W} = \arg\max_{W} \Pr(O_A \mid W)\,\Pr(W)$$

$$\hat{W} = \arg\max_{w_1^N,\,N} \left\{ \left[\, \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \left[\, \sum_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) \right] \right\}$$

$$\approx \arg\max_{w_1^N,\,N} \left\{ \left[\, \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \left[\, \max_{q_1^T} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) \right] \right\} \qquad \text{(Viterbi approximation)}$$

The Viterbi approximation divides the above optimisation problem into sub-problems, allowing the efficient application of dynamic programming

The search space is still very large for LVCSR tasks! Use approximate decoding techniques (beam-width decoding, etc.) to visit only promising parts of the search space
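For clarity (this worked form is added here, not taken verbatim from the slides), the sub-problems referred to above are the per-frame trellis scores: writing Q(t, s) for the probability of the best partial path reaching state s at time t (the same notation used on the beam-search slides), dynamic programming computes

$$Q(t, s) = \max_{s'} \; Q(t-1, s')\,\Pr(s \mid s', w_1^N)\,\Pr(O_t \mid s, w_1^N)$$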

(8)

Recall Viterbi search

Viterbi search finds the most probable path through a trellis with time on the X-axis and states on the Y-axis

Viterbi algorithm: Only needs to maintain information about the most probable path at each state

[Figure 9.12 from JM: The Viterbi backtrace through the hot/cold ice-cream HMM trellis for the observation sequence 3 1 3. As we extend each path to a new state to account for the next observation, we keep a backpointer (shown with broken lines) to the best path that led us to this state.]

Finally, we can give a formal definition of the Viterbi recursion as follows:

1. Initialization:

$$v_1(j) = a_{0j}\,b_j(o_1) \qquad 1 \le j \le N$$
$$bt_1(j) = 0$$

2. Recursion (recall that states 0 and q_F are non-emitting):

$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\,a_{ij}\,b_j(o_t) \qquad 1 \le j \le N,\ 1 < t \le T$$
$$bt_t(j) = \operatorname*{arg\,max}_{i=1}^{N} v_{t-1}(i)\,a_{ij}\,b_j(o_t) \qquad 1 \le j \le N,\ 1 < t \le T$$

3. Termination:

The best score: $$P^{*} = v_T(q_F) = \max_{i=1}^{N} v_T(i)\,a_{iF}$$
The start of backtrace: $$q_{T*} = bt_T(q_F) = \operatorname*{arg\,max}_{i=1}^{N} v_T(i)\,a_{iF}$$


Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9
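The recursion above maps almost line-for-line onto code. Below is a minimal sketch (not from the lecture) using the two-state hot/cold ice-cream HMM whose probabilities appear in the JM figure (e.g. P(H|start) = 0.8, P(3|H) = 0.4), decoding the observation sequence 3 1 3:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_prob, best_path) for an HMM, following the recursion above."""
    # Initialization: v_1(j) = a_{0j} * b_j(o_1)
    v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    backptr = [{s: None for s in states}]
    # Recursion
    for t in range(1, len(obs)):
        v.append({})
        backptr.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * trans_p[i][j])
            v[t][j] = v[t - 1][best_i] * trans_p[best_i][j] * emit_p[j][obs[t]]
            backptr[t][j] = best_i
    # Termination and backtrace (the final transition a_{iF} is omitted for simplicity)
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, backptr[t][path[0]])
    return v[-1][last], path

# Hot/cold weather HMM from the JM figure; observations are ice creams eaten per day.
states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
emit_p = {"H": {1: 0.2, 3: 0.4}, "C": {1: 0.5, 3: 0.1}}

print(viterbi([3, 1, 3], states, start_p, trans_p, emit_p))
```

Running this reproduces the trellis values shown in the figure, e.g. v1(H) = 0.32, v2(H) ≈ 0.038, v2(C) = 0.048.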

(9)

ASR Search Network

[Figure: the ASR search network at three levels: a network of words (the birds/boy are/is walking), a network of phones, and a network of HMM states.]

(10)

Time-state trellis

[Figure: time-state trellis, with time t on the X-axis and the HMM states of word1, word2, word3 stacked on the Y-axis.]

(11)

Viterbi search over the large trellis

Exact search is infeasible for large vocabulary tasks

Unknown word boundaries

Ngram language models greatly increase the search space

Solutions

Compactly represent the search space using WFST-based optimisations

Beam search: Prune away parts of the search space that aren’t promising

(12)

Viterbi search over the large trellis

Exact search is infeasible for large vocabulary tasks

Unknown word boundaries

Ngram language models greatly increase the search space

Solutions

Compactly represent the search space using WFST-based optimisations

Beam search: Prune away parts of the search space that aren’t promising

(13)

Two main WFST Optimizations

Recall not all weighted transducers are determinizable

To ensure determinizability of L ○ G, introduce disambiguation symbols in L to deal with homophones in the lexicon:

read : r eh d #0
red : r eh d #1

Use determinization to reduce/eliminate redundancy

Propagate the disambiguation symbols as self-loops back to C and H. Resulting machines are H̃, C̃, L̃
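A minimal sketch (illustrative, not from the lecture) of how disambiguation symbols might be appended to homophone pronunciations before building L̃; it only handles exact homophones, which is the case shown above:

```python
from collections import defaultdict

# Lexicon: word -> phone sequence (the read/red homophone pair from the slide).
lexicon = {
    "read": ["r", "eh", "d"],
    "red":  ["r", "eh", "d"],
    "bob":  ["b", "aa", "b"],
}

def add_disambig(lexicon):
    """Append #0, #1, ... to pronunciations shared by more than one word."""
    by_pron = defaultdict(list)
    for word, phones in lexicon.items():
        by_pron[tuple(phones)].append(word)
    disambig = {}
    for phones, words in by_pron.items():
        if len(words) > 1:                      # homophones: r eh d -> read, red
            for k, word in enumerate(sorted(words)):
                disambig[word] = list(phones) + [f"#{k}"]
        else:
            disambig[words[0]] = list(phones)
    return disambig

for word, phones in add_disambig(lexicon).items():
    print(word, ":", " ".join(phones))
```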

(14)

Two main WFST Optimizations

Use determinization to reduce/eliminate redundancy

Use minimization to reduce space requirements

Minimization ensures that the final composed machine has the minimum number of states

Final optimization cascade:

N = πε(min(det(H̃ ○ det(C̃ ○ det(L̃ ○ G)))))

where πε replaces the disambiguation symbols in the input alphabet of H̃ with ε
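As a rough sketch of how this cascade might be built with OpenFst's Python bindings (pywrapfst): this is an assumption about the tooling, not part of the lecture, and exact method names and signatures can vary across versions; the file names and disambiguation label IDs are hypothetical.

```python
import pywrapfst as fst  # OpenFst Python bindings (assumed available)

# Hypothetical pre-built component FSTs: H~, C~, L~ (with disambig symbols), and G.
H = fst.Fst.read("H_disambig.fst")
C = fst.Fst.read("C_disambig.fst")
L = fst.Fst.read("L_disambig.fst")
G = fst.Fst.read("G.fst")

def det_compose(a, b):
    """det(a o b): arc-sort for composition, compose, then determinize."""
    a.arcsort(sort_type="olabel")
    b.arcsort(sort_type="ilabel")
    return fst.determinize(fst.compose(a, b))

# N = pi_eps(min(det(H~ o det(C~ o det(L~ o G)))))
N = det_compose(H, det_compose(C, det_compose(L, G)))
N.minimize()
# pi_eps: relabel the disambiguation input symbols to epsilon (label id 0);
# the disambig label IDs would come from the symbol table (hypothetical here).
disambig_ids = [1000, 1001]
N.relabel_pairs(ipairs=[(d, 0) for d in disambig_ids])
N.write("HCLG.fst")
```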

(15)

Example G

[Figure: example G, a simple grammar acceptor with states 0 → 1 → 2; arcs bob:bob, bond:bond, rob:rob from state 0 to 1, and arcs slept:slept, read:read, ate:ate from state 1 to 2.]

(16)

Compact language models (G)

Use Backoff Ngram language models for G

[Figure: WFST fragment of a backoff Ngram model. History states (a,b), (b,c), (b), and ε are linked by word arcs such as c / Pr(c|a,b), c / Pr(c|b), and c / Pr(c), and by backoff arcs ε / α(a,b), ε / α(b,c), ε / α(b) that fall back to successively shorter histories.]
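The backoff structure in the figure corresponds to the standard recursive lookup, sketched below with hypothetical probability and backoff tables (illustration only):

```python
import math

# Hypothetical log-probabilities and backoff weights for a trigram backoff model.
logprob = {("a", "b", "c"): math.log(0.5), ("b", "c"): math.log(0.3), ("c",): math.log(0.1)}
backoff = {("a", "b"): math.log(0.4), ("b",): math.log(0.6)}

def backoff_logprob(word, history):
    """log Pr(word | history), following backoff arcs to shorter histories when needed."""
    ngram = tuple(history) + (word,)
    if ngram in logprob:                      # word arc, e.g. c / Pr(c|a,b)
        return logprob[ngram]
    if not history:                           # unigram floor: arc c / Pr(c)
        return logprob.get((word,), float("-inf"))
    # backoff arc eps / alpha(history), then retry with the shorter history
    alpha = backoff.get(tuple(history), 0.0)
    return alpha + backoff_logprob(word, history[1:])

print(math.exp(backoff_logprob("c", ["a", "b"])))   # found as a trigram: 0.5
print(math.exp(backoff_logprob("c", ["x", "b"])))   # backs off to Pr(c | b)
```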

(17)

Example G

[Figure: example G repeated from above: states 0 → 1 → 2, with arcs bob:bob, bond:bond, rob:rob and slept:slept, read:read, ate:ate.]

(18)

Example L̃: Lexicon with disambig symbols

[Figure: L̃, the lexicon transducer with disambig symbols. From state 0, phone paths spell out each word's pronunciation while emitting the word on the first arc: b aa b (bob, followed by the #0 disambiguation symbol), b aa n d (bond), r aa b (rob), s l eh p t (slept), r eh d (read), ey t (ate); each path closes back to the start state.]

(19)

L̃ ○ G and det(L̃ ○ G)

[Figures: L̃ ○ G expands each word sequence of G into its phone path. After determinization, det(L̃ ○ G) merges shared prefixes (e.g. the initial b and aa of bob and bond) into common arcs and delays the word outputs until the paths diverge.]

(20)

min(det(L̃ ○ G)) vs. det(L̃ ○ G)

[Figures: minimization merges equivalent states of det(L̃ ○ G), sharing common suffixes; the largest state index drops from 20 to 15 while the same weighted language is preserved.]

(21)

1st-pass recognition networks (40K vocab): NAB Eval '95

    transducer               x real-time
    C ○ L ○ G                   12.5
    C ○ det(L ○ G)               1.2
    det(H ○ C ○ L ○ G)           1.0
    push(min(F))                 0.7

Recognition speed of the first-pass networks in the NAB 40,000-word vocabulary task at 83% word accuracy.

(Table from Mehryar Mohri's speech recognition course, Courant Institute, NYU.)

(22)

Viterbi search over the large trellis

Exact search is infeasible for large vocabulary tasks

Unknown word boundaries

Ngram language models greatly increase the search space

Solutions

Compactly represent the search space using WFST-based optimisations

Beam search: Prune away parts of the search space that aren’t promising

(23)

Beam pruning

At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path

Given active nodes from the last time-step:

Examine nodes in the current time-step that are reachable from active nodes in the previous time-step

Get active nodes for the current time-step by only retaining nodes with hypotheses that score close to the score of the best hypothesis

(24)

Beam search

At each node, beam search keeps only hypotheses with scores that fall within a threshold of the current best hypothesis

Hypotheses with Q(t, s) < δ · max_{s'} Q(t, s') are pruned; here δ controls the beam width

Search errors could occur if the most probable hypothesis gets pruned

Trade-off between search errors and decoding speed
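A minimal sketch (illustrative only, with hypothetical scores) of the pruning rule applied at one time-step of the trellis; Q maps each active state to its current path probability and delta is the beam width:

```python
def prune_beam(scores, delta):
    """Keep only states whose score is within a factor delta of the best score.

    scores: dict mapping state -> path probability Q(t, s) at the current time-step.
    delta:  beam width in (0, 1]; smaller values prune more aggressively.
    """
    best = max(scores.values())
    return {s: q for s, q in scores.items() if q >= delta * best}

# Hypothetical scores for four active states at some time-step t.
scores = {"s1": 0.040, "s2": 0.032, "s3": 0.0004, "s4": 0.019}
active = prune_beam(scores, delta=0.25)
print(active)   # s3 is pruned: 0.0004 < 0.25 * 0.040
```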

(25)

Static and dynamic networks

What we’ve seen so far: Static decoding graph

H ○ C ○ L ○ G

Determinize/minimize to make this graph more compact

Another approach: Dynamic graph expansion

Dynamically build the graph with active states on the fly

Do on-the-fly composition with the language model G

(H ○ C ○ L) ○ G
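To illustrate the idea of on-the-fly composition (a simplified sketch, not the lecture's implementation): the decoder only instantiates composed states (t_state, g_state) as the search reaches them, instead of materializing the full static graph. Epsilon transitions and weights are omitted for brevity, and the toy automata below are hypothetical:

```python
# Toy transducer T = (H o C o L): maps phones to words (deterministic, no epsilons).
# T_arcs[state][phone] = (next_state, output_word_or_None)
T_arcs = {
    0: {"b": (1, None)},
    1: {"aa": (2, None)},
    2: {"b": (3, "bob")},
}
# Toy acceptor G over words: G_arcs[state][word] = next_state
G_arcs = {0: {"bob": 1}}

def lazy_arcs(state):
    """Expand the outgoing arcs of a composed state (t, g) only when asked for."""
    t, g = state
    for phone, (t_next, word) in T_arcs.get(t, {}).items():
        if word is None:                       # no word output yet: G does not move
            yield phone, word, (t_next, g)
        elif word in G_arcs.get(g, {}):        # G must accept the emitted word
            yield phone, word, (t_next, G_arcs[g][word])

# The composed graph is explored on the fly; only visited states are ever built.
frontier = [(0, 0)]
visited = set(frontier)
while frontier:
    state = frontier.pop()
    for phone, word, nxt in lazy_arcs(state):
        print(f"{state} --{phone}:{word}--> {nxt}")
        if nxt not in visited:
            visited.add(nxt)
            frontier.append(nxt)
```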
