(1)

Automatic Speech Recognition (CS753)

Lecture 4: WFST algorithms contd. + WFSTs in ASR

Instructor: Preethi Jyothi
August 3, 2017

(2)

Quiz-1 Postmortem

Common Mistakes:

Missing insertion/deletion in E.fst

Forgot to mark final states/self-loops

Output vocabulary for T.fst has to be complete words (“bad”, “bead”, etc.) rather than letters

[Bar chart: counts of correct vs. incorrect answers for (a) E.fst and (b) T.fst, on a 0-50 scale]

(3)

Project Proposal

Start brainstorming!

In case of doubt, discuss potential ideas with me during my office hours (Thur, 5:00 pm to 6:30 pm)

Once decided, you will have to fill out a form specifying:

Title of the project

Names/roll numbers of all project members

A 300-400 word abstract of the proposed project

Due by 11:59 pm on Aug 14th

(4)

Composition: Recap

If T1 transduces x to z, and T2 transduces z to y, then T1 ∘ T2 transduces x to y

Note: output alphabet of T1 ⊆ input alphabet of T2


E.g., if T1 removes punctuation symbols from a string, and T2 changes uppercase letters to lowercase letters, then T1 ∘ T2 brings about both changes
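As a concrete sketch using the OpenFst command-line tools introduced later in this lecture (assuming T1.fst and T2.fst are already-compiled transducers; the file names are illustrative):

# Composition matches output labels of T1 against input labels of T2,
# so the arcs must be sorted on those labels first.
fstarcsort --sort_type=olabel T1.fst T1_sorted.fst
fstarcsort --sort_type=ilabel T2.fst T2_sorted.fst
fstcompose T1_sorted.fst T2_sorted.fst T1_o_T2.fst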

(5)

Determinization: Recap

A (W)FST is deterministic if:

Unique start state

No two transitions from a state share the same input label

No epsilon input labels

Not all WFSAs can be determinized

(6)

Determinization: Weighted FSA

Some weighted FSAs are not determinizable! [M97]

Weight of string CDⁿE is n and weight of CDⁿF is 2n

[Figure: a weighted FSA accepting CDⁿE and CDⁿF, where the two D self-loops carry different weights]

After seeing CDⁿ, a finite automaton cannot remember n

[M97] M. Mohri. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, 23(2), 1997

(7)

Determinization: Recap

A (W)FST is deterministic if:

Unique start state

No two transitions from a state share the same input label

No epsilon input labels

Not all WFSAs can be determinized

Determinization is guaranteed to yield an equivalent deterministic WFSA under some technical conditions characterising the automata (e.g., the twins property)
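With OpenFst (introduced later in this lecture), determinization is a single command; a minimal sketch, assuming A.fst is a compiled WFSA satisfying the conditions above:

# Produce an equivalent deterministic machine (may blow up in size, or not
# terminate, if the input is not determinizable).
fstdeterminize A.fst A_det.fst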

(8)

Minimization

Minimization: find an equivalent deterministic FSA with the least number of states (and transitions)

Unweighted FSAs have a unique minimal FSA [Aho74]

[Figure: an unweighted FSA with states 0-3 and arcs labelled C, D, E, F, G, alongside the equivalent minimal FSA]

Obtained by identifying and merging equivalent states

Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.

(9)

Minimization: Weighted FSA

Two states are equivalent only if, for every input string, the outcome (whether the string is accepted and, if so, the weight assigned to it) starting from the two states is the same

Redistribute weights before identifying equivalent states

[Figure: the weighted FSA from the previous slide and its minimal equivalent; weights are redistributed before equivalent states are identified and merged]

(10)

Minimization: Weighted FSA

Reweighting is OK as long as the resulting WFSA is equivalent. Can reweight using a “potential function” on states

[Figure: the weighted FSA before reweighting]

“Weight pushing”: reweighting using a potential function that optimally moves weights towards the start state

[Figure: the same WFSA after weight pushing; each state is assigned a potential (+2, +1, -1, -2, …) and arc weights are adjusted accordingly, moving weight towards the start state]
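In OpenFst this reweighting is available as weight pushing; a minimal sketch, assuming A.fst is a compiled WFSA (by default weights are pushed towards the initial state):

# Push weights towards the initial state (weight pushing).
fstpush --push_weights A.fst A_pushed.fst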

(11)

Minimization: Weighted FSA

After weight pushing, can simply apply unweighted FSA minimization (treating each label/weight pair as a single label)

Guaranteed to yield a minimal WFSA (under the technical conditions required for weight pushing)

[Figure: the weight-pushed WFSA from the previous slide and the minimal WFSA obtained after merging equivalent states]
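A rough sketch of this pipeline with the OpenFst tools, assuming A.fst is a determinizable WFSA (fstminimize expects a deterministic input and handles the weighted case itself):

# Determinize first, then minimize the deterministic result.
fstdeterminize A.fst A_det.fst
fstminimize A_det.fst A_min.fst

The explicit “treat each label/weight pair as a label” trick can also be done by hand with fstencode --encode_weights before minimizing and fstencode --decode afterwards.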

(12)

Toolkits to work with finite-state machines

AT&T FSM Library (no longer supported)


http://www3.cs.stonybrook.edu/~algorith/implement/fsm/implement.shtml

RWTH FSA Toolkit


https://www-i6.informatik.rwth-aachen.de/~kanthak/fsa.html

Carmel


https://www.isi.edu/licensed-sw/carmel/

MIT FST Toolkit


http://people.csail.mit.edu/ilh/fst/

OpenFST Toolkit (actively supported)


http://www.openfst.org/twiki/bin/view/FST/WebHome

(13)

Brief Introduction to OpenFst

(14)

[Figure: a small FST with states 0, 1, 2; arcs: 0→1 labelled an:a, 1→2 labelled <eps>:n, 0→2 labelled a:a; state 2 is final]

Input alphabet (in.txt):
<eps> 0
an 1
a 2

Output alphabet (out.txt):
<eps> 0
a 1
n 2

Text FST (A.txt):
0 1 an a
1 2 <eps> n
0 2 a a
2

The “0” label is reserved for epsilon

Quick Intro to OpenFst (www.openfst.org)

(15)

[Figure: the same FST with weights; state 2 is final with weight 0.1]

Text FST with weights (A.txt):
0 1 an a 0.5
1 2 <eps> n 1.0
0 2 a a 0.5
2 0.1

Quick Intro to OpenFst (www.openfst.org)

(16)

Compiling & Printing FSTs

The text FSTs need to be “compiled” into binary objects before further use with OpenFst utilities

Command used to compile:

fstcompile --isymbols=in.txt --osymbols=out.txt A.txt A.fst

Get back the text FST using a print command with the binary file:

fstprint --isymbols=in.txt --osymbols=out.txt A.fst A.txt
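A small convenience (not on the slide): fstcompile can also store the symbol tables inside the binary FST, so later commands do not need the --isymbols/--osymbols flags:

# Keep the symbol tables inside A.fst itself.
fstcompile --isymbols=in.txt --osymbols=out.txt --keep_isymbols --keep_osymbols A.txt A.fst
fstprint A.fst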

(17)

Drawing FSTs

Small FSTs can be visualized easily using the draw tool:

fstdraw --isymbols=in.txt --osymbols=out.txt A.fst | dot -Tpdf > A.pdf

[Figure: the rendered FST, with arcs 0→1 labelled an:a, 1→2 labelled <eps>:n, 0→2 labelled a:a]

(18)

FSTs can get very large!

(19)

WFSTs applied to ASR

(20)

WFST-based ASR System

[Block diagram: Acoustic indices → Acoustic models → Triphones → Context transducer → Monophones → Pronunciation model → Words → Language model → Word sequence]
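The decoding graph is obtained by composing these transducers; a rough sketch with the OpenFst tools, assuming H.fst, C.fst, L.fst and G.fst are compiled and arc-sorted on the labels being matched (real recipes also interleave determinization and minimization):

# Compose from the grammar outwards; the result maps acoustic indices to word sequences.
fstcompose L.fst G.fst LG.fst
fstcompose C.fst LG.fst CLG.fst
fstcompose H.fst CLG.fst HCLG.fst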

(21)

WFST-based ASR System

[Block diagram repeated; this slide highlights H, the acoustic-model transducer: acoustic indices → triphones]

One 3-state HMM for each triphone (a/a_b, b/a_b, …, x/y_z), with arcs labelled by acoustic-model indices f0, f1, f2, …

FST union of these HMMs, followed by closure, gives the resulting FST H (the figure shows an example arc labelled f0:a/a_b).
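A sketch of this construction with the OpenFst tools, assuming one compiled HMM transducer per triphone (the file names hmm_*.fst are illustrative):

# Fold all per-triphone HMM transducers into one FST with repeated unions ...
cp hmm_a_b.fst H.fst
for f in hmm_*.fst; do
  fstunion H.fst "$f" H_tmp.fst
  mv H_tmp.fst H.fst
done
# ... then take the Kleene closure so H accepts any sequence of triphone HMMs.
fstclosure H.fst H_closed.fst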

(22)

WFST-based ASR System

[Figure: a transducer over two hypothetical phones x and y, with states labelled by (previous, next) phone pairs such as (ϵ,*), (x,y), (y,ϵ), and arcs labelled x:x/ϵ_y, y:y/x_x, etc.]

Figure 8: Context-dependent triphone transducer.

3.1. Transducer Combination

Consider the pronunciation lexicon in Figure 2b. Suppose we form the union of this transducer with the pronunciation transducers for the remaining words in the grammar G of Figure 2a and then take its Kleene closure by connecting an ϵ-transition from each final state to the initial state. The resulting pronunciation lexicon L would pair any sequence of words from that vocabulary to their corresponding pronunciations. Thus,

L ∘ G

gives a transducer that maps from phones to word sequences restricted to G.

We used composition here to implement a context-independent substitution.

However, a major advantage of transducers in speech recognition is that they generalize naturally the notion of context-independent substitution of a label to the context-dependent case. The transducer of Figure 8 does not correspond to a simple substitution, since it describes the mapping from context-independent phones to context-dependent triphonic models, denoted by phone/left context_right context.

Just two hypothetical phones x and y are shown for simplicity. Each state encodes the knowledge of the previous and next phones. State labels in the figure are pairs (a, b) of the past a and the future b, with ϵ representing the start or end of a phone sequence and an unspecified future. For instance, it is easy to see that the phone sequence xyx is mapped by the transducer to x/ϵ_y y/x_x x/y_ϵ via the unique state sequence (ϵ, ∗)(x, y)(y, x)(x, ϵ). More generally, when there are n context-independent phones, this triphonic construction gives a transducer with O(n²) states and O(n³) transitions. A tetraphonic construction would give a transducer with O(n³) states and O(n⁴) transitions. In real applications, context-dependency transducers will benefit significantly from determinization and minimization.

Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002

Arc labels: “monophone : phone / left-context_right-context”

C⁻¹ is the transducer drawn above; its inverse C, the context transducer, maps triphones to monophones.

[Block diagram repeated; this slide highlights C, the context transducer: triphones → monophones]

(23)

WFST-based ASR System

[Block diagram repeated; this slide highlights L, the pronunciation model: monophones → words]

Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002

[Figure 2a: a toy word-level grammar represented as a transducer, with arcs such as using:using/1, data:data/0.66, intuition:intuition/0.33, is:is/0.5, are:are/0.5, better:better/0.7, worse:worse/0.3]

[Figure 2b: a toy pronunciation lexicon transducer mapping phone sequences to the words “data” and “dew”, with arcs d:data/1, ey:ϵ/0.5, ae:ϵ/0.5, t:ϵ/0.3, dx:ϵ/0.7, ax:ϵ/1 and d:dew/1, uw:ϵ/1]

Figure 2: Weighted finite-state transducer examples.

and path weight are those given earlier for acceptors. A path’s output label is the concatenation of output labels of its transitions.

The examples in Figure 2 encode (a superset of) the information in the WFSAs of Figure 1a-b as WFSTs. Figure 2a represents the same language model as Figure 1a by giving each transition identical input and output labels. This adds no new information, but is a convenient way of interpreting any acceptor as a transducer that we will use often.

Figure 2b represents a toy pronunciation lexicon as a mapping from phone sequences to words in the lexicon, in this example data and dew, with probabilities representing the likelihoods of alternative pronunciations. Since a word pronunciation may be a sequence of several phones, the path corresponding to each pronunciation has ϵ-output labels on all but the word-initial transition. This transducer has more information than the WFSA in Figure 1b. Since words are encoded by the output label, it is possible to combine the pronunciation transducers for more than one word without losing word identity. Similarly, HMM structures of the form given in Figure 1c can be combined into a single transducer that preserves phone model identity while sharing distribution subsequences whenever possible.

2.3. Weighted Transducer Algorithms

Speech recognition architectures commonly give the run-time decoder the task of combining and optimizing transducers such as those in Figure 1. The decoder finds word pronunciations in its lexicon and substitutes them into the grammar.

Phonetic tree representations may be used to improve search efficiency at this point [Ortmanns et al., 1996]. The decoder then identifies the correct context-dependent models to use for each phone in context, and finally substitutes them to create an HMM-level transducer. The software that performs these opera…
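As a concrete illustration in the text-FST format from earlier in the lecture (a sketch; the state numbering and probabilities follow Figure 2b, and the file name L_data.txt is illustrative):

L_data.txt (pronunciation of “data”; phones in, word out):
0 1 d data 1
1 2 ey <eps> 0.5
1 2 ae <eps> 0.5
2 3 t <eps> 0.3
2 3 dx <eps> 0.7
3 4 ax <eps> 1
4

Only the word-initial arc carries the word label; the remaining arcs output <eps>, exactly as described above. It would be compiled with fstcompile as before, with phone and word symbol tables in place of in.txt and out.txt.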

(24)

WFST-based ASR System

[Block diagram repeated; this slide highlights G, the language model over word sequences]

[Figure: a toy word-level grammar G with weighted arcs such as the, birds/0.404, boy/1.789, animals/1.789, is, are/0.693, were/0.693, walking]
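As noted in the excerpt above, the word-level grammar is an acceptor, but it can be treated as a transducer by giving every arc identical input and output labels; a sketch in the text-FST format (state numbering and the choice of arcs and weights loosely follow the figure, and are illustrative):

G.txt:
0 1 the the
1 2 birds birds 0.404
1 2 boy boy 1.789
1 2 animals animals 1.789
2 3 is is
2 3 are are 0.693
2 3 were were 0.693
3 4 walking walking
4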
