Automatic Speech Recognition (CS753)
Lecture 4: WFST algorithms contd. + WFSTs in ASR
Instructor: Preethi Jyothi
August 3, 2017
Quiz-1 Postmortem
• Common Mistakes:
• Missing insertion/deletion arcs in E.fst
• Forgetting to mark final states/self-loops
• Output vocabulary for T.fst has to be complete words (“bad”, “bead”, etc.) rather than letters
[Bar chart: number of correct vs. incorrect answers for (a) E.fst and (b) T.fst]
Project Proposal
• Start brainstorming!
• In case of doubt, discuss potential ideas with me during my office hours (Thur, 5:00 pm to 6:30 pm)
• Once decided, you will have to fill out a form specifying:
• Title of the project
• Names/roll numbers of all project members
• A 300-400 word abstract of the proposed project
• Due by 11:59 pm on Aug 14th
Composition: Recap
• If T1 transduces x to z, and T2 transduces z to y,
then T1 ○ T2 transduces x to y
• Note: output alphabet of T1 ⊆ input alphabet of T2
• E.g., if T1 removes punctuation symbols from a string and T2 changes uppercase letters to lowercase letters, then T1 ○ T2 brings about both changes
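As a minimal sketch with OpenFst's command-line tools (assuming T1.fst and T2.fst are compiled binary FSTs), composition can be carried out as follows:

fstarcsort --sort_type=olabel T1.fst T1_sorted.fst   # sort T1's arcs by output label
fstarcsort --sort_type=ilabel T2.fst T2_sorted.fst   # sort T2's arcs by input label
fstcompose T1_sorted.fst T2_sorted.fst T1_o_T2.fst   # T1 ○ T2: transduces x to y whenever T1: x→z and T2: z→y

The arc-sorting step lets fstcompose efficiently match T1's output labels against T2's input labels.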
Determinization: Recap
• A (W)FST is deterministic if:
• Unique start state
• No two transitions from a state share the same input label
• No epsilon input labels
• Not all WFSAs can be determinized
Determinization: Weighted FSA
Some Weighted-FSAs are not determinizable! [M97]
Weight of string CDⁿE = n and weight of CDⁿF = 2n
[Figure: a weighted FSA with states 0–3 and arcs labeled C, D, E, F realizing these weights]
After seeing CDⁿ, an FSA can’t remember n
[M97] M. Mohri. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, 23(2), 1997
Determinization: Recap (contd.)
• Determinization is guaranteed to yield a deterministic WFSA under some technical conditions characterising the automata (e.g., the twins property)
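As a minimal OpenFst sketch (assuming A.fst is a compiled WFSA satisfying these conditions):

fstrmepsilon A.fst | fstdeterminize > A_det.fst   # remove ε-transitions, then determinize

Epsilon removal is applied first, since a machine with ε input labels is not deterministic by the definition above.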
Minimization
Minimization: find an equivalent deterministic FSA with the least number of states (and transitions)
Every deterministic unweighted FSA has a unique equivalent minimal FSA [Aho74]
[Figure: a deterministic FSA with states 0–3 and arcs labeled C, D, E, F, G, and its minimized equivalent in which states 1 and 2 are merged]
Obtained by identifying and merging equivalent states
[Aho74] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
Minimization: Weighted FSA
Two states are equivalent only if, for every input string, the outcome (the weight assigned to the string, if accepted) starting from the two states is the same
Redistribute weights before identifying and merging equivalent states
[Figure: a weighted FSA and its reweighted version, in which the equivalent states become apparent and can be merged]
Minimization: Weighted FSA
Reweighting is OK as long as the resulting WFSA is equivalent. One can reweight using a “potential function” on states.
[Figure: the weighted FSA from the previous slide, before reweighting]
“Weight pushing”: Reweighting using a potential function that optimally moves weights towards the start state
[Figure: the same WFSA after weight pushing, annotated with the state potentials (e.g., +2, +1, −1, −2) used to reweight the arcs]
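OpenFst exposes this operation as fstpush; a minimal sketch, assuming A_det.fst is a compiled, determinized WFSA over the tropical semiring:

fstpush --push_weights A_det.fst A_pushed.fst   # push weights towards the initial state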
Minimization: Weighted FSA
After weight-pushing, one can simply apply unweighted FSA minimization (treating each label/weight pair as a single label)
Guaranteed to yield a minimal WFSA (under some technical conditions required for weight-pushing)
[Figure: the weight-pushed WFSA and the minimal WFSA obtained by merging its equivalent states]
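Continuing the sketch above, the pushed machine can then be minimized (OpenFst's fstminimize operates on deterministic input and handles the weighted case):

fstminimize A_pushed.fst A_min.fst   # merge equivalent states of the deterministic WFSA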
Toolkits to work with finite-state machines
• AT&T FSM Library (no longer supported)
http://www3.cs.stonybrook.edu/~algorith/implement/fsm/implement.shtml
• RWTH FSA Toolkit
https://www-i6.informatik.rwth-aachen.de/~kanthak/fsa.html
• Carmel
https://www.isi.edu/licensed-sw/carmel/
• MIT FST Toolkit
http://people.csail.mit.edu/ilh/fst/
• OpenFst Toolkit (actively supported)
http://www.openfst.org/twiki/bin/view/FST/WebHome
Brief Introduction to OpenFst
[Figure: example transducer with arcs 0→1 (an:a), 1→2 (ε:n), 0→2 (a:a); state 2 is final]

A.txt:
0 1 an a
1 2 <eps> n
0 2 a a
2

Input alphabet (in.txt):
<eps> 0
an 1
a 2

Output alphabet (out.txt):
<eps> 0
a 1
n 2

Note: the “0” label is reserved for epsilon
Quick Intro to OpenFst (www.openfst.org)
[Figure: the same transducer with weights: arcs an:a/0.5, ε:n/1.0, a:a/0.5; final state 2 has final weight 0.1]

Weighted A.txt:
0 1 an a 0.5
1 2 <eps> n 1.0
0 2 a a 0.5
2 0.1
Quick Intro to OpenFst (www.openfst.org)
Compiling & Printing FSTs
The text FSTs need to be “compiled” into binary objects before further use with OpenFst utilities
• Command used to compile:
fstcompile --isymbols=in.txt --osymbols=out.txt A.txt A.fst
• Get back the text FST using a print command with the binary file:
fstprint --isymbols=in.txt --osymbols=out.txt A.fst A.txt
Drawing FSTs
Small FSTs can be visualized easily using the draw tool:
fstdraw --isymbols=in.txt --osymbols=out.txt A.fst | dot -Tpdf > A.pdf
[Figure: rendered drawing of A.fst with arcs an:a, a:a, <eps>:n]
FSTs can get very large!
WFSTs applied to ASR
WFST-based ASR System
[Pipeline: Acoustic indices → Acoustic models (H) → Triphones → Context transducer (C) → Monophones → Pronunciation model (L) → Words → Language model (G) → Word sequence]
H: one 3-state HMM for each triphone (a/a_b, b/a_b, …, x/y_z). Taking the FST union of these HMMs, followed by closure, gives the resulting FST H.
[Figure: per-triphone 3-state HMM FSTs with ε-output arcs f1:ε, …, f6:ε and the triphone label (e.g., a_b) emitted on the initial arc; union and closure combine them into H]
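A minimal OpenFst sketch of this union-plus-closure construction, with hypothetical per-triphone HMM FSTs hmm_a_b.fst, hmm_b_a.fst, …:

fstunion hmm_a_b.fst hmm_b_a.fst H_tmp.fst   # union the per-triphone HMM FSTs
fstunion H_tmp.fst hmm_x_y.fst H_all.fst     # ... repeating for every remaining triphone
fstclosure H_all.fst H.fst                   # Kleene closure: arbitrary-length sequences of triphone models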
WFST-based ASR System
[Figure: two hypothetical phones x and y; states labeled with pairs (previous, next) such as (ε,∗), (x,ε), (x,x), (x,y), (y,ε), (y,x), (y,y); arcs of the form x:x/ε_ε, x:x/ε_x, …, y:y/y_y]
Figure 8: Context-dependent triphone transducer.
3.1. Transducer Combination
Consider the pronunciation lexicon in Figure 2b. Suppose we form the union of this transducer with the pronunciation transducers for the remaining words in the grammar G of Figure 2a and then take its Kleene closure by connecting an ϵ-transition from each final state to the initial state. The resulting pronunciation lexicon L would pair any sequence of words from that vocabulary to their corresponding pronunciations. Thus,
L ◦ G
gives a transducer that maps from phones to word sequences restricted to G.
We used composition here to implement a context-independent substitution. However, a major advantage of transducers in speech recognition is that they generalize naturally the notion of context-independent substitution of a label to the context-dependent case. The transducer of Figure 8 does not correspond to a simple substitution, since it describes the mapping from context-independent phones to context-dependent triphonic models, denoted by phone/left-context_right-context.
Just two hypothetical phones x and y are shown for simplicity. Each state encodes the knowledge of the previous and next phones. State labels in the figure are pairs (a, b) of the past a and the future b, with ϵ representing the start or end of a phone sequence and ∗ an unspecified future. For instance, it is easy to see that the phone sequence xyx is mapped by the transducer to x/ϵ_y y/x_x x/y_ϵ via the unique state sequence (ϵ, ∗)(x, y)(y, x)(x, ϵ). More generally, when there are n context-independent phones, this triphonic construction gives a transducer with O(n²) states and O(n³) transitions. A tetraphonic construction would give a transducer with O(n³) states and O(n⁴) transitions. In real applications, context-dependency transducers will benefit significantly from determinization and minimization.
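A minimal command-line sketch of the L ◦ G combination, assuming compiled L.fst and G.fst whose symbol tables agree (so that L's output words match G's input words):

fstarcsort --sort_type=olabel L.fst L_sorted.fst   # sort L's arcs by output (word) label
fstarcsort --sort_type=ilabel G.fst G_sorted.fst   # sort G's arcs by input (word) label
fstcompose L_sorted.fst G_sorted.fst LG.fst        # maps phone sequences to word sequences restricted to G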
Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002
Arc labels: “monophone : phone / left-context_right-context”
C⁻¹: the transducer shown in Figure 8 maps monophones to triphones; inverting it gives the context transducer C (triphones → monophones).
WFST-based ASR System: L (pronunciation model, mapping monophone sequences to words)
Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002
[Figure: (a) a word-level grammar with arcs such as using:using/1, data:data/0.66, intuition:intuition/0.33, is:is/0.5, are:are/0.5, is:is/1, better:better/0.7, worse:worse/0.3; (b) a toy pronunciation lexicon with paths d:data/1 ey:ε/0.5 | ae:ε/0.5 t:ε/0.3 | dx:ε/0.7 ax:ε/1 for “data”, and d:dew/1 uw:ε/1 for “dew”]
Figure 2: Weighted finite-state transducer examples.
and path weight are those given earlier for acceptors. A path’s output label is the concatenation of output labels of its transitions.
The examples in Figure 2 encode (a superset of) the information in the WFSAs of Figure 1a-b as WFSTs. Figure 2a represents the same language model as Figure 1a by giving each transition identical input and output labels. This adds no new information, but is a convenient way of interpreting any acceptor as a transducer that we will use often.
Figure 2b represents a toy pronunciation lexicon as a mapping from phone sequences to words in the lexicon, in this example data and dew, with probabilities representing the likelihoods of alternative pronunciations. Since a word pronunciation may be a sequence of several phones, the path corresponding to each pronunciation has ϵ-output labels on all but the word-initial transition. This transducer has more information than the WFSA in Figure 1b. Since words are encoded by the output label, it is possible to combine the pronunciation transducers for more than one word without losing word identity. Similarly, HMM structures of the form given in Figure 1c can be combined into a single transducer that preserves phone model identity while sharing distribution subsequences whenever possible.
2.3. Weighted Transducer Algorithms
Speech recognition architectures commonly give the run-time decoder the task of combining and optimizing transducers such as those in Figure 1. The decoder finds word pronunciations in its lexicon and substitutes them into the grammar.
Phonetic tree representations may be used to improve search efficiency at this point [Ortmanns et al., 1996]. The decoder then identifies the correct context-dependent models to use for each phone in context, and finally substitutes them to create an HMM-level transducer. The software that performs these operations…
WFST-based ASR System: G (language model over word sequences)
[Figure: a word-level grammar G with arcs such as the, is, walking, birds/0.404, boy/1.789, animals/1.789, are/0.693, were/0.693]
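Putting the components together, the full cascade H ◦ C ◦ L ◦ G described in Mohri et al. can be sketched with OpenFst commands as follows (file names hypothetical; the arc-sorting, determinization, and minimization of intermediate results discussed above are omitted for brevity):

fstcompose L.fst G.fst LG.fst      # phones → word sequences
fstcompose C.fst LG.fst CLG.fst    # triphones → word sequences
fstcompose H.fst CLG.fst HCLG.fst  # acoustic indices → word sequences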