[Figure: block diagram of Automatic Speech Recognition. Inputs: speaker, room acoustics, noise, microphone. Stages: front end (signal processing and feature extraction); local match (probability estimation); global decoder (time align, pattern match utterance), which uses a language model (can include semantics) and produces hypotheses. Output: word string.]
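[Figure: three-state left-to-right HMM with states q1, q2, q3. Self-loop probabilities p(q1 | q1), p(q2 | q2), p(q3 | q3); forward transition probabilities p(q2 | q1), p(q3 | q2); emission densities p(x_n | q1), p(x_n | q2), p(x_n | q3).]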
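The labels above describe a three-state left-to-right HMM with transition probabilities p(q_j | q_i) and emission densities p(x_n | q_j). As a minimal sketch of how such a model assigns frames to states, the following Python example runs a standard Viterbi search in the log domain. The function name `viterbi`, the assumption that the path starts in the first state and ends in the last, and all numeric values are illustrative and do not come from the source.

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """Return the most likely state path for an HMM.

    log_trans[i, j] = log p(q_j | q_i)
    log_emit[n, j]  = log p(x_n | q_j)
    Assumption: the path starts in state 0 and ends in the last state.
    """
    n_frames, n_states = log_emit.shape
    delta = np.full((n_frames, n_states), -np.inf)   # best log score so far
    back = np.zeros((n_frames, n_states), dtype=int)  # best predecessor
    delta[0, 0] = log_emit[0, 0]
    for n in range(1, n_frames):
        scores = delta[n - 1][:, None] + log_trans    # scores[i, j]
        back[n] = np.argmax(scores, axis=0)
        delta[n] = scores[back[n], np.arange(n_states)] + log_emit[n]
    # Trace back from the final state.
    path = [n_states - 1]
    for n in range(n_frames - 1, 0, -1):
        path.append(int(back[n, path[-1]]))
    return path[::-1]

# Made-up probabilities for a 3-state model and 5 frames.
with np.errstate(divide="ignore"):  # log(0) = -inf is intended here
    log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                                 [0.0, 0.7, 0.3],
                                 [0.0, 0.0, 1.0]]))
    log_emit = np.log(np.array([[0.90, 0.05, 0.05],
                                [0.80, 0.10, 0.10],
                                [0.10, 0.80, 0.10],
                                [0.10, 0.70, 0.20],
                                [0.05, 0.05, 0.90]]))

path = viterbi(log_trans, log_emit)
print(path)
```

Zero entries in the transition matrix encode the left-to-right topology: a state can only repeat or move to its successor.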
[Figure: network mapping acoustic vectors to P(phone | acoustic vectors).]
[Figure: diagram with labels g = p g, (1-g), <F> = p(1-g), <F> = (1-p)g.]
[Figure: path through HMM states over time (frame index n) for the phone sequence "d" "a" "d", with states q1 to q6.]
[Figure: ANN taking x_n plus acoustic context as input and producing p(q_k | x_n) on output units k = 1, ..., K.]
[Figure: MLP estimating P(phone | acoustic vectors). Input: 9 acoustic vectors, each 26-dimensional. Hidden layer: 500-4000 hidden units. Output: 61 phones. Example phone labels: dh ax kcl k ae tcl t.]
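The dimensions given above (9 vectors of 26 dimensions in, 500-4000 hidden units, 61 phone outputs) can be made concrete with a small forward-pass sketch. This is only an illustration under stated assumptions: the sigmoid hidden layer, the softmax output layer, the random weights, and the name `mlp_posteriors` are not taken from the source, which gives only the layer sizes.

```python
import numpy as np

N_FRAMES, N_FEATS = 9, 26   # context window size from the figure labels
N_HIDDEN = 500              # low end of the stated 500-4000 range
N_PHONES = 61               # number of phone outputs

def mlp_posteriors(window, w1, b1, w2, b2):
    """Map a (9, 26) window of acoustic vectors to 61 phone posteriors."""
    x = window.reshape(-1)                        # 9 * 26 = 234 inputs
    h = 1.0 / (1.0 + np.exp(-(w1 @ x + b1)))      # sigmoid hidden layer (assumed)
    z = w2 @ h + b2
    z = z - z.max()                               # stabilise the softmax
    e = np.exp(z)
    return e / e.sum()                            # softmax output (assumed)

# Random weights stand in for trained ones.
rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_FRAMES * N_FEATS))
b1 = np.zeros(N_HIDDEN)
w2 = rng.normal(scale=0.1, size=(N_PHONES, N_HIDDEN))
b2 = np.zeros(N_PHONES)

window = rng.normal(size=(N_FRAMES, N_FEATS))
posteriors = mlp_posteriors(window, w1, b1, w2, b2)
print(posteriors.shape, posteriors.sum())
```

With these sizes the first weight matrix alone holds 234 x 500 = 117,000 parameters.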
[Figure: training flow chart. Boxes: train MLP with TIMIT; MLP weights; recognize developmental set; baseline score; viterbi alignment; labeled training; train MLP for task; MLP weights; recognize developmental set; score. Decision: "improved?" with Yes and No branches, terminating in Done.]
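The flow chart labels above suggest an iterative procedure: obtain a baseline score, then repeat alignment, training, and recognition until the score stops improving. A minimal sketch of that control flow follows. It assumes that the "Yes" branch loops back to alignment, that "No" ends the procedure, and that a higher score is better; none of these is stated in the labels. The callables `train`, `align`, and `recognize` are hypothetical placeholders, shown here with toy stand-ins.

```python
def embedded_training(train, align, recognize, initial_weights, max_iters=10):
    """Repeat align -> train -> recognize while the score improves.

    train, align and recognize are caller-supplied (hypothetical) callables:
      align(weights)     -> labeled training data (e.g. from Viterbi alignment)
      train(labels)      -> new MLP weights
      recognize(weights) -> score on the developmental set (higher is better)
    """
    best_weights = initial_weights
    best_score = recognize(best_weights)      # baseline score
    for _ in range(max_iters):
        labels = align(best_weights)
        weights = train(labels)
        score = recognize(weights)
        if score <= best_score:               # improved? No -> done
            break
        best_weights, best_score = weights, score  # improved? Yes -> iterate
    return best_weights, best_score

# Toy stand-ins: "weights" is just a number and the score saturates at 3.
result = embedded_training(
    train=lambda labels: labels + 1,
    align=lambda weights: weights,
    recognize=lambda weights: min(weights, 3),
    initial_weights=0,
)
print(result)
```

Keeping the best weights seen so far means a final iteration that fails to improve is discarded rather than returned.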
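[Figure: context-dependent network. Labels: Data; binary left context; binary right context; Hidden; context-independent (c.i.) output; context-dependent (c.d.) output left; c.d. output right.]
[Figure: network with labels Data, M/F, Hidden, Output Probabilities.]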
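[Figure: training flow chart with lexicon generation. Boxes: train MLP with TIMIT; MLP weights; recognize developmental set; baseline score; viterbi alignment; labeled training; generate multiple pronunciation lexicon; train MLP for task; MLP weights; recognize developmental set; score. Decision: "improved?" with Yes and No branches, terminating in Done.]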
[Figure: Monolithic Net vs. Parallel Net Architecture. Monolithic net: I input units, H hidden units, O output units. Parallel architecture: Net 1, Net 2, ..., Net n, each with I input, H/n hidden, and O output units, combined by Function Averaging.]
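The comparison above splits one network with H hidden units into n networks with H/n hidden units each and combines them by function averaging. The sketch below assumes that "function averaging" means taking the arithmetic mean of the n output vectors; the labels do not say this explicitly. The layer sizes, the sigmoid and softmax nonlinearities, the random weights, and the function names are likewise illustrative choices.

```python
import numpy as np

def net_output(x, w1, w2):
    """One small net: sigmoid hidden layer, softmax output (both assumed)."""
    h = 1.0 / (1.0 + np.exp(-(w1 @ x)))
    z = w2 @ h
    e = np.exp(z - z.max())
    return e / e.sum()

def averaged_output(x, nets):
    """Average the output vectors of all parallel nets."""
    return np.mean([net_output(x, w1, w2) for w1, w2 in nets], axis=0)

# Illustrative sizes: I inputs, H hidden units split over n nets, O outputs.
I, H, O, n = 8, 12, 4, 3
rng = np.random.default_rng(0)
nets = [(rng.normal(size=(H // n, I)), rng.normal(size=(O, H // n)))
        for _ in range(n)]

x = rng.normal(size=I)
y = averaged_output(x, nets)
print(y.shape, y.sum())
```

Because each net outputs a probability vector, their mean is also a probability vector, so the combined output can be used wherever the monolithic net's output would be.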