Lecture 8: Tied state HMMs + DNNs in ASR

(1)

Automatic Speech Recognition (CS753)

Lecture 8: Tied state HMMs + DNNs in ASR

Instructor: Preethi Jyothi | Aug 17, 2017

(2)

Final Project Landscape

Musical note extraction
I know who!
Sign language to speech conversion
InfoGAN for music
Code by voice
Keystroke detection from keyboard acoustics
Stenograph
Script generator for conversations
Swapping instruments in recordings
Transcribing TED talks
Emotion recognition using multimodal cues
End-to-end speech-to-text translation
Voice conversion using GANs
Guitar note recognition
Audio classification
ASR for speech with stutters
Text-independent speaker verification
Multi-source speech extraction in noisy environments
Voice assistant
End-to-end speaker recognition

(3)

Recap: Tied state HMMs

Four main steps in building a tied state HMM system:

1. Create and train 3-state monophone HMMs with single-Gaussian observation probability densities.

2. Clone these monophone distributions to initialise a set of untied triphone models. Train them using Baum-Welch estimation. The transition matrix remains common across all triphones of each phone.

3. For all triphones derived from the same monophone, cluster states whose parameters should be tied together.

4. Increase the number of mixture components in each tied state and re-estimate the models using Baum-Welch.

Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994

(4)

Tied state HMMs

Four main steps in building a tied state HMM system:

1. Create and train 3-state monophone HMMs with single-Gaussian observation probability densities.

2. Clone these monophone distributions to initialise a set of untied triphone models. Train them using Baum-Welch estimation. The transition matrix remains common across all triphones of each phone.

3. For all triphones derived from the same monophone, cluster states whose parameters should be tied together.

4. Increase the number of mixture components in each tied state and re-estimate the models using Baum-Welch.

Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994

Which states should be tied together? Use decision trees.

(5)

How do we build these phone DTs?

1. What questions are used?


Linguistically-inspired binary questions: “Does the left or right phone come from a broad class of phones such as vowels, stops, etc.?” “Is the left or right phone [k] or [m]?”

2. What is the training data for each phone state, p_j? (root node of DT)

(6)

Training data for DT nodes

Align the training data, x_i = (x_i1, …, x_iT_i), i = 1…N, where x_it ∈ ℝ^d, against a set of triphone HMMs.

Use the Viterbi algorithm to find the best HMM state sequence corresponding to each x_i.

Tag each x_it with the ID of the current phone along with its left-context and right-context:

sil/b/aa    b/aa/g    aa/g/sil

For example, x_it is tagged with ID aa2[b/g], i.e. x_it is aligned with the second state of the 3-state HMM corresponding to the triphone b/aa/g.

For a state j in phone p, collect all x_it’s that are tagged with ID p_j[?/?].

(7)

How do we build these phone DTs?

1. What questions are used? 


Linguistically-inspired binary questions: “Does the left or right phone come from a broad class of phones such as vowels, stops, etc.?” “Is the left or right phone [k] or [m]?”

2. What is the training data for each phone state, p_j? (root node of DT)

All speech frames that align with the j-th state of every triphone HMM that has p as the middle phone.

3. What criterion is used at each node to find the best question to split the data on?

Find the question which partitions the states in the parent node so as to give the maximum increase in log likelihood.

(8)

Likelihood of a cluster of states

If a cluster of HMM states, S = {s_1, s_2, …, s_M}, consists of M states, and a total of K acoustic observation vectors {x_1, x_2, …, x_K} are associated with S, then the log likelihood associated with S is:

L(S) = Σ_{i=1}^{K} Σ_{s ∈ S} log Pr(x_i; μ_S, Σ_S) γ_s(x_i)

where γ_s(x_i) is the posterior probability of observation x_i being generated by state s.

For a question q that splits S into S_yes and S_no, compute the following quantity:

Δ_q = L(S_yes^q) + L(S_no^q) − L(S)

Go through all questions, find Δ_q for each question q, and choose the question for which Δ_q is the biggest.

Terminate when the final Δ_q is below a threshold, or when the data associated with a split falls below a threshold.
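A minimal Python sketch of this splitting criterion, assuming hard Viterbi assignments (so γ_s(x_i) ∈ {0, 1} and a cluster's frames are simply pooled) and a single diagonal-covariance Gaussian per cluster; all names are illustrative, not from the lecture:

```python
import numpy as np

def cluster_log_likelihood(frames):
    """Log likelihood of pooling `frames` (K x d) into one ML-fit
    diagonal-covariance Gaussian. For an ML fit, the log likelihood
    reduces to -K/2 * (d*log(2*pi) + sum(log var) + d)."""
    K, d = frames.shape
    var = frames.var(axis=0) + 1e-8          # diagonal covariance, floored
    return -0.5 * K * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(frames, answers_yes):
    """Delta_q = L(S_yes) + L(S_no) - L(S) for one question q.
    answers_yes: boolean mask saying which frames answer 'yes'."""
    if answers_yes.all() or not answers_yes.any():
        return float("-inf")                  # degenerate split: never chosen
    S_yes, S_no = frames[answers_yes], frames[~answers_yes]
    return (cluster_log_likelihood(S_yes)
            + cluster_log_likelihood(S_no)
            - cluster_log_likelihood(frames))

# Pick the question with the largest gain, e.g.:
# best_q = max(questions, key=lambda q: split_gain(frames, q.applies(contexts)))
```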

(9)

Likelihood criterion

Given a phonetic question, let the initial set of untied states S be split into two partitions, S_yes and S_no.

Each partition is clustered to form a single Gaussian output distribution with mean μ_Syes and covariance Σ_Syes (and likewise for S_no).

Use the likelihood of the parent state and the subsequent split states to determine which question a node should be split on.

Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994

(10)

Example: Phonetic Decision Tree (DT)

DT for the center state of [ow]. One tree is constructed for each state of each phone to cluster all the corresponding triphone states.

Head node: uses all training data tagged as ow2[?/?], e.g. aa/ow2/f, aa/ow2/s, aa/ow2/d, h/ow2/p, aa/ow2/n, aa/ow2/g, …

Is left ctxt a vowel?
  Yes: Is right ctxt a fricative?
    Yes: Leaf A: aa/ow2/f, aa/ow2/s, …
    No: Is right ctxt nasal?
      Yes: Leaf E: aa/ow2/n, aa/ow2/m, …
      No: Leaf B: aa/ow2/d, aa/ow2/g, …
  No: Is right ctxt a glide?
    Yes: Leaf C: h/ow2/l, b/ow2/r, …
    No: Leaf D: h/ow2/p, b/ow2/k, …

(11)

For an unseen triphone at test time

Transition matrix: use the transition matrix common to all triphones of the phone.

State observation densities: use the triphone identity to traverse all the way to a leaf of the decision tree, and use the state observation probabilities associated with that leaf.
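As a minimal sketch of that traversal (the node structure and the question functions below are illustrative assumptions, not from the lecture):

```python
# Hypothetical node structure: internal nodes hold a question over the
# left/right context, leaves hold the index of a tied state (senone).
class Node:
    def __init__(self, question=None, yes=None, no=None, senone=None):
        self.question, self.yes, self.no, self.senone = question, yes, no, senone

def tied_state_for(tree, left_ctxt, right_ctxt):
    """Traverse the decision tree for one HMM state of a (possibly
    unseen) triphone and return the tied state at the leaf."""
    node = tree
    while node.senone is None:
        node = node.yes if node.question(left_ctxt, right_ctxt) else node.no
    return node.senone

# Example question: "Is the left context a vowel?"
VOWELS = {"aa", "ae", "ah", "ow", "iy"}   # illustrative subset
is_left_vowel = lambda left, right: left in VOWELS
```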

(12)

That’s a wrap on HMM-based acoustic models

[Figure: the decoding cascade. Acoustic indices feed the acoustic models (the transducer H, with arcs such as a/a_b, b/a_b, …, x/y_z), producing triphones; the context transducer maps triphones to monophones; the pronunciation model maps monophones to words; the language model scores the word sequence. One 3-state HMM is built for each tied-state triphone, with parameters estimated using the Baum-Welch algorithm; the FST union and closure of the per-triphone HMMs f1, f2, f3, … yields the resulting FST H, with arcs such as f0:a:a_b.]

(13)

DNN-based acoustic models?

[Figure: the same decoding cascade (acoustic indices → acoustic models → triphones → context transducer → monophones → pronunciation model → words → language model → word sequence), with the acoustic model transducer H highlighted.]

Can we use deep neural networks instead of HMMs to learn mappings between acoustics and phones?

[Figure: hybrid CD-DNN-HMM architecture from Dahl et al., “Context-Dependent Pre-Trained Deep Neural Networks for LVSR”: the HMM models the sequential property of the speech signal, while the DNN models the scaled observation likelihoods of all the senones (tied triphone states); the same DNN is replicated over different points in time. Its outputs are phone posteriors.]

(14)

Brief Introduction to Neural Networks

(15)

Feed-forward Neural Network

[Figure: input layer → hidden layer → output layer.]

(16)

Feed-forward Neural Network
Brain Metaphor

Single neuron: inputs x_i arrive on connections with weights w_i, and the output is y = g(Σ_i w_i ⋅ x_i), where g is the activation function.

Image from: https://upload.wikimedia.org/wikipedia/commons/1/10/Blausen_0657_MultipolarNeuron.png

(17)

Feed-forward Neural Network
Parameterized Model

[Figure: a small network with input units 1, 2, hidden units 3, 4, and output unit 5, connected by weights w13, w14, w23, w24, w35, w45.]

a5 = g(w35 ⋅ a3 + w45 ⋅ a4)
   = g(w35 ⋅ g(w13 ⋅ a1 + w23 ⋅ a2) + w45 ⋅ g(w14 ⋅ a1 + w24 ⋅ a2))

Parameters of the network: all w_ij (and biases, not shown here).

If x is a 2-dimensional vector and the layer above it is a 2-dimensional vector h, a fully-connected layer is associated with:

h = xW + b

where w_ij in W is the weight of the connection between the i-th neuron in the input row and the j-th neuron in the first hidden layer, and b is the bias vector.
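A minimal numpy sketch of such a layer, applied to the 2-2-1 network above (zero biases, tanh as a stand-in for g; weight values are illustrative):

```python
import numpy as np

def fully_connected(x, W, b, g=np.tanh):
    """One fully-connected layer: h = g(xW + b)."""
    return g(x @ W + b)

x = np.array([0.5, -1.0])          # inputs a1, a2
W1 = np.array([[0.1, 0.3],         # w13, w14
               [0.2, 0.4]])        # w23, w24
W2 = np.array([[0.5],              # w35
               [0.6]])             # w45
a3_a4 = fully_connected(x, W1, 0.0)
a5 = fully_connected(a3_a4, W2, 0.0)   # matches the nested formula above
```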

(18)

Feed-forward Neural Network
Parameterized Model

A 1-layer feedforward neural network has the form:

MLP(x) = g(xW1 + b1) W2 + b2

[Figure: the same 2-2-1 network, with a5 = g(w35 ⋅ a3 + w45 ⋅ a4) = g(w35 ⋅ g(w13 ⋅ a1 + w23 ⋅ a2) + w45 ⋅ g(w14 ⋅ a1 + w24 ⋅ a2)).]

The simplest neural network is the perceptron:

Perceptron(x) = xW + b
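In the same numpy style, a toy sketch of both forms (names are illustrative):

```python
import numpy as np

def perceptron(x, W, b):
    """Perceptron(x) = xW + b: a purely linear map."""
    return x @ W + b

def mlp1(x, W1, b1, W2, b2, g=np.tanh):
    """1-layer MLP: MLP(x) = g(xW1 + b1) W2 + b2.
    The nonlinearity g is what makes it more expressive than a perceptron."""
    return perceptron(g(perceptron(x, W1, b1)), W2, b2)
```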

(19)

Common Activation Functions (g)

[Plot: sigmoid over x ∈ [−10, 10]; output in (0, 1).]

Sigmoid: σ(x) = 1/(1 + e^{−x})

(20)

Common Activation Functions (g)

[Plot: sigmoid and tanh over x ∈ [−10, 10]; output in (−1, 1).]

Sigmoid: σ(x) = 1/(1 + e^{−x})

Hyperbolic tangent (tanh): tanh(x) = (e^{2x} − 1)/(e^{2x} + 1)

(21)

Common Activation Functions (g)

[Plot: sigmoid, tanh, and ReLU over x ∈ [−10, 10].]

Sigmoid: σ(x) = 1/(1 + e^{−x})

Hyperbolic tangent (tanh): tanh(x) = (e^{2x} − 1)/(e^{2x} + 1)

Rectified Linear Unit (ReLU): ReLU(x) = max(0, x)
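All three are one-liners in numpy; a sketch (note that evaluating the tanh formula above naively overflows for large |x|, so np.tanh is the numerically stable choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def tanh(x):
    return np.tanh(x)                 # output in (-1, 1); equals (e^{2x} - 1)/(e^{2x} + 1)

def relu(x):
    return np.maximum(0.0, x)         # zero for x < 0, identity for x >= 0
```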

(22)

Optimization Problem

To train a neural network, define a loss function L(y, ỹ): a function of the true output y and the predicted output ỹ

L(y,ỹ) assigns a non-negative numerical score to the neural network’s output, ỹ

The parameters of the network are set to minimise L over the training examples (i.e. a sum of losses over different training samples)

L is typically minimised using a gradient-based method

(23)

Stochastic Gradient Descent (SGD)

SGD Algorithm

Inputs: function NN(x; θ), training examples x_1 … x_n with outputs y_1 … y_n, and loss function L.

do until stopping criterion:
    Pick a training example x_i, y_i
    Compute the loss L(NN(x_i; θ), y_i)
    Compute the gradient ∇L of L with respect to θ
    θ ← θ − η ∇L
done

Return: θ
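A minimal Python rendering of that loop, assuming a user-supplied grad_loss(theta, x, y) that returns ∇L at one example (the fixed iteration budget is an illustrative stand-in for the unspecified stopping criterion):

```python
import numpy as np

def sgd(grad_loss, theta, xs, ys, lr=0.1, num_steps=10_000):
    """theta <- theta - lr * gradient of the loss at one random example."""
    rng = np.random.default_rng(0)
    for _ in range(num_steps):            # stand-in stopping criterion
        i = rng.integers(len(xs))         # pick a training example (x_i, y_i)
        theta = theta - lr * grad_loss(theta, xs[i], ys[i])
    return theta
```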

(24)

Training a Neural Network

Define the Loss function to be minimised as a node L

Goal: Learn weights for the neural network which minimise L

Gradient Descent: find ∂L/∂w for every weight w, and update it as w ← w − η ∂L/∂w

How do we efficiently compute ∂L/∂w for all w?

We will compute ∂L/∂u for every node u in the network!

∂L/∂w = (∂L/∂u)(∂u/∂w), where u is the node which uses w

(25)

Training a Neural Network

New goal: compute ∂L/∂u for every node u in the network. Simple algorithm: backpropagation.

Key fact: the chain rule of differentiation.

If L can be written as a function of variables v_1, …, v_n, which in turn depend (partially) on another variable u, then

∂L/∂u = Σ_i (∂L/∂v_i)(∂v_i/∂u)

(26)

Backpropagation

If L can be written as a function of variables v_1, …, v_n, which in turn depend (partially) on another variable u, then

∂L/∂u = Σ_i (∂L/∂v_i)(∂v_i/∂u)

Consider v_1, …, v_n as the layer above u, Γ(u). Then the chain rule gives

∂L/∂u = Σ_{v ∈ Γ(u)} (∂L/∂v)(∂v/∂u)
(27)

Backpropagation

∂L/∂u = Σ_{v ∈ Γ(u)} (∂L/∂v)(∂v/∂u)

Forward pass: first, compute the values of all nodes given an input. (The values of each node will be needed during backprop.)

Backpropagation:
Base case: ∂L/∂L = 1
For each u (top to bottom):
    For each v ∈ Γ(u):
        Inductively, we have already computed ∂L/∂v
        Directly compute ∂v/∂u
    Compute ∂L/∂u

Finally, compute ∂L/∂w = (∂L/∂u)(∂u/∂w), where u is the node that uses w; values computed in the forward pass may be needed here.
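Putting the two passes together for the 1-layer MLP from earlier, a self-contained sketch with squared-error loss (an illustrative choice; the ASR networks later in the lecture use cross-entropy):

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    """Forward pass + backprop for MLP(x) = tanh(xW1 + b1) W2 + b2
    with loss L = 0.5 * ||y_hat - y||^2."""
    # Forward pass: keep intermediate node values for backprop.
    z1 = x @ W1 + b1
    h = np.tanh(z1)
    y_hat = h @ W2 + b2
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: apply the chain rule from the loss downward.
    dL_dyhat = y_hat - y                        # base case at the output
    dL_dW2 = np.outer(h, dL_dyhat)
    dL_db2 = dL_dyhat
    dL_dh = W2 @ dL_dyhat                       # sum over the layer above, Γ(u)
    dL_dz1 = dL_dh * (1 - np.tanh(z1) ** 2)     # tanh'(z) = 1 - tanh(z)^2
    dL_dW1 = np.outer(x, dL_dz1)
    dL_db1 = dL_dz1
    return loss, (dL_dW1, dL_db1, dL_dW2, dL_db2)
```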

(28)

History of Neural Networks in ASR

Neural networks for speech recognition were explored as early as 1987

Deep neural networks for speech:

Beat state-of-the-art on the TIMIT corpus [M09]

Significant improvements shown on large-vocabulary systems [D11]

Now the dominant ASR paradigm [H12]

[M09] A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” NIPS Workshop on Deep Learning for Speech Recognition, 2009.

[D11] G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition,” TASL 20(1), pp. 30–42, 2012.

[H12] G. Hinton, et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition”, IEEE Signal Processing Magazine, 2012.

(29)

What’s new?

Why have NN-based systems come back to prominence?

Important developments

Vast quantities of data available for ASR training

Fast GPU-based training

Improvements in optimization/initialization techniques

Deeper networks enabled by fast training

Larger output spaces enabled by fast training and availability of data

(30)

Neural Networks for ASR

Two main categories of approaches have been explored:

1. Hybrid neural network-HMM systems: Use DNNs to estimate HMM observation probabilities

2. Tandem system: NNs used to generate input features that are fed to an HMM-GMM acoustic model

(31)

Neural Networks for ASR

Two main categories of approaches have been explored:

1. Hybrid neural network-HMM systems: Use DNNs to estimate HMM observation probabilities

2. Tandem system: DNNs used to generate input features that are fed to an HMM-GMM acoustic model

(32)

Decoding an ASR system

Recall how we decode the most likely word sequence W for an acoustic sequence O:

W* = arg max_W Pr(O | W) Pr(W)

The acoustic model Pr(O | W) can be further decomposed as (here, Q and M represent triphone and monophone sequences, respectively):

Pr(O | W) = Σ_{Q,M} Pr(O, Q, M | W)
          = Σ_{Q,M} Pr(O | Q, M, W) Pr(Q | M, W) Pr(M | W)
          ≈ Σ_{Q,M} Pr(O | Q) Pr(Q | M) Pr(M | W)

(33)

Hybrid system decoding

You’ve seen Pr(O | Q) estimated using a Gaussian mixture model. Let’s use a neural network instead to model Pr(O | Q).

Pr(O | W) ≈ Σ_{Q,M} Pr(O | Q) Pr(Q | M) Pr(M | W)

Pr(O | Q) = Π_t Pr(o_t | q_t)

Pr(o_t | q_t) = Pr(q_t | o_t) Pr(o_t) / Pr(q_t)
             ∝ Pr(q_t | o_t) / Pr(q_t)

where o_t is the acoustic vector at time t and q_t is a triphone HMM state.

Here, Pr(q_t | o_t) are posteriors from a trained neural network; Pr(o_t | q_t) is then a scaled posterior.
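In log space this scaled posterior is a one-liner; a sketch assuming the DNN emits per-frame log posteriors over senones (array names are illustrative):

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """log Pr(o_t | q_t) up to a constant: log Pr(q_t | o_t) - log Pr(q_t).
    Pr(o_t) is dropped because it is the same for every state and so does
    not change the decoded word sequence.
    log_posteriors: (T, num_senones) DNN outputs; log_priors: (num_senones,)."""
    return log_posteriors - log_priors
```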

(34)

Computing Pr(q_t | o_t) using a deep NN

[Figure: the hybrid CD-DNN-HMM architecture of Dahl et al. A deep NN takes a fixed window of 5 speech frames (39 features in each frame) as input and has triphone state labels at its output; the HMM models the sequential property of the speech signal, while the DNN models the scaled observation likelihoods of all the senones (tied triphone states).]

How do we get these labels in order to train the NN?

(35)

Triphone labels

Forced alignment: use the current acoustic model to find the most likely sequence of HMM states given a sequence of acoustic vectors. (Which algorithm helps compute this?)

The “Viterbi paths” for the training data are referred to as forced alignments.

[Figure: the training word sequence w1, …, wN is expanded via the dictionary into a phone sequence p1, …, pN; Viterbi alignment against the triphone HMMs maps the acoustic vectors o1, o2, o3, o4, …, oT to HMM states, e.g. sil1/b/aa, sil1/b/aa, sil2/b/aa, sil2/b/aa, …, ee3/k/sil.]
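Once the alignment is computed, turning it into per-frame training targets is mechanical; a hypothetical sketch (the (senone_id, duration) segment format is an assumption for illustration, not a Kaldi/HTK API):

```python
def frame_labels(alignment):
    """Expand a Viterbi alignment, given as (senone_id, num_frames)
    segments, into one senone label per frame: the DNN's targets."""
    labels = []
    for senone_id, num_frames in alignment:
        labels.extend([senone_id] * num_frames)
    return labels

# e.g. frame_labels([(12, 3), (47, 2)]) -> [12, 12, 12, 47, 47]
```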

(36)

Computing Pr(q_t | o_t) using a deep NN

[Figure: the same hybrid CD-DNN-HMM: a fixed window of 5 speech frames (39 features in each frame) in, triphone state labels out.]

How do we get these labels in order to train the NN? (Viterbi) Forced alignment.

(37)

Computing priors Pr(q_t)

To compute HMM observation probabilities Pr(o_t | q_t), we need both Pr(q_t | o_t) and Pr(q_t).

The posterior probabilities Pr(q_t | o_t) are computed using a trained neural network.

The priors Pr(q_t) are the relative frequencies of each triphone state, as determined by the forced Viterbi alignment of the training data.
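A sketch of that estimate, reusing the per-frame labels from the forced alignment (the flooring constant is an illustrative safeguard, not from the lecture):

```python
from collections import Counter

def senone_priors(all_frame_labels, num_senones, floor=1e-8):
    """Pr(q) as relative frequencies of senone labels over the forced
    alignment of the training data (floored so that dividing posteriors
    by priors never hits zero)."""
    counts = Counter(all_frame_labels)
    total = len(all_frame_labels)
    return [max(counts[q] / total, floor) for q in range(num_senones)]
```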

(38)

Hybrid Networks

The hybrid networks are trained with a minimum cross-entropy criterion:

L(y, ŷ) = −Σ_i y_i log(ŷ_i)

Advantages of hybrid systems:

1. No assumptions made about acoustic vectors being uncorrelated: multiple inputs are used from a window of time steps

2. Discriminative objective function
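As a sketch: with a one-hot target y over senones, this criterion reduces to the negative log probability assigned to the correct senone (eps is an illustrative numerical guard):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y_hat) = -sum_i y_i * log(y_hat_i).
    y: one-hot (or soft) target over senones; y_hat: DNN softmax output."""
    return -np.sum(y * np.log(y_hat + eps))
```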

(39)

Summary of DNN-HMM acoustic models
Comparison against HMM-GMM on different tasks

[Table: a comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.]

Task                                     | Hours of training data | DNN-HMM | GMM-HMM with same data | GMM-HMM with more data
Switchboard (test set 1)                 | 309                    | 18.5    | 27.4                   | 18.6 (2,000 h)
Switchboard (test set 2)                 | 309                    | 16.1    | 23.6                   | 17.1 (2,000 h)
English Broadcast News                   | 50                     | 17.5    | 18.8                   |
Bing Voice Search (sentence error rates) | 24                     | 30.4    | 36.2                   |
Google Voice Input                       | 5,870                  | 12.3    |                        | 16.0 (>>5,870 h)
YouTube                                  | 1,400                  | 47.6    | 52.3                   |

Table copied from G. Hinton, et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition”, IEEE Signal Processing Magazine, 2012.

Hybrid DNN-HMM systems consistently outperform GMM-HMM systems (sometimes even when the latter are trained with lots more data).

(40)

Neural Networks for ASR

Two main categories of approaches have been explored:

1. Hybrid neural network-HMM systems: Use DNNs to estimate HMM observation probabilities

2. Tandem system: NNs used to generate input features that are fed to an HMM-GMM acoustic model

(41)

Tandem system

First, train a DNN to estimate the posterior probabilities of each subword unit (monophone, triphone state, etc.)

In a hybrid system, these posteriors (after scaling) would be used as observation probabilities for the HMM acoustic

models

In the tandem system, the DNN outputs are used as “feature”

inputs to HMM-GMM models

(42)

Bottleneck Features

[Figure: input layer → hidden layers → low-dimensional bottleneck layer → output layer.]

Use a low-dimensional bottleneck layer representation to extract features.

These bottleneck features are in turn used as inputs to HMM-GMM models.
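A sketch of feature extraction from such a trained network, stopping the forward pass at the bottleneck (the layers list and bottleneck_index are illustrative assumptions about how the trained weights are stored):

```python
import numpy as np

def bottleneck_features(x, layers, bottleneck_index, g=np.tanh):
    """Run a trained feed-forward net up to its bottleneck layer and
    return that layer's activations as features for an HMM-GMM system.
    layers: list of (W, b) pairs; bottleneck_index: which layer is the
    low-dimensional one."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = g(h @ W + b)
        if i == bottleneck_index:
            return h          # features, not the final senone posteriors
    raise ValueError("bottleneck_index out of range")
```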
