(1)

Instructor: Preethi Jyothi

Discriminative Training

Lecture 22

CS 753

(2)

Recall: MLE for HMMs

Maximum likelihood estimation (MLE) sets HMM parameters so as to maximise the objective function:

$$\mathcal{L} = \sum_{i=1}^{N} \log P(X_i \mid W_i)$$

where X_1, …, X_i, …, X_N are the training utterances. (Assume M_i is the HMM corresponding to the word sequence W_i of X_i, and λ corresponds to the HMM parameters.)

What are some conceptual problems with this approach?

(3)

Discriminative Learning

Discriminative models directly model the class posterior probability, or learn the parameters of a joint probability model discriminatively so that classification errors are minimised.

This is in contrast to generative models, which attempt to learn a probability model of the data distribution.

[Vapnik] "one should solve the (classification/recognition) problem directly and never solve a more general problem as an intermediate step"

[Vapnik]: V. Vapnik, Statistical Learning Theory, 1998

(4)

Discriminative Learning

Two central issues in developing discriminative learning methods:

1. Constructing suitable objective functions for optimisation

2. Developing optimisation techniques for these objective functions

(5)

Estimating acoustic model parameters

If A is a speech utterance and O_A are the acoustic features corresponding to A:

ASR decoding: Return the word sequence that jointly assigns the highest probability to O_A:

$$W^* = \arg\max_{W} P_\lambda(O_A \mid W)\, P(W)$$

How do we estimate λ in P_λ(O_A | W)?

MLE estimation

MMI estimation

MPE/MWE estimation

(6)

Estimating acoustic model parameters

If A is a speech utterance and O_A are the acoustic features corresponding to A:

ASR decoding: Return the word sequence that jointly assigns the highest probability to O_A:

$$W^* = \arg\max_{W} P_\lambda(O_A \mid W)\, P(W)$$

How do we estimate λ in P_λ(O_A | W)?

MLE estimation

MMI estimation

MPE/MWE estimation

Covered in this class

(7)

Maximum mutual information (MMI) estimation:

Discriminative Training

MMI aims to directly maximise the posterior probability (a criterion also referred to as conditional maximum likelihood):

$$\mathcal{F}_{\text{MMI}} = \sum_{i=1}^{N} \log P(W_i \mid X_i) = \sum_{i=1}^{N} \log \frac{P(X_i \mid W_i)\, P(W_i)}{\sum_{W_j} P(X_i \mid W_j)\, P(W_j)}$$

P(W) is the language model probability.

(8)

Why is it called MMI?

Mutual information I(X, W) between acoustic data X and word labels W is defined as:

$$I(X, W) = \sum_{X,W} \Pr(X, W) \log \frac{\Pr(X, W)}{\Pr(X)\,\Pr(W)} = \sum_{X,W} \Pr(X, W) \log \frac{\Pr(W \mid X)}{\Pr(W)} = H(W) - H(W \mid X)$$

where H(W) is the entropy of W and H(W|X) is the conditional entropy

(9)

Why is it called MMI?

Assume H(W) is given via the language model. Then, maximising mutual information becomes equivalent to minimising the conditional entropy, estimated over the training set as:

$$H(W \mid X) \approx -\frac{1}{N} \sum_{i=1}^{N} \log \Pr(W_i \mid X_i) = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\Pr(X_i \mid W_i)\,\Pr(W_i)}{\sum_{W'} \Pr(X_i \mid W')\,\Pr(W')}$$

Thus, MMI is equivalent to maximising:

$$\mathcal{F}_{\text{MMI}} = \sum_{i=1}^{N} \log \frac{P(X_i \mid W_i)\, P(W_i)}{\sum_{W_j} P(X_i \mid W_j)\, P(W_j)}$$

(10)

MMI estimation

$$\mathcal{F}_{\text{MMI}} = \sum_{i=1}^{N} \log \frac{P(X_i \mid W_i)\, P(W_i)}{\sum_{W_j} P(X_i \mid W_j)\, P(W_j)}$$

Numerator: Likelihood of the data given the correct word sequence

Denominator: Total likelihood of the data given all possible word sequences

How do we compute this?

(11)

Recall: Word Lattices

A word lattice is a pruned version of the decoding graph for an utterance

Acyclic directed graph with arc costs computed from acoustic model and language model scores

Lattice nodes implicitly capture information about time within the utterance.

[Figure: "Fig. 2.6 Example lattice and confusion network" from Gales & Young (2008) — (a) a word lattice over competing hypotheses (words such as I, HAVE, MOVE, IT, VEAL, VERY, OFTEN, FINE, FAST, with optional silences), and (b) the corresponding confusion network, laid out along a time axis.]

Image from [GY08]: Gales & Young, Application of HMMs in speech recognition, NOW book, 2008

(12)

MMI estimation

$$\mathcal{F}_{\text{MMI}} = \sum_{i=1}^{N} \log \frac{P(X_i \mid W_i)\, P(W_i)}{\sum_{W_j} P(X_i \mid W_j)\, P(W_j)}$$

Numerator: Likelihood of the data given the correct word sequence

Denominator: Total likelihood of the data given all possible word sequences

How do we compute this? Estimate by generating lattices, and summing over all the word sequences in the lattice.

(13)

MMI Training and Lattices

Computing the denominator: Estimate by generating lattices, and summing over all the word sequences in the lattice.

Numerator lattices: Restrict G to a linear chain acceptor representing the words in the correct word sequence.

Lattices are usually only computed once for MMI training.

HMM parameter estimation for MMI uses the extended Baum-Welch algorithm [V96, WP00].

Can DNNs, like HMMs, also be trained with an MMI-type objective function? Yes!

[V96]: Valtchev et al., Lattice-based discriminative training for large vocabulary speech recognition, 1996
[WP00]: Woodland and Povey, Large scale discriminative training for speech recognition, 2000
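To sketch how the denominator sum can be computed from a lattice, the following is a minimal log-domain forward pass over an acyclic word lattice. The `(src, dst, log_score)` arc representation and the assumption that node ids are already in topological order are simplifications for illustration; real lattice formats (e.g., HTK or Kaldi lattices) carry separate acoustic/LM scores and time alignments.

```python
import math
from collections import defaultdict

def lattice_total_log_prob(num_nodes, arcs, start=0):
    """Log of the summed score of all paths from `start` to the last node.

    arcs: list of (src, dst, log_score), where log_score combines the
        acoustic and language model scores on that word arc.
    Nodes are assumed numbered 0..num_nodes-1 in topological order
    (possible because a word lattice is an acyclic directed graph).
    """
    NEG_INF = float("-inf")
    out = defaultdict(list)
    for src, dst, w in arcs:
        out[src].append((dst, w))
    alpha = [NEG_INF] * num_nodes
    alpha[start] = 0.0
    for u in range(num_nodes):            # visit nodes in topological order
        if alpha[u] == NEG_INF:
            continue                      # unreachable node
        for v, w in out[u]:
            x = alpha[u] + w
            if alpha[v] == NEG_INF:       # first path reaching v
                alpha[v] = x
            else:                         # log-add the new path's score
                m = max(alpha[v], x)
                alpha[v] = m + math.log(math.exp(alpha[v] - m)
                                        + math.exp(x - m))
    return alpha[num_nodes - 1]

# Two parallel arcs with probabilities 0.6 and 0.3:
# lattice_total_log_prob(2, [(0, 1, math.log(0.6)), (0, 1, math.log(0.3))])
# returns log(0.9).
```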

(14)

Sequence-discriminative (MMI) Training of DNNs

In a hybrid system, DNNs are typically trained to optimise the cross-entropy objective function using SGD.

We could maximise MMI instead, i.e., maximise the mutual information between the distributions of the observation and word sequences [V et al.].

Compute gradients of the MMI objective function with respect to the activations at the output layer.

[V et al.]: Vesely et al., Sequence discriminative training of DNNs, Interspeech 2013
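Concretely, following Vesely et al. (2013), the derivative of the MMI objective with respect to the log-likelihood of state s at frame t reduces to the difference between numerator and denominator lattice occupancies, scaled by the acoustic scale κ. The sketch below assumes those occupancy matrices have already been obtained by forward-backward over the two lattices; the array shapes are illustrative assumptions.

```python
import numpy as np

def mmi_output_gradient(gamma_num, gamma_den, kappa=1.0):
    """Error signal at the DNN output layer for MMI training.

    gamma_num: [T, S] state occupancies from forward-backward over the
        numerator (reference) lattice.
    gamma_den: [T, S] state occupancies over the denominator lattice.
    kappa: acoustic scaling factor.

    Returns kappa * (gamma_num - gamma_den), which is then backpropagated
    through the network like any other gradient.
    """
    return kappa * (np.asarray(gamma_num) - np.asarray(gamma_den))
```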

(15)

MMI results on Switchboard

Switchboard results on two eval sets (SWB, CHE). Trained on 300 hours of speech. Comparing maximum likelihood (ML) against discriminatively trained GMM systems and MMI-trained DNNs. All numbers are word error rates (%).

System     SWB    CHE    Total
GMM ML     21.2   36.4   28.8
GMM MMI    18.6   33.0   25.8
DNN CE     14.2   25.7   20.0
DNN MMI    12.9   24.6   18.8

[V et al.]: Vesely et al., Sequence discriminative training of DNNs, Interspeech 2013

(16)

Another Discriminative Training Objective: Minimum Phone/Word Error (MPE/MWE)

MMI is an optimisation criterion at the sentence level. Change the criterion so that it is directly related to sub-sentence (i.e., word or phone) error rate.

The MPE/MWE objective function is defined as:

$$\mathcal{F}_{\text{MPE/MWE}} = \sum_{i=1}^{N} \frac{\sum_{W} P(X_i \mid W)\, P(W)\, A(W, W_i)}{\sum_{W'} P(X_i \mid W')\, P(W')}$$

where A(W, W_i) is the phone/word accuracy of the sentence W given the reference sentence W_i, i.e., the total phone count in W_i minus the sum of insertion/deletion/substitution errors of W.

(17)

MPE/MWE training

The MPE/MWE criterion is a weighted average of the phone/word accuracy over all the training instances:

$$\mathcal{F}_{\text{MPE/MWE}} = \sum_{i=1}^{N} \frac{\sum_{W} P(X_i \mid W)\, P(W)\, A(W, W_i)}{\sum_{W'} P(X_i \mid W')\, P(W')}$$

A(W, W_i) can be computed either at the phone or word level, for the MPE or MWE criterion respectively.

The weighting given by MPE/MWE depends on the number of incorrect phones/words in the string, while MMI only looks at whether the entire sentence is correct or not.

(18)

MPE results on Switchboard (GMMs)

Switchboard results on eval set SWB. Trained on 68 hours of speech. Comparing maximum likelihood (MLE) against discriminatively trained (MMI/MPE/MWE) GMM systems.

System     SWB %WER   Redn.
GMM MLE    46.6       -
GMM MMI    44.3       2.3
GMM MPE    43.1       3.5
GMM MWE    43.3       3.3

[V96]: Valtchev et al., Lattice-based discriminative training for large vocabulary speech recognition, 1996
[WP00]: Woodland and Povey, Large scale discriminative training for speech recognition, 2000

(19)

Sequence-discriminative training results on Switchboard (DNNs)

Switchboard results from DNNs trained on the full 300-hour training set, using different optimisation criteria. All numbers are word error rates (%).

System      SWB    CHE    Total
GMM MMI     18.6   33.0   25.8
DNN CE      14.2   25.7   20.0
DNN MMI     12.9   24.6   18.8
DNN sMBR    12.6   24.1   18.4
DNN MPE     12.9   24.1   18.5

[V et al.]: Vesely et al., Sequence discriminative training of DNNs, Interspeech 2013

(20)

CS-753 Concluding Remarks

(21)

Topics covered

Formalism: Finite State Transducers

[Diagram: ASR pipeline — speech signal → Acoustic Feature Generator → O → SEARCH, combining the Acoustic Model (phones), Pronunciation Model and Language Model → word sequence W*]

Properties of speech sounds
Acoustic signal processing
Ngram/RNN LMs
G2P models
Search algorithms
Hidden Markov Models
Deep NN-based models
Hybrid HMM-DNN systems
Speaker adaptation
Discriminative training

(22)

Topics covered

[Diagram: mapping acoustic signals O from the speech signal to word sequences W*]

End-to-end Neural Models
Ngram/RNN LMs
Speech Synthesis

(23)

Exciting time to do speech research

(24)

Called Hype Cycle for a reason…

[Image: technology hype cycle chart, with SPEECH RECOGNITION marked]

(25)

What's next?

Need to do more…

Robust to variations in age, accent and ability
Handling noisy real-life settings with many speakers (e.g., meetings, parties)
Handling pronunciation variability
Handling new languages/dialects

(26)

E.g.: ASR on accented speech

Hypothesis (WER 21%): DESPITE THE JULY DECLINE TO <UNK> ITS AUGUST REMAINED SEVEN POINT SEVEN OH CENT LEVEL THAT THE ABILITY OF THAT

Hypothesis (WER 3%): DESPITE THE JULY DECLINE DURABLE GOODS ORDERS REMAINS SEVEN POINT SEVEN PERCENT ABOVE THE YEAR EARLIER LEVEL

(27)

Speech interfaces

[Chart: adoption figures for speech interfaces — 1.8M / 2.8M / 13.9M; 100% / 89% / 61%; 29M / 88M / 347M]

(28)

What's next?

Need to do more…

Robust to variations in age, accent and ability
Handling noisy real-life settings with many speakers (e.g., meetings, parties)
Handling pronunciation variability
Handling new languages/dialects

… with less

Fast (real-time) decoding using limited computational power/memory
Faster training algorithms
Reduce duplicated effort across domains/languages
Reduce dependence on language-specific resources
Train with less labeled data

(29)

Remaining Coursework

(30)

Participation Points

Six in-class mini-quizzes

Total points out of 20 (Quiz 2 scaled to 4 points)

10 points or more earns the full 5 participation points
[8-10) — 4
[6-8) — 3
[4-6) — 2
[2-4) — 1
< 2 — 0

Quiz   Points   # of responses
1      3        96
2      10       79
3      4        99
4      4        76
5      2        68
6      3        53

(31)

Final Exam Syllabus

1. WFST algorithms/WFSTs used in ASR
2. HMM algorithms/EM/Tied-state triphone models
3. DNN-based acoustic models
4. N-gram/Smoothing/RNN language models
5. End-to-end ASR (CTC, LAS, RNN-T)
6. MFCC feature extraction
7. Search & Decoding
8. HMM-based speech synthesis models
9. Multilingual ASR
10. Speaker Adaptation
11. Discriminative training of HMMs

Questions can be asked on any of the 11 topics listed above. You will be allowed a single A4 cheat sheet of handwritten notes; content on both sides permitted.

(32)

Final Project

Deliverables

4-5 page final report: Task definition, Methodology, Prior work, Implementation Details, Experimental Setup, Experiments and Discussion, Error Analysis (if any), Summary

Short talk summarizing the project: Each team will get 8-10 minutes for their presentation and 5 minutes for Q/A

Clearly demarcate which team member worked on what part

(33)

Final Project Grading

Break-up of 20 points:

6 points for the report

4 points for the presentation

6 points for Q/A

4 points for overall evaluation of the project

(34)

Final Project Schedule

Presentations will be held on Nov 23rd and Nov 24th

The final report in PDF format should be sent to pjyothi@cse.iitb.ac.in before Nov 24th.

The order of presentations will be decided on a lottery basis and shared via Moodle before Nov 9th
