Automatic Speech Recognition (CS753)
(1)

Automatic Speech Recognition (CS753)

Lecture 1: Introduction to Statistical Speech Recognition

Instructor: Preethi Jyothi · July 24, 2017

(2)

Course Specifics

(3)

Pre-requisites

Ideal Background:

Completed one of “Foundations of ML (CS 725)” or “Advanced ML (CS 726)” or “Foundations of Intelligent Agents (CS 747)” at IITB or have completed an ML course elsewhere.

Also acceptable as pre-req:

Completed courses in EE that deal with ML concepts.

Experience working on research projects that are ML-based.

Less ideal but still works:

Comfortable with probability, linear algebra and multivariable calculus. (Currently enrolled in CS 725.)

(4)

About the course (I)

Main Topics:

• Introduction to statistical ASR

• Acoustic models: Hidden Markov models, deep neural network-based models

• Pronunciation models

• Language models (Ngram models, RNN-LMs)

• Decoding search problem (Viterbi algorithm, etc.)

(5)

About the course (II)

Course webpage: www.cse.iitb.ac.in/~pjyothi/cs753

Reading: All mandatory reading will be freely available online. Reading material will be posted on the website.

Attendance: Strongly advised to attend all lectures, given there’s no fixed textbook and a lot of the material covered in class will not be on the slides

Audit requirements:

Complete all three assignments and score ≥40% on each of them

(6)

Evaluation — Assignments

Grading: 3 assignments + 1 mid-sem exam making up 50% of the grade.

Format:

1. One assignment will be almost entirely programming-based. The other two will contain a mix of problems to be solved by hand and programming questions.

2. Mid-sem and final exam will test concepts you’ve been taught in class.

Late Policy: 10% reduction in marks for every additional day past the due date. Submissions close three days after the due date.

(7)

Evaluation — Final Project

Grading: Constitutes 25% of the total grade. (Exceptional projects could get extra credit. Details posted on website.)

Team: 2-3 members. Individual projects are highly discouraged.

Project requirements:

• Discuss proposed project with me on or before August 17th.

• Intermediate deadline: Project progress report. Due on September 28th.

• Finally, turn in: a 4-5 page final report about methodology & detailed experiments

• Project presentation/demo

(8)

Evaluation — Final Project

About the Project:

• Could be implementation of ideas learnt in class, applied to real data (and/or to a new task)

• Could be a new idea/algorithm (with preliminary experiments)

• Excellent projects can turn into conference/workshop papers

(9)

Evaluation — Final Project


Sample project ideas:

• Detecting accents from speech

• Sentiment classification from voice-based reviews

• Language recognition from speech segments

• Audio search of speeches by politicians

(10)

Final Project Landscape (Spring ’17)

• Automatic authorised ASR
• Bird call recognition
• End-to-end audio-visual speech recognition
• InfoGAN for music
• Keyword spotting for continuous speech
• Music genre classification
• Nationality detection from speech accents
• Sanskrit synthesis and recognition
• Speech synthesis & ASR for Indic languages
• Programming with speech-based commands
• Voice-based music player
• Tabla bol transcription
• Singer identification
• Speaker verification
• Ad detection in live radio streams
• Speaker adaptation
• Emotion recognition from speech
• Audio synthesis using LSTMs
• Swapping instruments in recordings

(11)

Evaluation — Final Exam

Grading: Constitutes 25% of the total grade.

Syllabus: Will be tested on all the material covered in the course.

Format: Closed book, written exam.

Image from LOTR-I; meme not original

(12)

Academic Integrity Policy

• Write what you know.

• Use your own words.

• If you refer to *any* external material, *always* cite your sources. Follow proper citation guidelines.

• If you’re caught plagiarising or copying, the penalties are much higher than the cost of simply omitting that question.

• In short: Just not worth it. Don’t do it!

Image credit: https://www.flickr.com/photos/kurok/22196852451

(13)

Introduction to Speech Recognition

(14)

Exciting time to be an AI/ML researcher!

Image credit: http://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

(15)

Lots of new progress

What is speech recognition? Why is it such a hard problem?

(16)

Automatic Speech Recognition (ASR)

• Automatic speech recognition (or speech-to-text) systems transform speech utterances into their corresponding text form, typically in the form of a word sequence

(17)

Automatic Speech Recognition (ASR)

• Automatic speech recognition (or speech-to-text) systems transform speech utterances into their corresponding text form, typically in the form of a word sequence.

• Many downstream applications of ASR:

• Speech understanding: comprehending the semantics of text

• Audio information retrieval: searching speech databases

• Spoken translation: translating spoken language into foreign text

• Keyword search: searching for specific content words in speech

• Other related tasks include speaker recognition, speaker diarization, speech detection, etc.

(18)

History of ASR

RADIO REX (1922)

(19)

History of ASR

SHOEBOX (IBM, 1962)

[Timeline 1922–2012: 1922: 1-word frequency detector (Radio Rex)]

(20)

History of ASR

[Timeline 1922–2012: 1922: 1-word frequency detector; 1962: 16-word isolated word recognition]

HARPY (CMU, 1976)

(21)

History of ASR

[Timeline 1922–2012: 1922: 1-word frequency detector; 1962: 16-word isolated word recognition; 1976: 1000-word connected speech]

HIDDEN MARKOV MODELS (1980s)

(22)

History of ASR

[Timeline 1922–2012: 1922: 1-word frequency detector; 1962: 16-word isolated word recognition; 1976: 1000-word connected speech; 1980s: 10K+ word LVCSR systems (Siri, Cortana)]

DEEP NEURAL NETWORK BASED SYSTEMS (>2010)

(23)

History of ASR

[Timeline 1922–2012: 1922: 1-word frequency detector; 1962: 16-word isolated word recognition; 1976: 1000-word connected speech; 1980s: 10K+ word LVCSR systems; >2010: 1M+ word DNN-based systems]

What’s next?

(24)

Video from: https://www.youtube.com/watch?v=gNx0huL9qsQ

(25)

This can’t be blamed on ASR

(26)

ASR is the front-engine

Image credit: Stanford University

(27)

Why is ASR a challenging problem?

Variabilities in different dimensions:

Style: Read speech or spontaneous (conversational) speech? Continuous natural speech or command & control?

Speaker characteristics: Rate of speech, accent, prosody (stress, intonation), speaker age, pronunciation variability even when the same speaker speaks the same word

Channel characteristics: Background noise, room acoustics, microphone properties, interfering speakers

Task specifics: Vocabulary size (very large number of words to be recognized), language-specific complexity, resource limitations

(28)

Noisy channel model

[Diagram: Encoder → Noisy channel → Decoder, with symbols S, C, O, W]

Claude Shannon (1916-2001)

(29)

Noisy channel model applied to ASR

[Diagram: Speaker → Acoustic processor → Decoder, with symbols W, O, W*]

Claude Shannon (1916-2001) · Fred Jelinek (1932-2010)

(30)

Statistical Speech Recognition

Let O represent a sequence of acoustic observations (i.e. O = {O_1, O_2, …, O_T}, where O_i is the feature vector observed at time i) and let W denote a word sequence. Then, the decoder chooses W* as follows:

W* = arg max_W Pr(W | O)
   = arg max_W Pr(O | W) Pr(W) / Pr(O)

This maximisation does not depend on Pr(O). So, we have:

W* = arg max_W Pr(O | W) Pr(W)
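As a concrete sketch of this decision rule: the toy decoder below picks the candidate word sequence maximising log Pr(O | W) + log Pr(W). The scoring functions and scores are made up for illustration, not from any real ASR system.

```python
def decode(O, candidates, acoustic_logprob, lm_logprob):
    """Noisy-channel decoding: choose the W maximising
    log Pr(O | W) + log Pr(W).  Pr(O) is the same for every
    candidate, so it drops out of the arg max."""
    return max(candidates,
               key=lambda W: acoustic_logprob(O, W) + lm_logprob(W))

# Hypothetical log-probabilities for two candidate transcriptions:
am = {"recognise speech": -12.0, "wreck a nice beach": -11.5}
lm = {"recognise speech": -2.0, "wreck a nice beach": -6.0}
best = decode("O", list(am), lambda O, W: am[W], lambda W: lm[W])
print(best)  # the language model outweighs the small acoustic deficit
```

Note how the second candidate wins on the acoustic score alone but loses once the language model prior is added in.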

(31)

Statistical Speech Recognition

W* = arg max_W Pr(O | W) Pr(W)

Pr(O | W) is referred to as the “acoustic model”

Pr(W) is referred to as the “language model”

[Pipeline: speech signal → Acoustic Feature Generator → O → SEARCH, which combines the Acoustic Model and the Language Model → word sequence W*]

(32)

Example: Isolated word ASR task

Vocabulary: 10 digits (zero, one, two, …) and 2 operations (plus, minus)

Data: Speech utterances corresponding to each word, sampled from multiple speakers

Recall the acoustic model is Pr(O | W): direct estimation is impractical (why?)

Let’s parameterize Pr_α(O | W) using a Markov model with parameters α. Now, the problem reduces to estimating α.

(33)

Isolated word-based acoustic models

Image from: P. Jyothi, “Discriminative & AF-based Pron. models for ASR”, Ph.D. thesis, 2013

Transition probabilities are denoted by a_ij, from state i to state j. Observation vectors O_t are generated from the probability density b_j(O_t).

[Figure 2.1: Standard topology used to represent a phone HMM: a left-to-right chain with non-emitting entry state 0 and exit state 4; emitting states 1-3 with self-loops a_11, a_22, a_33; forward transitions a_01, a_12, a_23, a_34; and emission densities b_1, b_2, b_3 generating O_1 … O_T.]

Model for word “one”

(34)

Isolated word-based acoustic models

[Figure 2.1 again: standard phone-HMM topology with states 0-4, self-loops a_11, a_22, a_33, transitions a_01, a_12, a_23, a_34, and emission densities b_1, b_2, b_3.]

Model for word “one”

For an O = {O_1, O_2, …, O_6} and a state sequence Q = {0, 1, 1, 2, 3, 4}:

Pr(O, Q | W = ‘one’) = a_01 b_1(O_1) a_11 b_1(O_2) …

Pr(O | W = ‘one’) = Σ_Q Pr(O, Q | W = ‘one’)
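Summing Pr(O, Q | W) over all state sequences Q is done efficiently by the forward algorithm. Below is a minimal sketch for a left-to-right topology like the one above; the transition and emission values are made up for illustration (a real model would use trained Gaussian or DNN emission densities).

```python
def forward_prob(trans, emit, T):
    """Pr(O | W): sum Pr(O, Q | W) over all state sequences Q by
    dynamic programming, for a left-to-right HMM with a non-emitting
    entry state 0 and exit state N-1 (the topology of Figure 2.1).
    trans[i][j] = a_ij; emit[j][t] = b_j(O_{t+1})."""
    N = len(trans)
    states = range(1, N - 1)                 # emitting states 1..N-2
    # First frame: enter from state 0 and emit O_1.
    alpha = {j: trans[0][j] * emit[j][0] for j in states}
    for t in range(1, T):
        # alpha[j] accumulates the probability of all paths ending
        # in state j after emitting the first t+1 observations.
        alpha = {j: sum(alpha[i] * trans[i][j] for i in states) * emit[j][t]
                 for j in states}
    # Finally, exit into the non-emitting state N-1.
    return sum(alpha[i] * trans[i][N - 1] for i in states)

# 3 emitting states, self-loop and advance probability 0.5 each,
# and a constant (toy) emission density of 0.1 per frame:
trans = [[0, 1.0, 0, 0, 0],
         [0, 0.5, 0.5, 0, 0],
         [0, 0, 0.5, 0.5, 0],
         [0, 0, 0, 0.5, 0.5],
         [0, 0, 0, 0, 0]]
emit = {j: [0.1] * 6 for j in (1, 2, 3)}
p = forward_prob(trans, emit, T=6)   # Pr(O | W) for a 6-frame utterance
```

The loop costs O(T · N²), versus exponentially many terms if each state sequence Q were enumerated explicitly.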

(35)

Isolated word recognition

[Four word HMMs, one per vocabulary word, each with the phone-HMM topology of Figure 2.1:]

one: [HMM for “one”]
two: [HMM for “two”]
plus: [HMM for “plus”]
minus: [HMM for “minus”]

The acoustic features O are scored under each model:

Pr(O | W = ‘one’)
Pr(O | W = ‘two’)
Pr(O | W = ‘plus’)
Pr(O | W = ‘minus’)

Pick arg max_w Pr(O | W = w)

What are we assuming about Pr(W)?
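This per-word decision rule can be sketched as follows; the likelihood values are hypothetical stand-ins for trained word HMMs. Note that taking the arg max of Pr(O | W = w) alone implicitly assumes a uniform prior Pr(W) over the vocabulary.

```python
def recognise(O, word_models):
    """Isolated word recognition: score O under each word's model
    Pr(O | W = w) and return the arg max.  With a uniform prior
    Pr(W), this is the same as maximising Pr(O | W) Pr(W)."""
    return max(word_models, key=lambda w: word_models[w](O))

# Toy likelihood functions standing in for the four trained word HMMs:
models = {"one": lambda O: 1e-7, "two": lambda O: 3e-7,
          "plus": lambda O: 2e-8, "minus": lambda O: 5e-9}
word = recognise([0.3, 0.1, 0.7], models)
print(word)  # "two": the model with the highest likelihood wins
```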

(36)

Isolated word recognition

[The same four word HMMs (“one”, “two”, “plus”, “minus”) scoring the acoustic features O:]

Pr(O | W = ‘one’)
Pr(O | W = ‘two’)
Pr(O | W = ‘plus’)
Pr(O | W = ‘minus’)

Is this approach scalable?

(37)

Why are word-based models not scalable? An example:

“five four one nine”

Words: five · four · one · nine
Phonemes: f ay v · f ow r · w ah n · n ay n

A pronunciation model maps words to phoneme sequences.
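In its simplest form, such a pronunciation model is just a lexicon from words to phoneme sequences. A minimal sketch using the pronunciations on this slide (ARPAbet-style symbols); real lexicons list multiple pronunciations per word, which is why each entry here is a list.

```python
# Pronunciation lexicon: each word maps to a list of pronunciations,
# each pronunciation being a phoneme sequence.
lexicon = {
    "five": [["f", "ay", "v"]],
    "four": [["f", "ow", "r"]],
    "one":  [["w", "ah", "n"]],
    "nine": [["n", "ay", "n"]],
}

def to_phonemes(words):
    """Expand a word sequence into its phoneme sequence, taking the
    first listed pronunciation for each word."""
    return [ph for w in words for ph in lexicon[w][0]]

phones = to_phonemes("five four one nine".split())
print(phones)
# ['f', 'ay', 'v', 'f', 'ow', 'r', 'w', 'ah', 'n', 'n', 'ay', 'n']
```

With such a lexicon, acoustic models only need to cover a few dozen phones rather than one HMM per word, which is what makes large vocabularies tractable.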

(38)

Recall: Statistical Speech Recognition

W* = arg max_W Pr(O | W) Pr(W)

[Pipeline: speech signal → Acoustic Feature Generator → O → SEARCH, which combines the Acoustic Model and the Language Model → word sequence W*]

(39)

Statistical Speech Recognition

W* = arg max_W Pr(O | W) Pr(W)

[Pipeline: speech signal → Acoustic Feature Generator → O → SEARCH, which combines the Acoustic Model (phonemes), the Pronunciation Model, and the Language Model → word sequence W*]

(40)

Evaluate an ASR system

Quantitative metric: Error rates computed on an unseen test set by comparing W* (decoded output) against W_ref (reference sentence) for each test utterance

• Sentence/Utterance error rate (trivial to compute!)

• Word/Phone error rate

(41)

Evaluate an ASR system

Word/Phone error rate (ER) uses the Levenshtein distance measure: what is the minimum number of edits (insertions/deletions/substitutions) required to convert W* to W_ref?

On a test set with N instances:

ER = ( Σ_{j=1}^{N} Ins_j + Del_j + Sub_j ) / ( Σ_{j=1}^{N} ℓ_j )

Ins_j, Del_j, Sub_j are the number of insertions/deletions/substitutions in the j-th ASR output; ℓ_j is the total number of words/phones in the j-th reference.
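A minimal sketch of this computation; for brevity the helper returns only the total edit count rather than the separate insertion/deletion/substitution counts, which the standard dynamic program can also recover by backtracking.

```python
def edit_ops(ref, hyp):
    """Levenshtein distance between a reference and a hypothesis word
    list: the minimum number of substitutions + insertions + deletions
    needed to turn ref into hyp."""
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                    # delete all remaining ref words
    for j in range(H + 1):
        d[0][j] = j                    # insert all remaining hyp words
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[R][H]

def error_rate(pairs):
    """ER = total edits across the test set / total reference length."""
    edits = sum(edit_ops(r.split(), h.split()) for r, h in pairs)
    length = sum(len(r.split()) for r, _ in pairs)
    return edits / length

test_set = [("five four one nine", "five for one nine"),   # 1 substitution
            ("one plus two", "one plus plus two")]         # 1 insertion
print(error_rate(test_set))  # 2 edits / 7 reference words ≈ 0.286
```

Because insertions are counted, the error rate can exceed 100% when the hypothesis is much longer than the reference.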

(42)

NIST ASR Benchmark Test History

http://www.itl.nist.gov/iad/mig/publications/ASRhistory/

[Chart: “NIST STT Benchmark Test History”, WER (in %) on a log scale from 1% to 100%, across tasks: Read Speech (1k, 5k and 20k vocabularies; noisy; varied microphones), Air Travel Planning Kiosk Speech, Broadcast Speech (News English 1X/10X/unlimited, News Mandarin 10X, News Arabic 10X), Conversational Speech (Switchboard, Switchboard II, Switchboard Cellular, CTS Fisher, CTS Arabic, CTS Mandarin), and Meeting Speech (IHM, MDM OV4, SDM OV4).]

Why is the problem so difficult?

Background noise, “cocktail party” effect.

Channel differences between training and testing: head-mounted vs. desktop mic: 10% vs. 70% WER for a speaker-trained commercial system.

Read versus spontaneous speech: “yeah yeah I’ve noticed that that that’s one of the first things I do when I go home is I either turn on the t v or the radio it’s really weird”

Speaker variability: accent, dialect, situational (motherese), age (child vs. older speaker), and natural variability between humans (idiolect).

Slide from Prof. Jeff Bilmes, EE516 (Spring 2013), Speech Processing, Lecture 1, April 2nd, 2013.

(43)

Course Overview

[Pipeline: speech signal → Acoustic Feature Generator → O → SEARCH, which combines the Acoustic Model (phones), the Pronunciation Model, and the Language Model → word sequence W*]

Topics mapped onto the pipeline: properties of speech sounds; acoustic signal processing; Hidden Markov Models; Deep Neural Networks; hybrid HMM-DNN systems; speaker adaptation; Ngram/RNN LMs; G2P/feature-based pronunciation models.

(44)

Course Overview

[Pipeline: speech signal → Acoustic Feature Generator → O → SEARCH, which combines the Acoustic Model (phones), the Pronunciation Model, and the Language Model → word sequence W*]

Topics mapped onto the pipeline: properties of speech sounds; acoustic signal processing; Hidden Markov Models; Deep Neural Networks; hybrid HMM-DNN systems; speaker adaptation; Ngram/RNN LMs; G2P/feature-based pronunciation models; search algorithms.
