Statistical Speech Recognition


(1)

Instructor: Preethi Jyothi

Introduction to

Statistical Speech Recognition

Lecture 1

CS 753

(2)

Course Plan (I)

• Cascaded ASR System

- Acoustic Model (AM): HMMs, DNN and RNN-based models

- Pronunciation Model (PM): Phoneme and grapheme-based models

- Language Model (LM): N-gram models (+smoothing), RNNLMs

• Weighted Finite State Transducers for ASR

• Decoding Algorithms, Lattices

[Figure 1.1: A standard automatic speech recognition architecture — a decoder combines an acoustic model (AM) over per-frame acoustic features, a pronunciation model (PM) mapping words to phone sequences (e.g. "good" → "g uh d", "like" → "l ay k", "is" → "ih z"), and a language model (LM) scoring n-grams, to transcribe a speech waveform such as "good prose is like a windowpane".]

(3)

Course Plan (II)

• End-to-end Neural Models for ASR

- CTC loss function

- Encoder-decoder Architectures with Attention

• Speaker Adaptation

• Speech Synthesis

• Recent Generative Models (GANs, VAEs) for Speech Processing

Check www.cse.iitb.ac.in/~pjyothi/cs753 for latest updates.

Moodle will be used for assignment/project-related submissions and all announcements.

Fig. 1: Listen, Attend and Spell (LAS) model: the listener is a pyramidal BLSTM encoding the input sequence x into high-level features h; the speller is an attention-based decoder generating the y characters from h.

The AttendAndSpell operation consumes h and produces a probability distribution over character sequences:

h = Listen(x)    (2)

P(y_i | x, y_{<i}) = AttendAndSpell(y_{<i}, h)    (3)

Figure 1 depicts these two components. We provide more details of these components in the following sections.

2.1. Listen

The Listen operation uses a Bidirectional Long Short Term Memory RNN (BLSTM) [15, 16, 2] with a pyramidal structure. This modification is required to reduce the length U of h from T, the length of the input x, because the input speech signals can be hundreds to thousands of frames long. A direct application of BLSTM for the operation Listen converged slowly and produced results inferior to those reported here, even after a month of training time. This is presumably because the operation AttendAndSpell has a hard time extracting the relevant information from a large number of input time steps.

We circumvent this problem by using a pyramidal BLSTM (pBLSTM). In each successive stacked pBLSTM layer, we reduce the time resolution by a factor of 2. In a typical deep BLSTM architecture, the output at the i-th time step, from the j-th layer, is computed as follows:

h_i^j = BLSTM(h_{i-1}^j, h_i^{j-1})    (4)

In the pBLSTM model, we concatenate the outputs at consecutive steps of each layer before feeding it to the next layer, i.e.:

h_i^j = pBLSTM(h_{i-1}^j, [h_{2i}^{j-1}, h_{2i+1}^{j-1}])    (5)

In our model, we stack 3 pBLSTMs on top of the bottom BLSTM layer to reduce the time resolution 2^3 = 8 times. This allows the attention model (described in the next section) to extract the relevant information from a smaller number of time steps. In addition to reducing the resolution, the deep architecture allows the model to learn nonlinear feature representations of the data. See Figure 1 for a visualization of the pBLSTM.
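To make Eq. (5) concrete, here is a minimal NumPy sketch of the pyramidal concatenation step alone (the BLSTM itself is elided; the function name and dimensions are illustrative, not from the paper):

```python
import numpy as np

def reduce_time_resolution(h):
    """Concatenate consecutive pairs of frames, halving the sequence length.

    h: (U, d) array of layer outputs; returns (U // 2, 2 * d), which the
    next pBLSTM layer would consume as in Eq. (5).
    """
    U, d = h.shape
    if U % 2 == 1:
        h = h[:-1]              # drop a trailing odd frame (padding also works)
    return h.reshape(-1, 2 * d)

# Toy example: 16 input frames, 3 pyramidal reductions -> 16 / 2**3 = 2 steps
h = np.random.randn(16, 4)
for _ in range(3):
    h = reduce_time_resolution(h)   # a BLSTM layer would follow each reduction
print(h.shape)                      # (2, 32)
```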

The pyramidal structure also reduces the computational complexity. The attention mechanism in the speller has a computational complexity of O(US). Thus, reducing U speeds up learning and inference significantly. Other neural network architectures have been described in the literature with similar motivations, including the hierarchical RNN [17], clockwork RNN [18] and CNN [19].

2.2. Attend and Spell

The AttendAndSpell function is computed using an attention-based LSTM transducer [10, 12]. At every output step, the transducer produces a probability distribution over the next character conditioned on all the characters seen previously. The distribution for y_i is a function of the decoder state s_i and context c_i. The decoder state s_i is a function of the previous state s_{i-1}, the previously emitted character y_{i-1} and context c_{i-1}. The context vector c_i is produced by an attention mechanism. Specifically,

c_i = AttentionContext(s_i, h)    (6)

s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1})    (7)

P(y_i | x, y_{<i}) = CharacterDistribution(s_i, c_i)    (8)

where CharacterDistribution is an MLP with softmax outputs over characters, and where RNN is a 2 layer LSTM.
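As a rough sketch of one speller step (Eqs. 6-8): the code below substitutes a single tanh layer for the paper's 2-layer LSTM and uses plain dot-product attention; all weights, dimensions and names are invented placeholders, not the published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_h, n_chars = 8, 8, 30      # toy dimensions
W_s = rng.standard_normal((d_s, d_s + n_chars + d_h))
W_out = rng.standard_normal((n_chars, d_s + d_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s, h):
    """Eq. (6), simplified: dot-product attention without the MLPs."""
    alpha = softmax(h @ s)                # weights over the U listener steps
    return alpha @ h                      # context vector c_i

def speller_step(s_prev, y_prev, c_prev, h):
    """One decoder step: Eq. (7), then (6), then (8); tanh stands in for the LSTM."""
    s = np.tanh(W_s @ np.concatenate([s_prev, y_prev, c_prev]))    # Eq. (7)
    c = attention_context(s, h)                                    # Eq. (6)
    p = softmax(W_out @ np.concatenate([s, c]))                    # Eq. (8)
    return s, c, p

h = rng.standard_normal((5, d_h))     # listener features, U = 5
s, c = np.zeros(d_s), np.zeros(d_h)
y = np.zeros(n_chars); y[0] = 1.0     # one-hot <sos>
s, c, p = speller_step(s, y, c, h)
print(p.shape, round(p.sum(), 6))     # (30,) 1.0
```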

At each time step i, the attention mechanism AttentionContext generates a context vector c_i encapsulating the information in the acoustic signal needed to generate the next character. The attention model is content based: the contents of the decoder state s_i are matched to the contents of h_u, representing time step u of h, to generate an attention vector α_i. The vectors h_u are linearly blended using α_i to create c_i.

Specifically, at each decoder timestep i, the AttentionContext function computes the scalar energy e_{i,u} for each time step u, using vector h_u ∈ h and s_i. The scalar energy e_{i,u} is converted into a probability distribution over time steps (or attention) α_i using a softmax function. The softmax probabilities are used as mixing weights for blending the listener features h_u into the context vector c_i for output time step i:

e_{i,u} = ⟨φ(s_i), ψ(h_u)⟩    (9)

α_{i,u} = exp(e_{i,u}) / Σ_{u′} exp(e_{i,u′})    (10)

c_i = Σ_u α_{i,u} h_u    (11)

where φ and ψ are MLP networks.
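A minimal NumPy rendering of Eqs. (9)-(11), with single linear maps standing in for the MLPs φ and ψ (their sizes are not pinned down here; every dimension below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
U, d_h, d_s, d_e = 6, 8, 8, 16           # toy sizes
W_phi = rng.standard_normal((d_e, d_s))  # stands in for the MLP phi
W_psi = rng.standard_normal((d_e, d_h))  # stands in for the MLP psi

def attention_context(s_i, h):
    e = (h @ W_psi.T) @ (W_phi @ s_i)    # Eq. (9): e_{i,u} = <phi(s_i), psi(h_u)>
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                 # Eq. (10): softmax over time steps u
    return alpha @ h                     # Eq. (11): c_i = sum_u alpha_{i,u} h_u

h = rng.standard_normal((U, d_h))        # listener features
s_i = rng.standard_normal(d_s)           # decoder state at step i
print(attention_context(s_i, h).shape)   # (8,)
```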

Image from: Chan et al., "Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition", ICASSP 2016

(4)

Other Course Info

• Teaching Assistants (TAs):

- Vinit Unni (vinit AT cse)

- Saiteja Nalla (saitejan AT cse)

- Naman Jain (namanjain AT cse)

• TA office hours: Wednesdays, 10 am to 12 pm (tentative)

• Instructor 1-1: Email me to schedule a time

• Readings:

- No fixed textbook. “Speech and Language Processing” by Jurafsky and Martin serves as a good starting point.

- All further readings will be posted online.

• Audit requirements: Complete all assignments/quizzes and score ≥ 40%

(5)

Course Evaluation

• 3 Assignments OR 2 Assignments + 1 Quiz 35%

• At least one programming assignment

- Set up ASR system based on a recipe & improve said recipe

• Midsem Exam + Final Exam 15% + 25%

• Final Project 20%

• Participation 5%

Attendance Policy? Strongly advised to attend lectures. Also, participation points hinge on it.

(6)

Academic Integrity Policy


Assignments/Exams

• Always cite your sources (be it images, papers or existing code repos).


Follow proper citation guidelines.

• Unless specifically permitted, collaborations are not allowed.

• Do not copy or plagiarise. Will incur significant penalties.


(8)

Final Project

• Projects can be on any topic related to speech/audio processing. 


Check website for abstracts from a previous offering.

• No individual projects and no more than 3 members in a team.

• Preliminary Project Evaluation (Sep 1-7): Short report detailing project statement, goals, specific tasks and preliminary experiments

• Final Evaluation (Nov 7-14):

- Presentation (Oral or poster session, depending on final class strength)

- Report (Use ML conference style files & provide details about the project)

• Excellent Projects:

- Will earn extra credit that counts towards the final grade

- Can be turned into a research paper

(9)

#1: Speech-driven Facial Animation

https://arxiv.org/pdf/1906.06337.pdf, June 2019

Videos from: https://sites.google.com/view/facial-animation

(10)

#2: Speech2Gesture

https://arxiv.org/abs/1906.04160, CVPR 2019

Image from: http://people.eecs.berkeley.edu/~shiry/projects/speech2gesture/

(11)

#3: Decoding Brain Signals Into Speech

https://www.nature.com/articles/s41586-019-1119-1, April 2019

(12)

Introduction to ASR

(13)

Automatic Speech Recognition

• Problem statement: Transform a spoken utterance into a sequence of tokens (words, syllables, phonemes, characters)

• Many downstream applications of ASR. Examples:

- Speech understanding

- Spoken translation

- Audio information retrieval

• Speech exhibits variability at multiple levels: speaker style, accents, room acoustics, microphone properties, etc.

(14)

History of ASR

RADIO REX (1922)

(15)

History of ASR

SHOEBOX (IBM, 1962)

[Timeline so far — 1922: 1-word frequency detector (Radio Rex)]

(16)

History of ASR

HARPY (CMU, 1976)

[Timeline so far — 1922: 1-word frequency detector; 1962: 16 words, isolated word recognition]

(17)

History of ASR

HIDDEN MARKOV MODELS (1980s)

[Timeline so far — 1922: 1-word frequency detector; 1962: 16 words, isolated word recognition; 1976: 1000 words, connected speech]

(18)

History of ASR

DEEP NEURAL NETWORK BASED SYSTEMS (>2010): Cortana, Siri

[Timeline — 1922: 1-word frequency detector; 1962: 16 words, isolated word recognition; 1976: 1000 words, connected speech; 1980s: 10K+ words, LVCSR systems; >2010: DNN-based systems]

(19)

How are ASR systems evaluated?

• Error rates computed on an unseen test set by comparing W* (decoded sentence) against W_ref (reference sentence) for each test utterance

- Sentence/Utterance error rate (trivial to compute!)

- Word/Phone error rate

• Word/Phone error rate (ER) uses the Levenshtein distance measure: what is the minimum number of edits (insertions/deletions/substitutions) required to convert W* to W_ref?

On a test set with N instances:

ER = ( Σ_{j=1}^{N} (Ins_j + Del_j + Sub_j) ) / ( Σ_{j=1}^{N} ℓ_j )

where Ins_j, Del_j, Sub_j are the number of insertions/deletions/substitutions in the j-th ASR output, and ℓ_j is the total number of words/phones in the j-th reference.
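The per-utterance edit counts come from a standard Levenshtein dynamic program over words; a minimal Python sketch (the function name and structure are ours, not from any particular toolkit):

```python
def wer(ref_words, hyp_words):
    """Word error rate for one utterance: Levenshtein distance over words."""
    R, H = len(ref_words), len(hyp_words)
    # dp[i][j] = minimum edits aligning hyp_words[:j] to ref_words[:i]
    dp = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dp[i][0] = i                        # deletions: reference words missing
    for j in range(H + 1):
        dp[0][j] = j                        # insertions: extra hypothesis words
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = dp[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            dp[i][j] = min(sub,              # substitution (free if words match)
                           dp[i - 1][j] + 1,     # deletion
                           dp[i][j - 1] + 1)     # insertion
    return dp[R][H] / R

ref = "good prose is like a windowpane".split()
hyp = "good rose is like windowpane".split()
print(wer(ref, hyp))   # 1 substitution + 1 deletion over 6 words = 0.333...
```

Note that the corpus-level ER above sums edit counts and reference lengths across all N utterances before dividing, rather than averaging per-utterance rates.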

(20)

Remarkable progress in ASR in the last decade

[NIST STT Benchmark Test History (through 2018): word error rates, plotted on a log scale from 100% down to 1% (with 2% and 4% reference lines), falling steadily across tasks of increasing difficulty — read speech (1k/5k/20k vocabularies, noisy, varied microphones), air travel planning kiosk speech, broadcast news (English 1X/10X/unlimited, Mandarin 10X, Arabic 10X), conversational telephone speech (Switchboard, Switchboard II, Switchboard Cellular, CTS Fisher, CTS Arabic, CTS Mandarin, non-English) and meeting speech (IHM, SDM OV4, MDM OV4).]

Image from: http://www.itl.nist.gov/iad/mig/publications/ASRhistory/

Why is the problem so difficult? (from a lecture slide by Prof. Jeff Bilmes, EE516, Spring 2013)

- Background noise, "cocktail party" effect.

- Channel differences between training and testing: head-mounted vs. desktop mic — 10% vs. 70% WER for a speaker-trained commercial system.

- Read versus spontaneous speech: "yeah yeah I've noticed that that that's one of the first things I do when I go home is I either turn on the t v or the radio it's really weird"

- Speaker variability: accent, dialect, situational (motherese), age (child vs. older speaker), and natural variability between humans (idiolect).

(21)

Statistical Speech Recognition

Pioneer of ASR technology, Fred Jelinek (1932-2010), cast ASR as a channel coding problem.

Let O be a sequence of acoustic features corresponding to a speech signal. That is, O = {O_1, …, O_T}, where O_i ∈ ℝ^d refers to a d-dimensional acoustic feature vector and T is the length of the sequence.

Let W denote a word sequence. An ASR decoder solves the following problem:

W* = arg max_W Pr(W | O)
   = arg max_W Pr(O | W) Pr(W)

The second equality follows from Bayes' rule, since Pr(O) does not depend on W. Here Pr(O | W) is the acoustic model and Pr(W) is the language model.
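In practice, decoders maximize log Pr(O | W) + log Pr(W) in the log domain to avoid numerical underflow. A toy illustration of the arg max with made-up scores (a real decoder searches an enormous hypothesis space rather than enumerating candidates):

```python
# Hypothetical (log AM score, log LM score) for three candidate transcriptions;
# the numbers are invented purely for illustration.
candidates = {
    "good prose is like a windowpane":  (-120.5, -21.3),
    "good pros is like a window pain":  (-119.8, -29.0),
    "could prose is like a windowpane": (-124.0, -24.1),
}

def decode(cands):
    # W* = arg max_W  log Pr(O | W) + log Pr(W)
    return max(cands, key=lambda w: sum(cands[w]))

print(decode(candidates))   # -> "good prose is like a windowpane"
```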

(22)

Simple example of isolated word ASR

• Task: Recognize utterances which consist of speakers saying either "up" or "down" or "left" or "right" per recording.

• Vocabulary: Four words, "up", "down", "left", "right"

• Data splits

- Training data: 30 utterances

- Test data: 20 utterances

• Acoustic model: Let's parameterize Pr_θ(O | W) using a Markov model with parameters θ.

(23)

Word-based acoustic model

Model for "up":

a_ij: transition probability for going from state i to state j; b_j(O_i): probability of generating observation O_i from state j.

[Figure 2.1: Standard topology used to represent a phone HMM — a left-to-right chain of states 0 → 1 → 2 → 3 → 4 with self-loops a_11, a_22, a_33, forward transitions a_01, a_12, a_23, a_34, and emission distributions b_1(·), b_2(·), b_3(·) generating the observation sequence O_1, O_2, …, O_T.]

Compute Pr(O | "up") = Σ_Q Pr(O, Q | "up"), summing over state sequences Q.

Efficient algorithm exists. Will appear in a later class.

(From the accompanying text:) … sub-word units Q corresponding to the word sequence W, and the language model P(W) provides a prior probability for W.

Acoustic model: The most commonly used acoustic models in ASR systems today are Hidden Markov Models (HMMs). Please refer to Rabiner (1989) for a comprehensive tutorial of HMMs and their applicability to ASR in the 1980s (with ideas that are largely applicable to systems today). HMMs are used to build probabilistic models for linear sequence labeling problems. Since speech is represented in the form of a sequence of acoustic vectors O, it lends itself to being naturally modeled using HMMs.

The HMM is defined by specifying transition probabilities (a_ij) and observation (or emission) probability distributions (b_j(O_i)), along with the number of hidden states in the HMM. An HMM makes a transition from state i to state j with a probability of a_ij. On reaching a state j, the observation vector at that state (O_j) …
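The efficient algorithm alluded to above is the forward algorithm (details in a later class). A minimal NumPy sketch for a single word HMM, with a toy transition matrix and random emission likelihoods standing in for a trained model:

```python
import numpy as np

def forward_likelihood(A, B, init):
    """Pr(O | word) = sum_Q Pr(O, Q | word), computed in O(T * n^2) time.

    A:    (n, n) transition probabilities a_ij
    B:    (T, n) emission likelihoods, B[t, j] = b_j(O_{t+1})
    init: (n,)   initial state distribution
    """
    alpha = init * B[0]                 # alpha_1(j) = Pr(O_1, q_1 = j)
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]      # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(O_t)
    return alpha.sum()

# Toy 3-state left-to-right topology (self-loops + forward transitions), 4 frames
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.random.default_rng(0).uniform(0.1, 1.0, size=(4, 3))
print(forward_likelihood(A, B, init=np.array([1.0, 0.0, 0.0])))
```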

(24)

Isolated word recognition

acoustic features O → four word HMMs (one each for "up", "down", "left" and "right", each with the Figure 2.1 phone-HMM topology) → Pr(O | "up"), Pr(O | "down"), Pr(O | "left"), Pr(O | "right")

Compute arg max_w Pr(O | w) (see the sketch below)
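Putting the pieces together, isolated word recognition is just an arg max over the per-word model scores. In the sketch below, `log_likelihood` is a hypothetical stand-in (stubbed with deterministic pseudo-random scores so the snippet runs) for a log-domain forward pass over each word's trained HMM:

```python
import numpy as np

def log_likelihood(O, word):
    """Placeholder for log Pr(O | word); a real system would run the
    forward algorithm over the HMM trained for this word."""
    rng = np.random.default_rng(sum(ord(c) for c in word))
    return float(rng.normal(-100.0, 5.0))

def recognize(O, vocabulary=("up", "down", "left", "right")):
    return max(vocabulary, key=lambda w: log_likelihood(O, w))

O = np.random.randn(50, 13)   # e.g. 50 frames of 13-dimensional features
print(recognize(O))
```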

(25)

Small tweak

• Task: Recognize utterances which consist of speakers saying either "up" or "down" multiple times per recording.

[Slide shows the same Figure 2.1 word-HMM diagram twice, once labeled "up" and once labeled "down".]

(26)

Small tweak

• Task: Recognize utterances which consist of speakers saying either "up" or "down" multiple times per recording.

[Slide shows the "up" and "down" word HMMs joined into a single decoding graph.]

Search within this graph

(27)

Small vocabulary ASR

• Task: Recognize utterances which consist of speakers saying one of 1000 words multiple times per recording.

• Not scalable anymore to use words as speech units

• Model using phones instead of words as individual speech units

- Phonemes are abstract, subword units that distinguish one word from another (minimal pair; e.g. “pan” vs. “can”)

- Phones are the actual sounds that are realized; they are not language-specific units

• What's an obvious advantage of using phones over entire words?
Hint: Think of words with zero coverage in the training data.

(28)

Architecture of an ASR system

speech signal → Acoustic Feature Generator → O → SEARCH (driven by the Acoustic Model over phones, the Pronunciation Model, and the Language Model) → word sequence W*

(29)

Cascaded ASR → End-to-end ASR

speech signal → Acoustic Feature Generator → O → single end-to-end model that directly learns a mapping from speech to text → word sequence W*

(30)

ASR Progress contd.

AUG '16: https://www.npr.org/sections/alltechconsidered/2016/08/24/491156218/voice-recognition-software-finally-beats-humans-at-typing-study-finds

AUG '17: https://www.microsoft.com/en-us/research/blog/microsoft-researchers-achieve-new-conversational-speech-recognition-milestone/

MAR '19: https://venturebeat.com/2019/04/22/amazons-ai-system-could-cut-alexa-speech-recognition-errors-by-15/

(31)

What are some unsolved problems related to ASR?

• State-of-the-art ASR systems do not work well on regional accents, dialects

• Code-switching is hard for ASR systems to deal with

• How do we rapidly build competitive ASR systems for a new language? Low-resource ASR and keyword spotting.

• How do we recognize speech from meetings where a primary speaker is speaking amidst other speakers?

(32)

Next class: HMMs for Acoustic Modeling
