
Astik Biswas

Department of Electrical Engineering

National Institute of Technology Rourkela

Performance Enhancement of Automatic Speech Recognition (ASR) Using Robust Wavelet-Based Feature Extraction Techniques


Performance Enhancement of Automatic Speech Recognition (ASR) Using Robust Wavelet-Based Feature Extraction Techniques

Dissertation submitted to the

National Institute of Technology Rourkela

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Electrical Engineering

by

Astik Biswas

(Roll No: 512EE103)

under the supervision of

Prof. Prasanna Kumar Sahu

Department of Electrical Engineering, National Institute of Technology, Rourkela,

Rourkela-769 008, Orissa, India 2012-2015


July 2, 2016

Certificate of Examination

Roll Number: 512EE103 Name: Astik Biswas

Title of Dissertation: Performance Enhancement of Automatic Speech Recognition (ASR) Using Robust Wavelet-Based Feature Extraction Techniques.

We, the undersigned, after checking the dissertation mentioned above and the official record book(s) of the student, hereby state our approval of the dissertation submitted in partial fulfillment of the requirements of the degree of Doctor of Philosophy in Electrical Engineering at National Institute of Technology Rourkela. We are satisfied with the volume, quality, correctness, and originality of the work.

————————— —————————

<Co-Supervisor’s Name> <P.K.Sahu>

Co-Supervisor Principal Supervisor

————————— —————————

<D. Patra> <S.K. Behera>

Member (DSC) Member (DSC)

————————— —————————

<A. Sahoo> <Satyabrata Jit>

Member (DSC) Examiner

—————————

<A.K. Panda>

Chairman (DSC)


Prof./Dr. <Prasanna Kumar Sahu>

Associate Professor

July 2, 2016

Supervisor’s Certificate

This is to certify that the work presented in this dissertation entitled

”Performance Enhancement of Automatic Speech Recognition (ASR) Using Robust Wavelet-Based Feature Extraction Techniques” by ”Astik Biswas”, Roll Number 512EE103, is a record of original research carried out by him/her under my supervision and guidance in partial fulfillment of the requirements of the degree of Doctor of Philosophy in Electrical Engineering. Neither this dissertation nor any part of it has been submitted for any degree or diploma to any institute or university in India or abroad.

<Supervisor’s Signature>

<P. K. Sahu>


to

The Dreams and Sacrifices

of my beloved Wife (Adra),

Mom, Dad and Grandmother


I, <Astik Biswas>, Roll Number <512EE103> hereby declare that this dissertation entitled ”Performance Enhancement of Automatic Speech Recognition (ASR) Using Robust Wavelet-Based Feature Extraction Techniques”

represents my original work carried out as a

doctoral/postgraduate/undergraduate student of NIT Rourkela and, to the best of my knowledge, it contains no material previously published or written by another person, nor any material presented for the award of any other degree or diploma of NIT Rourkela or any other institution. Any contribution made to this research by others, with whom I have worked at NIT Rourkela or elsewhere, is explicitly acknowledged in the dissertation. Works of other authors cited in this dissertation have been duly acknowledged under the section ”Bibliography”.

I have also submitted my original research records to the scrutiny committee for evaluation of my dissertation.

I am fully aware that in case of any non-compliance detected in future, the Senate of NIT Rourkela may withdraw the degree awarded to me on the basis of the present dissertation.

July 2, 2016

NIT Rourkela Astik Biswas


Success in life is never attained single-handedly. A Ph.D., especially, requires support, encouragement and motivation from society. So I owe thanks to the many people who supported me on this long journey towards the Ph.D.

My deepest and most sincere gratitude goes to my thesis supervisor, Prof. (Dr.) Prasanna Kumar Sahu, for his guidance, advice, ideas, help and encouragement throughout my research work. I am very fortunate to have had the opportunity to work under his supervision.

It has been an honour to work for and with him. Without his patient and inspiring guidance, encouragement and support this work would not have been possible. I also thank him for his insightful comments and suggestions, which continually helped me to improve my understanding. I am confident that his trust and love will make me work even harder in the future.

I express my sincere gratitude to my doctoral committee members, Prof. S. K.

Behera and Prof. A. Sahoo of Department of Electronics and Communication Engineering; Prof. A. K. Panda and Prof. (Mrs.) D. Patra of Department of Electrical Engineering for taking the time to review my work, asking questions and giving constructive suggestions. I am very much obliged to the Director (Prof. S. K. Sarangi), two Heads of Electrical Engineering Department in my tenure of work (Prof. A. K.

Panda and Prof. J. K. Satapathy) for providing all possible facilities towards this work. Thanks to all other faculty and staff members in the department.

Words are not enough to express my sincere gratitude and respect to my M.E.

supervisor Prof. M. Chandra, Department of Electronics and Communication Engineering, Birla Institute of Technology, Mesra for his insightful comments, suggestions and motivation at various occasions. Since 2008, his hard working attitude, depth of knowledge and high dedication towards research have inspired me to mature into a better researcher. I am extremely grateful to Prof. O. Farooq, Professor, Aligarh Muslim University, Aligarh for his encouragement in the field of Research, which motivated me.

I would like to mention four names of our department: Mr. K. P. Pradhan, Mr. S. K.

Mahapatra, Mr. D. Rout and Mr. Y. Patnaik, who have spent countless hours assisting me with research work and provided timely troubleshooting and support throughout the process. Thanks for your friendship, and for those tea-times and parties which helped break the monotony of research and reset the mind for the next morning. I would also like to thank two very close buddies of mine, Mr. A. Bhowmick and Mr. W. Bhowmik, for their moral support and motivation. The former helped me a lot with simulation and database collection.

Doing a Ph.D. is a long journey, full of ups and downs. I consider myself lucky to have a great group of friends and colleagues who made my life at NIT Rourkela very enjoyable. I feel blessed to have made so many good buddies: Mrs. S. Parija, Ms. J.

Mishra, Mr. D. Panigrahy, Mr. S. Nanda, Mr. D. De, Mr. M. Rakshit, Mr. D. Singh, Ms. S. Panda, Ms. S. Pradhan, Mr. S. Gupta, Mrs. S. Mahapatra, Mrs. P. Mahanta, Mr. S. Mahanta, & Mr. P. K. Sahu. Further, Madam (Mrs.) B. Sahu holds a special


made by her for our get together. I am sure that I have missed a lot of names, so please be indulgent.

There are no words with which I can pay tribute to my grandmother, Late Mrs. Sabita Rani Biswas, for all that she had done for me. My beloved parents, Mr. P. K. Biswas and Mrs. M. Biswas, supported me morally, emotionally and financially during these years. I adore my better half, Mrs. Adra, especially for bearing a disproportionate share of the work to maintain our home and family. Her love and patience supported me through the difficult times. Her smile washes away my everyday tiredness and fills me with joy.

I would like to thank my father-in-law and mother-in-law, Mr. T. Bishnu and Mrs. S.

Bishnu, whose encouragement has helped keep me working over the years. I would also like to thank my other family members Mr. A. Ghosh, Ms. A. Ghosh, Mr. R. Bishnu, Mr. S. Biswas, Mrs. A. Ghosh, Mr. M. Biswas, Mr. R.N. Biswas, Mr. P. Biswas, Mr.

S. Sinha, Mrs. M. Sarkar, Mr. S. Sarkar, and Mr. S. Bose. They supported me morally and emotionally during this journey towards Ph.D.

I also would like to show my sincere gratitude towards the IMS Engineering College, Ghaziabad for granting me the study leave to pursue Ph.D. at NIT Rourkela.

It is also necessary to acknowledge the various sources of funding which provided support during this Ph.D. –without it the research would not have been possible. In particular, I would like to thank the Ministry of Human Resources and Development (MHRD), Govt. of India for providing funding and scholarship to carry out the research work.

Last but not least, I would like to thank Mr. Larry Page, co-founder of Google, without which this research would not have been possible to carry out. I would also like to thank the developers of Windows, Linux, HTK, and MATLAB.

I pray for their happiness and good health, and I dedicate my work to them in the most sincere way I can think.

Finally, I thank God, who has been kind enough to me to perform this work.

Now that I have found the gold mine, I will keep excavating it and make the most of it. This thesis has come to its end, but it is only the beginning of my exploration.

Astik Biswas


In this era of smart applications, Automatic Speech Recognition (ASR) has established itself as an emerging technology that is becoming popular day by day.

However, the accuracy and reliability of these systems are restricted by acoustic conditions such as background noise and channel noise. Thus, there remains a considerable gap in human-machine communication, owing to the systems' lack of robustness in complex auditory scenes. The objective of this thesis is to enhance the robustness of the system in complex auditory environments by developing new front-end acoustic feature extraction techniques. The pros and cons of the different techniques are also highlighted.

In recent years, wavelet-based acoustic features have become popular for speech recognition applications. The wavelet transform is an excellent tool for time-frequency analysis with good signal denoising properties. New auditory-based Wavelet Packet (WP) features are proposed to enhance system performance across different types of noisy conditions. The proposed technique is designed and developed so that it mimics the frequency response of the human ear according to the Equivalent Rectangular Bandwidth (ERB) scale. In the subsequent chapters, further developments of the proposed technique are discussed using Sub-band based Periodicity and Aperiodicity DEcomposition (SPADE) and harmonic analysis. TIMIT (English) and CSIR-TIFR (Hindi) phoneme recognition tasks are carried out to evaluate the performance of the proposed techniques.

The simulation results demonstrate the potential of the proposed techniques to enhance system accuracy over a wide range of SNRs.

Further, the visual modality plays a vital role in computer vision systems when the acoustic modality is disturbed by background noise. However, most systems rarely address the visual-domain problems that must be solved to make them work in real-world conditions.

A multiple-camera protocol gives the system more flexibility by allowing speakers to move freely. In the last chapter, consideration is given to Audio-Visual Speech Recognition (AVSR) implementation in vehicular environments, which resulted in one novel contribution: the one-way Analysis Of Variance (ANOVA)-based camera fusion strategy. Multiple-camera fusion is an imperative part of multiple-camera computer vision applications. The ANOVA-based approach is proposed to study the relative contribution of each camera for AVSR experiments in in-vehicle environments.

A four-camera automotive audio-visual corpus is used to investigate the performance of the proposed technique.

Speech is a primary medium of communication for humans, and various speech-based applications can work reliably only if the performance of ASR is improved across different environments. In the modern era, there is vast potential and immense possibility for using speech effectively as a communication medium between human and machine. Robust and reliable speech technology enables people to experience the full benefits of Information and Communication Technology (ICT).


Acronym Description

ASR Automatic Speech Recognition

DARPA Defense Advanced Research Projects Agency

IVR Interactive Voice Response

ADC Analogue to Digital Converter

LPC Linear Predictive Coding

MFCC Mel Frequency Cepstral Coefficients

PLP Perceptual Linear Prediction

GFCC Gammatone Filter Cepstral Coefficients

HMM Hidden Markov Model

SD Speaker Dependent

SI Speaker Independent

VT Vocal Tract

STFT Short Time Fourier Transform

AVSR Audio Visual Speech Recognition

ERB Equivalent Rectangular Bandwidth

WT Wavelet Transform

WP Wavelet Packet

WERBC ERB like Wavelet Cepstral Features

WERB-SPADE ERB like Wavelet Sub-band Periodic and Aperiodic DEcomposition

HEF Harmonic Energy Features

ANOVA ANalysis of VAriance

FIR Finite Impulse Response

EM Expectation Maximization

DP Dynamic Programming

DTW Dynamic Time Warping

ANN Artificial Neural Network

HTK Hidden Markov Toolkit

CMU Carnegie Mellon University

TI Texas Instrument

MIT Massachusetts Institute of Technology

CI Context Independent

CD Context Dependent

EARS Effective Affordable Reusable Speech-to-text

MLLR Maximum Likelihood Linear Regression

SMAR Structured Maximum a Posteriori

TIFR Tata Institute of Fundamental Research

CEERI Central Electronics Engineering Research Institute

CSIR Council of Scientific & Industrial Research

FFT Fast Fourier Transform



DCT Discrete Cosine Transform

MRA Multi Resolution Analysis

CWT Continuous Wavelet Transform

DWT Discrete Wavelet Transform

PRA Phoneme Recognition Accuracy

PER Phoneme Error Rate

SNR Signal to Noise Ratio

SPADE Sub-band Periodic and Aperiodic DEcomposition

SPADE-QUEEN SPADE-freQUEncy domain ENhancement

RTF Real-Time Factor

HC Harmonic Component

HLDA Heteroscedastic Linear Discriminant Analysis

AF Articulatory Features

FVO Variable Order Formant

AVICAR Audio Visual In CAR

CL Central Left Camera

CR Central Right Camera

SL Side Left Camera

SR Side Right Camera

SHMM Synchronous Hidden Markov Model

ROI Region Of Interest

VJ Viola-Jones (VJ) Algorithm

MI Middle Integration

LI Late Integration

IDL Idle Condition

35U 35mph with window Up

35D 35mph with window Down

55U 55mph with window Up

55D 55mph with window Down

SVM Support Vector Machine

AAM Active Appearance Model

ZM Zernike Moments


Symbol Description

G(z) Glottal Filter

u(n) Excitation Signal

H(z) Vocal Tract Filter

S Silence

U Unvoiced

V Voiced

t Time

s(n) Speech Signal

f Frequency

fc Central Frequency

w(l) Hamming Window

cm mth Cepstral Coefficient

fmel Mel Frequency

fBark Bark Frequency

fERB ERB Frequency

g(t) Impulse Response of Gammatone Filter

∆ Delta Features

∆∆ Delta Delta Features

aij State Transition Probability from State i to j

N Number of States

M Number of Mixtures

V Observation Symbol Probability Distribution

O Observation Sequence

π Initial State Distribution

λ Model Parameter



Symbol Description

αt(i) Forward Variable

βt(i) Backward Variable

ψ(t) Mother Wavelet

φ(t) Scaling Function

g[.] High-pass Filter

h[.] Low-pass Filter

Al(j) Approximation Coefficient

Dl(j) Detail Coefficient

|Bi(ω, k)| Residual Power Spectrum

api(k) Aperiodic Feature

pi(k) Periodic Feature

Sv(t) Harmonic Model of Voiced Frame

T Pitch Period

αA Audio Weighting Parameter

αC Central Camera Weighting Parameter

αL Side Left Camera Weighting Parameter

Z¯ Global Mean

σ2i Variance

H0 Null Hypothesis

WAcc Word Recognition Accuracy

θ Wave Incident Angle

C Sound Velocity

d Distance Between Two Adjacent Microphone


Certificate of Examination i

Supervisor’s Certificate ii

Dedication iii

Declaration of Originality iv

Acknowledgment v

Abstract vii

List of Acronyms viii

List of Symbols x

List of Figures xv

List of Tables xviii

1 Prologue 1

1.1 Overview of Automatic Speech Recognition . . . 3

1.1.1 Application of Speech Recognition . . . 4

1.1.2 Challenges of Speech Recognition . . . 4

1.2 Motivation of the Work . . . 5

1.3 Objective of the Work . . . 6

1.4 Organization of the Thesis . . . 6

1.5 Summary . . . 8

2 Literature Review on Speech Recognition 9

2.1 Introduction . . . 10

2.1.1 Chapter Organization . . . 10

2.2 Speech Production . . . 10

2.2.1 Speech Representation . . . 12

2.2.2 Phonetics of the Language . . . 13

2.3 Speech Feature Extraction Techniques . . . 14

2.3.1 Preprocessing . . . 15

2.3.2 Framing and Windowing. . . 15

2.3.3 Feature Extraction Technique . . . 15

2.3.4 Dynamic Features and Normalization . . . 22

2.4 Hidden Markov Model (HMM) as a Speech Recognizer . . . 23

2.4.1 The Urn and Ball Model. . . 24

2.4.2 Elements of an HMM . . . 25


2.5 Literature Review . . . 31

2.6 Research Methodology . . . 33

2.6.1 Research Design . . . 33

2.6.2 Database Selection . . . 34

2.6.3 Simulation Platform . . . 37

2.7 Summary . . . 37

3 Auditory ERB Scale like Wavelet Packet Decomposition 38

3.1 Introduction . . . 39

3.1.1 Chapter Organization . . . 39

3.2 Theoretical Background . . . 39

3.2.1 Continuous Wavelet Transform (CWT) . . . 40

3.2.2 Discrete Wavelet Transform (DWT) . . . 41

3.2.3 Wavelet Packet Decomposition . . . 43

3.3 WP-based Acoustic Feature Extraction. . . 44

3.3.1 Mel Filter like WP Decomposition . . . 44

3.3.2 Proposed Auditory ERB like Wavelet Acoustic Features . . . 45

3.4 Experimental Framework . . . 48

3.5 Results and Discussion . . . 49

3.5.1 TIMIT English Corpus . . . 49

3.5.2 CSIR-TIFR Hindi Corpus . . . 53

3.6 Summary . . . 56

4 Auditory ERB like WP Sub-band Periodic and Aperiodic Decomposition 57

4.1 Background . . . 58

4.1.1 Chapter Organization . . . 59

4.2 WP-based SPADE Analysis . . . 59

4.2.1 WERB-SPADE Feature Extraction Technique. . . 59

4.2.2 WMEL-SPADE Feature Extraction Technique . . . 61

4.3 Results and Discussions . . . 61

4.3.1 TIMIT English Corpus . . . 62

4.3.2 CSIR-TIFR Hindi Corpus . . . 66

4.4 Computational Time . . . 69

4.5 Summary . . . 70

5 Auditory ERB-like WP Sub-band Harmonic Energy Features 72

5.1 Background . . . 73

5.1.1 Chapter Organization . . . 74

5.2 Proposed WP Sub-band Harmonic Feature Extraction . . . 74

5.2.1 Harmonic Model of Voiced Speech Signal . . . 74

5.2.2 Sub-band Harmonic Component Extraction Using the Sine Function Convolution Technique . . . 74


5.3 Results and Discussions . . . 77

5.3.1 TIMIT English Corpus . . . 78

5.3.2 CSIR-TIFR Hindi Corpus . . . 82

5.4 Computational Time . . . 86

5.5 Summary . . . 86

6 Audio-Visual Speech Processing with Multiple Cameras in Car Environment 88

6.1 Background . . . 89

6.1.1 Related Work and Motivation . . . 90

6.1.2 Chapter Organization . . . 91

6.2 AVICAR Corpus Evaluation Protocol . . . 91

6.3 Acoustic and Visual Feature extraction. . . 92

6.3.1 Visual Feature Extraction . . . 92

6.3.2 Acoustic Feature Extraction . . . 93

6.3.3 Audio and Video Features Synchronization . . . 93

6.4 Experimental Framework-I: SHMM Fusion. . . 93

6.4.1 Audio and Visual Speech HMM Modeling . . . 93

6.4.2 Performance Evaluation . . . 95

6.5 Experimental Framework-II: ANOVA Fusion . . . 95

6.5.1 Statistical Model and Null Hypothesis Test . . . 96

6.5.2 Camera Fusion . . . 99

6.5.3 Audio and Visual Modality Fusion . . . 99

6.6 Results and Discussions . . . 100

6.6.1 Single Camera Experiment . . . 101

6.6.2 Multiple Camera VSR Experiment . . . 102

6.6.3 Multiple Camera AVSR Experiment . . . 104

6.7 Summary . . . 106

7 Conclusion and Future Scope 107

7.1 Summary of the Work . . . 108

7.1.1 Findings . . . 109

7.1.2 Shortcomings . . . 110

7.2 Future Scope . . . 110

Bibliography 112

Dissemination 123

Appendix A 126

Appendix B 129

Author’s Biography 131


1.1 Important developments and milestones of ASR technology over last seven

decades. . . 2

1.2 Typical components of an ASR system. . . 3

1.3 Organization of the Thesis . . . 6

2.1 Schematic diagram of the human speech production and perception process. A: Speech Formulation, B: Speech Production, C: Acoustic Wave in Air, D: Speech Perception, E: Speech Comprehension . . . 10

2.2 Human vocal tract . . . 11

2.3 Discrete time speech production model . . . 11

2.4 Three-state representation of the speech signal . . . 12

2.5 Spectrogram of a speech signal shown . . . 13

2.6 An example of phonetic transcription of a speech signal . . . 13

2.7 Phoneme classification chart. . . 14

2.8 Feature extraction steps . . . 14

2.9 Blocking of speech into overlapping frames. . . 15

2.10 Equivalent model of human vocal tract . . . 16

2.11 LPC feature calculation . . . 16

2.12 Mapping of frequency from Hz to Mel . . . 17

2.13 24 band mel scale filter bank . . . 18

2.14 MFCC feature calculation . . . 18

2.15 PLP feature calculation . . . 19

2.16 Bark scale filterbank . . . 20

2.17 Mapping of frequency from Hz to Bark. . . 20

2.18 GFCC feature calculation . . . 21

2.19 Gammatone filterbank . . . 22

2.20 Mapping of frequency from Hz to ERB . . . 22

2.21 A simple five state left-to-right HMM with three emitting and two non emitting state . . . 24

2.22 The Urn and Ball model . . . 24

2.23 Sequence of operation needed to calculate the forward variable . . . 27

2.24 Sequence of operation needed to calculate the backward variable βt(i) . . . 28

2.25 The sequence of operations required for the joint event computation when the system is in state Si at time t, and in state Sj at time t+1 . . . 30

2.26 Block diagram of the ASR. . . 34


2.28 Audio and video recording setup in AVICAR database.. . . 37

3.1 Comparison between STFT and WT . . . 41

3.2 Subband decomposition using DWT . . . 42

3.3 Wavelet packet decomposition . . . 43

3.4 24 sub-band WP tree-based on MEL scale. . . 44

3.5 Steps of acoustic WMFCC feature extraction technique. . . 45

3.6 24 sub-band WP tree-based on ERB scale.. . . 46

3.7 Typical hearing frequency distribution of human speech . . . 47

3.8 Steps of WERBC acoustic feature extraction technique . . . 48

3.9 A part of tree-based triphone clustering technique for triphone/ah/; leaf nodes (gray) represents the final cluster. . . 49

3.10 Phoneme Recognition accuracy (%) with different types of noise at different SNR levels. . . 51

3.11 Comparative performance of the WERBC features over the MFCC, GFCC and WMFCC features in terms of relative gain/loss (%) . . . 53

3.12 Phoneme Recognition accuracy (%) with different types of noise at different SNR levels. . . 55

3.13 Comparative performance of the WERBC features over the MFCC, GFCC and WMFCC features in terms of relative gain/loss (%) . . . 56

4.1 Block diagram of the WERB-SPADE analysis . . . 59

4.2 Block diagram of the WMEL-SPADE analysis. . . 61

4.3 Phoneme Recognition accuracy (%) with different types of noise at different SNR levels. . . 63

4.4 Comparative performance of the WERBC-SPADE features over the other features in terms of relative gain/loss (%) . . . 65

4.5 Phoneme Recognition accuracy (%) with different types of noise at different SNR levels. . . 68

4.6 Comparative performance of the WERBC-SPADE features over the other features in terms of relative gain/loss (%) . . . 69

5.1 (a): 20 ms to 40 ms segment of Hindi vowel ‘/A/’ from the CSIR-TIFR Hindi database: (b)-(g) 1st to 6th harmonic components . . . 76

5.2 Block diagram of the WERBC-HEF feature extraction technique . . . . 77

5.3 Block diagram of the WMFCC-HEF feature extraction technique . . . . 78

5.4 Phoneme Recognition accuracy (%) with different types of noise at different SNR levels. . . 80

5.5 Comparative performance of the WERBC-SPADE features over the other features in terms of relative gain/loss (%) . . . 82

5.6 Phoneme Recognition accuracy (%) with different types of noise at different SNR levels. . . 84


6.1 Example of captured video frames of a female speaker in AVICAR database while car was traveling in four different driving conditions . . . 92

6.2 Audio visual feature synchronization . . . 94

6.3 Five-stream audio-visual SHMM. A and V represents acoustic and visual HMM respectively . . . 94

6.4 Overview of forming the five-stream audio visual SHMM . . . 95

6.5 H0 test between each pair of camera mounted in the AVICAR database . . . 97

6.6 Gender specific H0 test between each pair of camera mounted in the AVICAR database . . . 98

6.7 H0 test across the visual features across all camera stream . . . 99

6.8 Proposed process flow diagram of the AVICAR AVSR experiment . . . 100

6.9 Overview of proposed ANOVA-based multiple camera AVSR system . . . 100

6.10 Performance of single camera VSR system across all driving conditions (average of all testing folds of the AVICAR protocol) . . . 101

6.11 Avg. word recognition accuracy in all driving conditions across all cameras (gender specific test results are shown by dashed lines) . . . 101

6.12 Avg. word recognition accuracy of two-camera VSR across all testing folds . . . 103

6.13 Avg. word recognition accuracy of four-camera VSR in all driving conditions across all cameras, averaged across all testing folds of the AVICAR protocol . . . 103

6.14 WAcc(%) of five-stream AVICAR AVSR experiments across all noise conditions . . . 105

6.15 Avg. word recognition accuracy of five-stream AVSR in all driving conditions across all cameras, averaged across all testing folds of the AVICAR protocol . . . 105


3.1 Comparison of frequency bands of 24 uniformly spaced Mel scale and

wavelet packet sub-band . . . 45

3.2 Comparison of frequency bands of 24 uniformly spaced ERB scale and wavelet packet sub-band . . . 47

3.3 TIMIT train and test set . . . 49

3.4 PRA(%) of different types of feature extraction techniques . . . 50

3.5 Detailed phoneme error rate (PER) with clean test set along with phoneme error distribution (%) given in parenthesis . . . 50

3.6 Broad phonetic group of all the 39 TIMIT phoneme . . . 52

3.7 Detailed PRA(%) for different phoneme classes with clean data . . . 52

3.8 CSIR-TIFR train and test set . . . 53

3.9 PRA(%) of different types of feature extraction techniques . . . 53

3.10 Detailed PER with clean test set along with phoneme error distribution (%) given in parenthesis . . . 54

3.11 Broad phonetic group of all the 66 CSIR-TIFR Hindi phoneme . . . 55

3.12 Detailed PRA(%) for different phoneme classes with clean data . . . 55

4.1 PRA(%) of different types of feature extraction techniques . . . 62

4.2 Detailed PER with clean test set along with phoneme error distribution (%) given in parenthesis . . . 62

4.3 Detailed PRA(%) for different phoneme classes with clean data . . . 64

4.4 PRA(%) of different types of feature extraction techniques . . . 66

4.5 Detailed PER with clean test set along with phoneme error distribution (%) given in parenthesis . . . 67

4.6 Detailed PRA(%) for different phoneme classes with clean data . . . 68

4.7 Average computational complexities of different WP-based feature extraction techniques for the TIMIT and CSIR-TIFR database . . . 70

5.1 PRA(%) of different types of feature extraction techniques . . . 78

5.2 Detailed PER with clean test set along with phoneme error distribution (%) given in parenthesis . . . 79

5.3 Detailed PRA(%) of different phoneme classes with clean data . . . 80

5.4 Comparative PRA(%) of different systems on TIMIT database in clean condition . . . 81

5.5 PRA(%) of different types of feature extraction techniques . . . 83


5.7 Detailed PRA(%) of different phoneme classes with clean data . . . 84

5.8 Average computational complexities of different WP-based feature extraction techniques for the TIMIT and CSIR-TIFR database . . . 86

6.1 Grouping of speakers in AVICAR database for AVSR experiment . . . 91

6.2 AVICAR AVSR system experimental fold . . . 92


Prologue


mechanism drives the advancement of speech technology. Artificial intelligence and robotics would not have developed significantly without the improvement of Automatic Speech Recognition (ASR) systems. Speech as a medium of man-machine interaction has been gaining importance in this modern computer/mobile era.

Mr. Bill Gates, co-founder of Microsoft Corporation, hailed speech recognition as one of the most important innovations for future computer operating systems [1]. His vision was recently fulfilled when Microsoft launched Windows 10 in July 2015.

Windows 10 comes with a personal digital assistant named 'Cortana' that is based on speech recognition and synthesis. Google and Apple also have their own intelligent digital assistants, known as 'Google Now' and 'Siri' respectively. Each of these personal assistants is developed using different forms of artificial intelligence and personalization. However, all three assistants communicate with the user through speech to perform web search, entertainment, weather information, travel assistance, reminders, nearby places of interest, etc.

Speech recognition is a tool to ease man-machine interaction. ASR is the process of understanding a sequence of spoken words using a set of algorithms.

The earliest attempt to recognize speech was made by Bell Laboratories in 1952 [2]. They designed a single-speaker isolated digit recognizer. The system was based on measuring the spectral resonances of the vowel region during the pronunciation of digits. Forgie and Forgie (1959) [3] made another attempt to recognize ten English vowels in a speaker-independent manner. Vowels were embedded in a /b/-vowel-/t/ context, and the first two formant frequencies were determined to recognize the vowel.

Later, various researchers tried to exploit the basics of acoustics [4]. After the fruitful implementation of a connected digit recognizer, ASR systems started to have a great impact in the fields of artificial intelligence, robotics, and computer vision. The ever-increasing range of applications of ASR makes speech recognition a challenging and exciting domain of research. The important research outcomes and advancements of speech recognition technologies over the last seven decades are presented graphically in Fig. 1.1.

Figure 1.1: Important developments and milestones of ASR technology over the last seven decades.

In 1973, a speech processing and understanding project was funded by the Defense Advanced Research Projects Agency (DARPA) [5] to implement continuous speech recognition, which led to many seminal systems. After the successful implementation of the connected digit recognizer, research in speech technology found a new direction towards voice-based man-machine interaction [6]. Linear Predictive Coding (LPC) [7], Mel Frequency Cepstral Coefficients (MFCC) [8], and the Hidden Markov Model (HMM) [9] are considered to be the most important breakthroughs in the advancement of speech technology.

1.1 Overview of Automatic Speech Recognition

Speech is a natural mode of communication for human beings. We come into the world without any knowledge or skills; we learn and adapt all the relevant skills during early childhood from our surroundings, and continue to rely on speech as the primary medium of communication throughout our daily lives [10]. Speech comes to us naturally, and we do not realize how complex a process lies behind speech production and perception. The human speech production and perception mechanism is built from the Vocal Tract (VT), articulators, ear, neural system and other biological organs, which are non-linear in nature. The operation of the speech production-perception system is affected by many parameters such as gender, age, pitch, articulation, accent, speed, pronunciation, background noise, echo, etc. All of these constraints make speech recognition a very challenging and complex problem.

A machine is not inherently able to understand uttered speech; we need to transcribe the acoustic speech into meaningful symbols to make it understand. The ultimate goal of research on ASR is to build a system that can communicate with humans in natural spoken language [11]. The typical components of an ASR system are shown in Fig. 1.2.

Figure 1.2: Typical components of an ASR system.

The concept behind an ASR system becomes clear with the following quote by Mr. Mokokoma Mokhonoana [12].

“A bad handwriting is as annoying to a reader...as an irritating voice is to a listener.”

Bad handwriting contains missing or irrelevant information that may cause problems for a reader; the reader may not perceive the information properly, as his/her brain has no knowledge of the poorly written script. The functionality of an ASR system becomes clear with this real-life analogy. The three primary components of an ASR system are:

• Signal Acquisition: Similar to writing equipment

• Feature Extraction: Similar to handwritten scripts

• Classifier: Similar to the knowledge of human brain

At first, the input speech is digitized and preprocessed to make it suitable for further processing. At the front-end, a feature extraction block extracts the most distinguishable properties of the speech signal. The purpose of the feature extraction module is to calculate a set of observation vectors that drive classifiers to model speech events in a probabilistic space. A good feature extraction module should extract the most significant features of the speech while minimizing background and channel noise. For recognition purposes, we also need a classifier at the back-end of the ASR system.

This module is trained with the features to build the acoustic model of speech events.

Humans communicate by a sequence of meaningful words with the knowledge of the lexical model. Hence, alongside an acoustic model, a classifier needs a good language model and a dictionary to produce a relevant output.

1.1.1 Application of Speech Recognition

There is a wide range of applications of ASR. Any task that involves a computer can potentially use ASR. The most popular uses of ASR are:

• Speech recognition

• Home automation

• Audio search

• Biometrics

• Speech to text and Text to speech

• Robotics

• Hearing aids for physically disabled people

• Personal assistant

• Car navigation system

• Hands free computing

• Interactive voice response (IVR)

1.1.2 Challenges of Speech Recognition

The performance of the system heavily depends on the conditions under which it is evaluated. Under restricted conditions, the system can attain high performance.

However, under uncontrolled conditions it is difficult to achieve higher accuracy. The performance of the system can differ according to the following conditions [10]:


Vocabulary size: It is quite easy to discriminate a word within a small vocabulary; error rates increase proportionately with the size of the vocabulary.

Small-vocabulary speech recognition systems (usually fewer than 100 words) are used in command and control applications such as Interactive Voice Response (IVR), voice dialing, providing instructions, etc. Large-vocabulary speech recognition systems (usually more than 5000 words) are used for continuous speech recognition, captioning of live audio/video programs, etc.

Speaker dependent or independent: It is always interesting to study the response of the system when an unknown speaker evaluates it. The system is trained with parameters that are highly speaker-specific. Usually, it is difficult to attain the same performance with a Speaker Independent (SI) system as with a Speaker Dependent (SD) system.

Isolated, discontinuous and continuous speech: Isolated speech deals with single-word experiments; discontinuous speech means full sentences in which words are artificially separated by silence; continuous speech means naturally uttered sentences. Isolated and discontinuous speech recognition are relatively easier tasks because word boundaries are easily spotted and words are clearly pronounced. In continuous speech, however, word boundaries are not clear, and pronunciations are affected by coarticulation. The task gets tougher when the system encounters spontaneous speech phenomena such as coughing, emotion, false starts, 'um', 'hm', incomplete sentences, etc.

Phonetics: A phoneme, in linguistics, is the fundamental unit of speech; phonemes are combined with other phonemes to form meaningful words. Phonemes are language-dependent, and an ASR system should have proper phonetic knowledge of the particular language.

Adverse conditions: Usually, a system is trained with clean acoustic data. The performance of the system is significantly affected by a range of adverse conditions [13]. These are background noise, acoustic mismatch, channel noise, etc.

1.2 Motivation of the Work

Speech recognition is a fascinating application of digital signal processing (DSP) with many real-world uses, and a close association between the technology and its applications is essential. Many techniques have been developed to increase performance.

Unfortunately, despite the success of ASR over the decades, its performance degrades in uncontrolled environments. This is particularly problematic in real-life applications where the level of noise changes frequently. To counteract this low performance, speech recognition research in the past few decades has centered on making systems robust for recognizing speech in noisy environments.

A speech signal is composed of both periodic and aperiodic components. Conventional Short Time Fourier Transform (STFT) based feature extraction techniques are not efficient at capturing the aperiodic part [14-16]. To address this problem, some researchers [17-19] introduced the Wavelet Transform (WT) [20, 21] to enhance the performance of ASR. However, the literature reports that wavelet features are less effective than STFT features on the periodic part of speech. This dissertation aims to extend the research on wavelet approaches and develop new auditory-based wavelet feature extraction methods by which performance can be enhanced in uncontrolled environments.

1.3 Objective of the Work

The main objectives of the thesis are:

• To develop an auditory motivated wavelet feature extraction technique that can ensure better performance in noisy condition.

• To propose a new wavelet feature extraction technique that can take care of periodic part of the acoustic speech as well.

• To analyze the performance of the proposed feature extraction techniques in various types of noisy conditions.

• To study the trade-off between feature dimension and performance.

• To study the performance of the proposed techniques in Hindi language, alongside the English language.

• To develop audio visual speech recognition systems for automotive applications in uncontrolled environments.

1.4 Organization of the Thesis

This work proposes a flow of systematic investigation of psycho-acoustically motivated wavelet acoustic features to enhance the performance of the ASR. The work is presented in sequential steps to demonstrate the trade-off between feature dimension and performance. After the performance analysis of acoustic wavelet features, we extend the study to implement a dual modality Audio-Visual Speech Recognition (AVSR) in the automotive environment.

Figure 1.3: Organization of the Thesis


The thesis is organized into seven chapters, beginning with this Prologue. The remainder of the thesis is organized as follows:

Chapter 2: (Fundamentals of the Speech Recognition) covers the fundamentals of speech production, phonetics, existing feature extraction techniques, HMM acoustic modeling and decoding. It is imperative to understand the underlying theories of speech recognition process to study and discuss new techniques aimed to improve the accuracy in disturbed environments. Some existing techniques for speech recognition are discussed with reference. The chapter also covers the overview of databases used in the work.

Chapter 3: (Psycho-acoustically Motivated Admissible Wavelet Packet Cepstral Features) presents new auditory Equivalent Rectangular Bandwidth (ERB) scale motivated wavelet packet cepstral features, which we name ERB-like Wavelet Cepstral Features (WERBC). These filters have a frequency band spacing analogous to the auditory ERB scale, whose central frequencies are equally distributed along the frequency response of the human cochlea. The performance of the proposed technique is analyzed at different noise levels.

Chapter 4: (ERB like Wavelet Sub-band Periodic and Aperiodic Decomposition) introduces a wavelet-based feature extraction technique that captures both periodic and aperiodic information to enhance the performance of wavelet features. This front-end feature processing technique is named Wavelet ERB Sub-band based Periodicity and Aperiodicity Decomposition (WERB-SPADE). We examine its performance on English and Hindi phone recognition tasks in noisy environments.

Chapter 5: (Auditory ERB-like WP Sub-band Harmonic Energy Features) introduces new WP sub-band features that consider the harmonic information of the speech signal. It has been observed that most of the voiced energy of the speech signal lies between 250 Hz and 2000 Hz. Thus, the proposed technique emphasizes the individual sub-band harmonic energy up to 2 kHz. The speech signal is decomposed into wavelet sub-bands, and Harmonic Energy Features (HEF) are combined with wavelet features to enhance the ASR performance.

Chapter 6: (Audio-Visual Speech Processing with Multiple Cameras in Car Environment) presents the one-way Analysis Of Variance (ANOVA) based approach to fuse multiple camera streams for AVSR in in-vehicle environments.

Multiple camera fusion technique is an imperative part of multiple camera computer vision applications.

Considering visual features alongside traditional acoustic features yields promising results in complex auditory environments. Multiple-camera fusion is an imperative part of multiple-camera computer vision applications; based on the ANOVA analysis, the multiple camera streams are fused into one visual feature vector. The dual-modality AVSR system shows promising results across all driving conditions compared with conventional AVSR systems.

Chapter 7: (Conclusion and Future Scope) summarizes the work presented in the thesis and outlines the scope of future research that needs further investigation.

1.5 Summary

This chapter deals with introduction and history of ASR. The motivation and objectives of this research are also discussed. The aspiration of this research is to extract robust acoustic as well as visual features to enhance system performance in uncontrolled environments. This chapter also discusses the design issues and applications of speech recognition. Finally, a brief plan of work is described.


Literature Review on Speech Recognition

This chapter presents the relevant background in the field of speech recognition, which must be understood properly before deciding the approach to be used for speech recognition. After discussing the speech production mechanism and phonetics, we explain the conventional feature extraction techniques, followed by Hidden Markov Models in the context of the work. Limitations of the existing STFT-based feature extraction techniques are also discussed. These preliminary studies help in understanding the issues of ASR and motivate the design of the research methodologies needed to develop new ASR techniques.


2.1 Introduction

Speech is the primary communication medium of humans. Human communication can be thought of as a sequence of five comprehensive processes from speech production to speech perception between the speaker and the listener [4], as shown in Fig. 2.1.

Figure 2.1: Schematic diagram of the human speech production and perception process.

A: Speech Formulation, B: Speech Production, C: Acoustic Wave in Air, D: Speech Perception, E: Speech Comprehension

The first process (A: Speech formulation) is associated with the thinking and formulation of the speech in the speaker's mind. This formulation is transformed into an acoustic speech wave (C: Acoustic wave in air) through the human vocal mechanism (B: Human speech production). The acoustic waveform is transferred through the air to the listener. During this propagation, the acoustic wave may be affected by external sources, for example noise, resulting in a more complex waveform. The human hearing system (D: Speech perception) perceives the speech signal, and the listener's mind starts to process the waveform to understand the information. Speech perception can be viewed as the inverse of speech production. It is worth mentioning that the speech production and perception mechanism is a nonlinear process, as it involves interaction between a number of organs and aerodynamics [22, 23].

2.1.1 Chapter Organization

The rest of the chapter is organized as follows. Section 2.2 presents the theoretical background of speech production and representation. Section 2.3 describes the different existing front-end feature extraction techniques for speech recognition. Section 2.4 gives a basic idea of the HMM as a speech recognizer. Section 2.5 briefly reviews the existing literature on speech recognition. The research methodology is given in Section 2.6, and Section 2.7 summarizes the chapter.

2.2 Speech Production

To understand the speech production mechanism, one needs to know the basic functions of the human vocal tract. Fig. 2.2 shows the structure of the human vocal tract [24].

Figure 2.2: Human vocal tract

The vocal tract, together with the mouth and nasal cavity, plays the important role in producing the speech signal. The vocal tract acts as a filter, and its input comes from the lungs and trachea through the larynx. The trapdoor mechanism of the velum is used to generate nasal sounds when required [4]. When the velum is in its lower position, the nasal cavity is connected with the vocal tract to produce the desired speech signal. The cross-sectional area of the vocal tract varies from 0 to 20 cm² [7].

Figure 2.3: Discrete time speech production model

The normal breathing mechanism allows air to enter the lungs. Air is released from the lungs into the trachea and then forces the vocal cords to vibrate within the larynx. The airflow is cleaved into a chain of quasi-periodic pulses, modulated in frequency, and passes through the pharynx (throat cavity), mouth cavity and nasal cavity. Depending on the positions of the various articulators (mouth, velum, jaw, lips, tongue, etc.), different sounds are generated. The lungs and trachea control the intensity of the generated speech, but usually they do not make an audible contribution to the acoustic speech [25].


It is helpful to consider the discrete-time model of speech production [4, 26, 27] shown in Fig. 2.3. The impulse generator can be thought of as the lungs, exciting the glottal filter G(z) to generate the excitation signal u(n); G(z) plays the role of the vocal cords in the human vocal mechanism. The gain G is a factor that controls the volume of the generated speech. The vocal tract and the nasal cavity are modeled by the vocal tract filter H(z), which depends on a number of vocal tract parameters. Put simply, the excitation signal is convolved over time with the impulse response of the vocal tract to produce the speech signal.
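To make this source-filter picture concrete, the short Python/NumPy sketch below synthesizes a crude voiced sound by driving an all-pole vocal tract filter H(z) with a quasi-periodic impulse train u(n) scaled by a gain G. The sampling rate, pitch, gain and resonance frequencies are arbitrary illustrative assumptions, not values taken from this thesis.

import numpy as np
from scipy.signal import lfilter

fs = 16000                          # sampling rate in Hz (assumed)
f0 = 120                            # pitch of the impulse-train excitation (assumed)
G = 0.05                            # gain factor controlling the volume

# Quasi-periodic impulse train u(n): one pulse per pitch period.
u = np.zeros(int(0.5 * fs))
u[::int(fs / f0)] = 1.0

# All-pole vocal tract filter H(z) = 1 / (1 - sum_i a_i z^-i), built here from
# two illustrative resonances (rough "formants") at 500 Hz and 1500 Hz.
poles = []
for fk in (500.0, 1500.0):
    ang = 2 * np.pi * fk / fs
    poles += [0.97 * np.exp(1j * ang), 0.97 * np.exp(-1j * ang)]
a = np.poly(poles).real             # denominator coefficients of H(z)

# s(n) = G*u(n) passed through H(z): the excitation convolved with the
# vocal tract impulse response.
speech = lfilter([G], a, u)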

2.2.1 Speech Representation

A speech signal is a slowly time-varying signal and is non-stationary in nature. We assume the speech signal is short-time stationary when examined over a short period of time (between 5 and 100 ms) [7]. Over longer periods, however, the characteristics of the speech signal change according to the different speech sounds spoken by the speaker. The signal characteristics are analyzed using the state of the speech production source.

Figure 2.4: Three-state representation of the speech signal. (a) The whole speech sequence; (b) first 25 ms of the speech with the three-state segmentation.

The three-state representation is the simplest and most straightforward way to represent the events in speech, as shown in Fig. 2.4. The three states are:

Silence (S) : No speech is produced.

Unvoiced (U): The vocal cords are not vibrating; the resulting speech is random and aperiodic in nature. Unvoiced speech is generated when the random signal generator (Fig. 2.3) acts as the excitation source.

Voiced (V): The vocal cords are excited by the impulse train generator and vibrate periodically; the resulting speech signals are quasi-periodic in nature. Quasi-periodic means that the speech signal is periodic over a short time span (5-100 ms) during which it is stationary.

However, the automatic segmentation of the speech into three defined states is not a straightforward approach. It is difficult to distinguish a weak sound from silence.

The boundaries between different states are not exactly defined. Hence, small errors in boundary conditions are ignored in most of the speech recognition applications.
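As a rough illustration of how frames might be assigned to the three states in practice, the sketch below labels each frame from its short-time energy and zero-crossing rate. This simple heuristic and its thresholds are assumptions made only for illustration; it is not the segmentation method used in this thesis and, as noted above, weak sounds near the silence boundary are easily mislabeled.

import numpy as np

def three_state_labels(x, fs, frame_ms=25, shift_ms=10, e_sil=1e-4, zcr_uv=0.25):
    # Label each frame Silence (S), Unvoiced (U) or Voiced (V).
    # The thresholds e_sil and zcr_uv are arbitrary and would need tuning.
    N = int(fs * frame_ms / 1000)
    K = int(fs * shift_ms / 1000)
    labels = []
    for start in range(0, len(x) - N + 1, K):
        frame = x[start:start + N]
        energy = np.mean(frame ** 2)                          # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # zero-crossing rate
        if energy < e_sil:
            labels.append('S')      # negligible energy: silence
        elif zcr > zcr_uv:
            labels.append('U')      # noise-like, many crossings: unvoiced
        else:
            labels.append('V')      # quasi-periodic, few crossings: voiced
    return labels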

Spectral domain analysis is an alternative way to characterize and represent the speech signal. The sound spectrogram is a popular approach to study the signal characteristics. Fig. 2.5 shows the spectrogram of the speech signal shown in Fig. 2.4a. The darkest blue parts represent the silent parts of the speech, where no speech is produced, and the red parts show the intensity of the produced sound. The spectrogram is calculated using 25 ms Hamming-windowed frames with 10 ms of overlapping.

Figure 2.5: Spectrogram of the speech signal shown in Fig. 2.4a
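A spectrogram with the analysis settings quoted above can be computed with SciPy as sketched below. Interpreting the 10 ms figure as the frame shift (as in Section 2.3.2) is an assumption of this sketch, and the colour interpretation depends on the plotting colormap rather than on the computation itself.

import numpy as np
from scipy.signal import spectrogram

def speech_spectrogram(x, fs=16000):
    # 25 ms Hamming-windowed frames with a 10 ms shift between frames.
    nperseg = int(0.025 * fs)
    hop = int(0.010 * fs)
    f, t, Sxx = spectrogram(x, fs=fs, window='hamming',
                            nperseg=nperseg, noverlap=nperseg - hop)
    return f, t, 10.0 * np.log10(Sxx + 1e-12)   # log-power spectrogram in dB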

2.2.2 Phonetics of the Language

Any language can be described in terms of a set of discriminative sounds called phonemes.

Phonemes are basic units of speech that can convey all meanings of a particular language.

Phonetics is a branch of linguistics that involves the study of sounds of human speech.

Figure 2.6: An example of phonetic transcription of a speech signal


Figure 2.7: Phoneme classification chart [4]

It deals with the physical properties of speech sounds such as physiological production and acoustic properties of phones [7, 27]. Phonemes are highly language specific and each language has different phoneme set. There are 48 phonemes for English [7], and 66 for Hindi [28].

Phonetic representation can be seen as a way to transcribe the different parts of a speech waveform, as shown in Fig. 2.6. Phonemes are divided into continuant and non-continuant classes. A phoneme is continuant if the speech sound is produced by a fixed vocal tract configuration. On the other hand, non-continuant phonemes are produced when the vocal tract configuration changes over time during the production of speech.

For example, if the area of the vocal tract is changed by different states of the articulators, then the phoneme describing the produced speech is non-continuant. Phonemes can be grouped based on their articulatory properties or frequency characteristics, as shown in Fig. 2.7. The detailed description of the different phonemes is given in Appendix A.

2.3 Speech Feature Extraction Techniques

Speech processing on a digital computer requires sampling and storing of the analog speech signal generated using the microphone. Thus, at first analog speech signal is converted to a digital signal using an analogue-to-digital converter (ADC). For speech processing applications, the common sampling rates used are 8 kHz and 16 kHz.

However, the majority of the distinguishable information in the speech lies within the 8 kHz bandwidth [29]. Thus, a 16 kHz sampling rate is widely used to ensure the Nyquist sampling criterion.

The purpose of the feature extraction module is to convert the speech waveform to some parametric representation. The main steps involved in feature extraction module are preprocessing, frame blocking and windowing, feature extraction and normalization.

Figure 2.8: Feature extraction steps


2.3.1 Preprocessing

The objective of preprocessing is to condition the speech signal so that it is more suitable for feature extraction analysis. An acoustic speech signal tends to have less energy at high frequencies than at low frequencies [30], yet it is essential to utilize the information over the entire speech frequency spectrum. A first-order high-pass finite impulse response (FIR) filter is therefore used to equalize the dynamic range of the speech signal. The most commonly used pre-emphasis filter to spectrally flatten the speech signal is given by [7]:

H(z) = 1 - \alpha z^{-1}, \quad 0.9 \le \alpha \le 1.0    (2.1)

In this work, α is set to 0.97. The output of the pre-emphasis filter, \tilde{s}(n), is related to its input, s(n), by [7]:

\tilde{s}(n) = s(n) - \alpha s(n-1)    (2.2)

where n represents the sample number.
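Equation 2.2 is a single first-order difference. A minimal NumPy sketch follows; passing the first sample through unchanged is an assumption, since the text does not say how the signal boundary is handled.

import numpy as np

def pre_emphasis(s, alpha=0.97):
    # s~(n) = s(n) - alpha * s(n-1)  (Eq. 2.2), with s~(0) = s(0) assumed.
    return np.append(s[0], s[1:] - alpha * s[:-1])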

2.3.2 Framing and Windowing

As already mentioned, the speech signal is assumed to be stationary over short intervals (between 5 and 100 ms). Therefore, the pre-emphasized speech signal \tilde{s}(n) is divided into adjacent speech frames. Each frame is N ms long, with adjacent frames separated by K ms, as shown in Fig. 2.9. Typical values of N and K are 25 ms and 10 ms respectively.

Figure 2.9: Blocking of speech into overlapping frames

Next, each frame is processed through a window function to minimize the signal discontinuities at the beginning and end of each frame. Typically, the Hamming window is applied to each frame, given by [7]:

w(l) = 0.54 - 0.46 \cos\!\left(\frac{2\pi l}{L-1}\right), \quad 0 \le l \le L-1    (2.3)

where L is the length of the window and l is the sample index.
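The framing and windowing step can be sketched as follows. Stacking the frames into a matrix and silently dropping a trailing partial frame are implementation choices assumed here, not prescribed by the text.

import numpy as np

def frame_and_window(s, fs, frame_ms=25, shift_ms=10):
    # Split the pre-emphasized signal into overlapping frames (N = 25 ms,
    # K = 10 ms shift) and apply the Hamming window of Eq. 2.3 to each frame.
    N = int(fs * frame_ms / 1000)
    K = int(fs * shift_ms / 1000)
    num_frames = 1 + (len(s) - N) // K          # assumes len(s) >= N
    frames = np.stack([s[i * K:i * K + N] for i in range(num_frames)])
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))   # Eq. 2.3
    return frames * w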

2.3.3 Feature Extraction Technique

The next step is the heart of the front-end speech processing. This module extracts relevant information from the speech frames. There is a wide variety of feature extraction techniques. In this section, we will give a brief overview of widely used techniques for the same.


Linear Predictive Coding (LPC)

The linear prediction model is based on the human speech production process. The equivalent model of the human vocal tract is shown in Fig. 2.10.

Figure 2.10: Equivalent model of human vocal tract

A speech sample s(n) at time n can be approximated as a linear combination of the past j speech samples, such that

s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \dots + a_j s(n-j)    (2.4)

where the coefficients a_i are assumed to be constant over the given speech frame. Equation 2.4 can be converted to an equality by considering an additional term, G u(n), giving [7]:

s(n) = G u(n) + \sum_{i=1}^{j} a_i s(n-i)    (2.5)

where u(n) is the normalized excitation signal and G is the gain. In the z-domain we can write

S(z) = G U(z) + \sum_{i=1}^{j} a_i z^{-i} S(z)    (2.6)

The transfer function of the system can be expressed as:

H(z) = \frac{S(z)}{G U(z)} = \frac{1}{1 - \sum_{i=1}^{j} a_i z^{-i}}    (2.7)

Equation 2.7 represents an all-pole model of the human vocal tract. The steps of LPC coefficient extraction are presented in Fig. 2.11.

Figure 2.11: LPC feature calculation

Each windowed speech frame is autocorrelated to detect the presence of a periodic signal. The autocorrelation of the windowed frame is given by

r(m) = \sum_{n=0}^{N-1-m} \tilde{x}_1(n)\, \tilde{x}_1(n+m), \quad m = 0, 1, \dots, p    (2.8)

where \tilde{x}_1(n) is the windowed frame and p is the order of the LPC analysis (typically in the range 8-16). The next step is to formulate the LPC parameter set from each frame of p+1 autocorrelations. The set might be the LPC coefficients, the reflection coefficients, the log area ratio coefficients, the cepstral coefficients, or any desired transformation of the above sets [7]. Durbin's method [31] is usually used for converting the autocorrelation coefficients to an LPC parameter set. The next step is to derive the LPC cepstral coefficients from the LPC coefficients using a recursive method, given by:

c_0 = \log G^2

c_m = a_m + \sum_{k=1}^{m-1} (k/m)\, c_k\, a_{m-k}, \quad 1 \le m \le p

c_m = \sum_{k=1}^{m-1} (k/m)\, c_k\, a_{m-k}, \quad m > p    (2.9)

where G^2 is the gain term of the LPC model. The cepstral coefficients are the Fourier transform representation of the log magnitude spectrum.
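The chain autocorrelation (Eq. 2.8) -> Durbin's recursion -> cepstral recursion (Eq. 2.9) can be sketched in NumPy as below. The LPC order p = 12, the small constant added to r(0) to avoid division by zero, and the omission of the gain term c_0 = log G^2 are assumptions made only for this illustration.

import numpy as np

def lpc_cepstra(frame, p=12):
    # LPC cepstral coefficients c_1..c_p for one windowed frame.
    N = len(frame)

    # Eq. 2.8: autocorrelation r(m), m = 0..p.
    r = np.array([np.dot(frame[:N - m], frame[m:]) for m in range(p + 1)])

    # Durbin's recursion: solve the normal equations for a_1..a_p.
    a = np.zeros(p + 1)                 # a[i] holds a_i; a[0] is unused
    E = r[0] + 1e-12                    # prediction error energy
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E    # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, E = a_new, (1.0 - k * k) * E

    # Eq. 2.9: cepstral recursion for 1 <= m <= p.
    c = np.zeros(p + 1)
    for m in range(1, p + 1):
        c[m] = a[m] + sum((k_ / m) * c[k_] * a[m - k_] for k_ in range(1, m))
    return c[1:]

Applied frame by frame to the output of the framing and windowing step above, this yields the LPC cepstral feature vectors described in the text.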

Mel Frequency Cepstral Coefficients (MFCC)

It is perhaps the most popular feature extraction module used in ASR for relatively clean speech. MFCCs [8, 32, 33] are a perceptually motivated speech representation based on the Fourier transform and mel-filterbank analysis [34, 35]. Psychoacoustic studies show that the human auditory system does not perceive speech on a linear frequency scale. The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The mapping of the mel frequency (f_mel) from the sound frequency (f_c) is given by:

f_{mel} = 2595 \log_{10}\!\left(1 + \frac{f_c}{700}\right)    (2.10)

Figure 2.12: Mapping of frequency from Hz to Mel

The nonlinear mel frequency transformation is shown in Fig. 2.12. The mel frequency filterbank consists of a series of overlapping triangular bandpass filters. The lower boundary of one filter is placed at the center frequency of the preceding filter, and the upper boundary is situated at the center frequency of the succeeding filter. The maximum response of a filter is located at its center frequency. Fig. 2.13 shows the frequency response of a 24-band mel filter bank.

Figure 2.13: 24 band mel scale filter bank

The processing steps of MFCC feature calculation are presented in Fig. 2.14. Each windowed speech frame is transformed to the frequency domain using Discrete Fourier Transform (DFT) analysis. The frequency spectrum is then passed through the mel filterbank, which produces a weighted sum of band-limited power spectrum values, given by [34]:

Y(m) = \sum_{k=0}^{N/2 - 1} |S[k]|^2\, |H_m[k]|, \quad 1 \le m \le M    (2.11)

where S[k] is the N-point FFT spectrum of the windowed speech frame, m is the filterbank number, and H_m[k] is the frequency response of the m-th filter.

Figure 2.14: MFCC feature calculation

In order to model the perceived loudness, the mel filter outputs are processed by a logarithmic function.

The filter outputs are highly correlated because adjacent filters overlap. Thus, as a final step, the Discrete Cosine Transform (DCT) is applied to decorrelate the filter outputs. The DCT has an excellent energy compaction property, which concentrates most of the signal energy in the lower few coefficients and enables a reduction in dimensionality.


The kth MFCC coefficient is calculated using [34]:

MFCC_k = \sum_{m=1}^{M} A_k\, \log\{Y(m)\}\, \cos\!\left(\frac{\pi k (m - 0.5)}{M}\right), \quad 1 \le k \le p    (2.12)

where A_0 = \sqrt{1/M} for k = 0 and A_k = \sqrt{2/M} for 1 \le k \le p.

The zeroth order coefficient preserves the measure of the signal energy and is often included in the feature vector as phonemes tend to have differing energy levels.
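Putting equations 2.10-2.12 together, a compact sketch of the mel filterbank construction and the per-frame MFCC computation follows. The FFT size (512), the number of cepstra (13), the rounding of band edges to FFT bins, and the 0-based form of the cosine index in Eq. 2.12 are assumptions of this sketch rather than details fixed by the text.

import numpy as np

def mel_filterbank(M=24, nfft=512, fs=16000):
    # M overlapping triangular filters with centers equally spaced on the
    # mel scale of Eq. 2.10 (24 bands, as in Fig. 2.13).
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)     # Eq. 2.10
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(fs / 2.0), M + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    H = np.zeros((M, nfft // 2))
    for m in range(1, M + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):
            H[m - 1, k] = (k - lo) / max(ctr - lo, 1)   # rising edge
        for k in range(ctr, hi):
            H[m - 1, k] = (hi - k) / max(hi - ctr, 1)   # falling edge
    return H

def mfcc(frame, H, p=13):
    # |FFT|^2 -> mel filterbank (Eq. 2.11) -> log -> DCT (Eq. 2.12).
    M, half = H.shape
    S = np.fft.rfft(frame, n=2 * half)[:half]            # N-point FFT spectrum
    Y = H @ (np.abs(S) ** 2)                             # Eq. 2.11
    logY = np.log10(Y + 1e-12)                           # perceived loudness
    m_idx = np.arange(M)                                 # m = 1..M, 0-based
    return np.array([np.sqrt(2.0 / M) *
                     np.sum(logY * np.cos(np.pi * k * (m_idx + 0.5) / M))
                     for k in range(1, p + 1)])          # Eq. 2.12, k = 1..p

Applying mfcc to each windowed frame from Section 2.3.2 gives the static cepstral vector to which the dynamic features of Section 2.3.4 are normally appended.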

Perceptual Linear Prediction (PLP)

The Perceptual Linear Prediction (PLP) [27, 36, 37] method is another perceptually motivated feature extraction technique, inspired by the psychoacoustic study of human hearing. PLP is a further improvement over the LPCC that takes advantage of three principal psychoacoustic characteristics: critical band analysis, equal loudness curve adjustment, and the intensity-loudness power law. The windowed spectrum undergoes a Bark-scale filter bank, which models the critical band frequency response of the human cochlea. PLP can be seen as a combination of DFT and LP techniques, and it is more robust than MFCC and LPC in uncontrolled environments. The steps involved in calculating PLP features² are given in Fig. 2.15.

Figure 2.15: PLP feature calculation

A non-linear frequency scale, the Bark scale, is used to calculate PLP features. The Bark filterbank consists of a series of overlapping trapezoidal band-pass filters, shown in Fig. 2.16. The center frequencies of the filterbank are distributed according to the Bark scale. The Bark axis is derived from the following formula [38]:

f_{Bark}(\omega) = 6 \ln\!\left( \frac{\omega}{1200\pi} + \left[ \left(\frac{\omega}{1200\pi}\right)^{2} + 1 \right]^{0.5} \right)    (2.13)

² Pre-emphasis is done after the critical band analysis.


Figure 2.16: Bark scale filterbank

where f_Bark is the frequency on the Bark scale, ω = 2πf is the angular frequency, and f is the frequency in Hz. The mapping from linear frequency (Hz) to the Bark scale is shown in Fig. 2.17.

Figure 2.17: Mapping of frequency from Hz to Bark

The resulting warped power spectrum is then convolved with the power spectrum of the simulated critical-band masking curve Ψ(Ω). The critical band masking curve is given in equation 2.14 [36]. The output of the filterbank reduces the frequency selectivity relative to the original spectrum.

\Psi(\Omega) =
\begin{cases}
0 & \text{for } \Omega < -1.3 \\
10^{2.5(\Omega + 0.5)} & \text{for } -1.3 \le \Omega < -0.5 \\
1 & \text{for } -0.5 \le \Omega < 0.5 \\
10^{-1.0(\Omega - 0.5)} & \text{for } 0.5 \le \Omega < 2.5 \\
0 & \text{for } \Omega \ge 2.5
\end{cases}    (2.14)

Next, the output of the filterbank is pre-emphasized by an equal-loudness curve that approximates the sensitivity of human hearing at different frequencies. Psychoacoustic studies show that human hearing is most sensitive to the mid-frequency range of the audible spectrum. The PLP technique incorporates this effect by weighting the critical-band outputs with the equal-loudness curve before applying the intensity-loudness power law.
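For reference, the Hz-to-Bark warping of Eq. 2.13 and the critical-band masking curve of Eq. 2.14 can be sketched as below. The vectorized form and the zero value outside the stated intervals follow directly from the equations; the rest of the PLP pipeline (equal-loudness weighting, cubic-root compression and the final LP analysis) is deliberately omitted from this sketch.

import numpy as np

def hz_to_bark(f):
    # Eq. 2.13 with omega = 2*pi*f, so omega/(1200*pi) = f/600.
    x = np.asarray(f, dtype=float) / 600.0
    return 6.0 * np.log(x + np.sqrt(x ** 2 + 1.0))

def masking_curve(omega):
    # Critical-band masking curve Psi(Omega) of Eq. 2.14.
    O = np.asarray(omega, dtype=float)
    psi = np.zeros_like(O)
    rising = (O >= -1.3) & (O < -0.5)
    flat = (O >= -0.5) & (O < 0.5)
    falling = (O >= 0.5) & (O < 2.5)
    psi[rising] = 10.0 ** (2.5 * (O[rising] + 0.5))
    psi[flat] = 1.0
    psi[falling] = 10.0 ** (-1.0 * (O[falling] - 0.5))
    return psi

Sampling masking_curve at the Bark-scale distances between each critical-band center and the warped frequency axis, and summing the weighted power spectrum, gives the reduced-selectivity filterbank outputs described above.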
