(1)

• If we assume the signal is piecewise stationary, we can analyze the signal using a sliding window

approach. Two key parameters are:

Frame Duration: how often we perform the analysis.

Window Duration: how many samples we use for the analysis.

• Recall we introduced similar parameters for the spectrogram. Typical values are a 10 ms frame duration and 25 ms window duration.
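As a concrete illustration, here is a minimal NumPy sketch of this sliding-window segmentation, assuming an 8 kHz sampling rate and the typical 10 ms frame / 25 ms window values quoted above (all names are illustrative):

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=10, window_ms=25):
    """Split signal x into overlapping analysis windows.

    frame_ms  -- how often we perform the analysis (hop size)
    window_ms -- how many samples we use for each analysis
    """
    hop = int(fs * frame_ms / 1000)      # 80 samples at 8 kHz
    win = int(fs * window_ms / 1000)     # 200 samples at 8 kHz
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)      # taper each analysis window

x = np.random.randn(8000)                # 1 s of dummy "speech"
print(frame_signal(x).shape)             # (98, 200): ~100 windows/sec
```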

A FRAME-BASED ANALYSIS IS ESSENTIAL

(2)

Important questions:

• How does the window duration impact the spectral resolution?

• Why is so much overlap used between successive windows?

• Why do we use a 10 ms frame duration?

(3)

Vocoders

Vocoders rely on speech-specific analysis-synthesis which is mostly based on the source-system model.

Two-state excitation (pulse/noise), voicing and pitch detection, and filter-bank spectral representation were implemented using analog components in Dudley’s channel vocoder (1939).

A vocal tract envelope representation is obtained using a bank of band-pass filters (typically between 16 and 19 filters).

The performance of vocoders generally degrades for

non-speech signals.

(4)

Dudley’s vocoder

(5)

Represents speech spectrum as the product of vocal tract and excitation spectra.

A vocal tract envelope is represented using a bank of band-pass filters (typically 16 to 19 channels).

More accurate as the number of channels increases.

The fine structure of the voiced spectrum is

represented using pitch-periodic pulse-like waves.

Unvoiced speech is reproduced using noise-like excitation.

Improvements in low-rate channel vocoders were

realized by increasing the number of channels.

(6)

The main difference between the channel vocoder and early formant vocoders is that the resonant characteristics of the filter-bank in formant

vocoders adapt to the trajectories of the formants.

More recent implementations of the formant

vocoders employ cascade and parallel resonator

configurations.

(7)

Model Based Coding

We assume an all-pole model for the vocal tract transfer function:

$$H(z) = \frac{S(z)}{X(z)} = \frac{G}{A(z)} = \frac{G}{1 - P(z)}, \qquad P(z) = \sum_{k=1}^{p} a_k z^{-k}$$

LPC coder => 100 frames/sec, 13 parameters/frame (p = 10 LPC coefficients, pitch period, voicing decision, gain) => 1300 parameters/second for coding, versus 8000 samples/sec for the waveform.

(8)

LPC Methods

LPC methods are the most widely used in speech coding, speech synthesis, speech recognition, speaker recognition and verification, and speech storage.

LPC methods provide extremely accurate estimates of speech parameters, and do so extremely efficiently.

Basic idea of Linear Prediction: current speech sample can be closely approximated as a linear combination of past samples.

$$x(n) \approx \sum_{k=1}^{p} \alpha_k\, x(n-k) \quad \text{for some values of } p \text{ and } \alpha_k$$
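A small numerical sketch of this idea, using a hypothetical second-order autoregressive signal (coefficients chosen only for illustration): each sample is predicted almost exactly from its p = 2 most recent values.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.8, -0.9])              # illustrative predictor coefficients
x = np.zeros(1000)
e = 0.01 * rng.standard_normal(1000)   # small excitation
for n in range(2, 1000):
    x[n] = a[0] * x[n-1] + a[1] * x[n-2] + e[n]

# predict each sample from the p = 2 most recent values
x_hat = a[0] * x[1:-1] + a[1] * x[:-2]
print(np.max(np.abs(x[2:] - x_hat)))   # residual is just the excitation e
```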

(9)

LPC Methods

For periodic signals with period $N_p$, it is obvious that

$$x(n) \approx x(n - N_p)$$

But that is not what LP is doing; it is estimating x(n) from the p ($p \ll N_p$) most recent values of x(n) by linearly predicting its value.

For LP, the predictor coefficients (the $\alpha_k$'s) are determined (computed) by minimizing, over a finite interval, the sum of squared differences between the actual speech samples and the linearly predicted ones.

(10)

LPC Methods

LP is based on speech production and synthesis models.

Speech can be modeled as the output of a linear, time-varying system, excited by either quasi-periodic pulses or noise; LP provides a robust, reliable and accurate method for estimating the parameters of the linear system (the combined vocal tract, glottal pulse, and radiation characteristic for voiced speech).

Linear predictive techniques are used in speech coding as part of a differential quantization scheme, concentrating on finding an adaptive predictor in DPCM to reduce the variance of the difference signal.

(11)

LPC Methods

• LP methods have been used in control and information theory, where they are called methods of system estimation and system identification.

They are used extensively in speech under a group of names including:

1. covariance method
2. autocorrelation method
3. lattice method
4. inverse filter formulation
5. spectral estimation formulation
6. maximum likelihood method
7. inner product method

(12)

Basic Principles of LP

Use a time-varying digital filter to represent the glottal pulse shape, the vocal tract impulse response, and the radiation effects, i.e.:

-System excited by an impulse train for voiced speech or a random sequence for unvoiced speech

-Already know how to estimate pitch period and V/UV

-This "all-pole model" is a natural representation for non-nasal voiced speech, but also works reasonably well for nasals

$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

(13)

LP Basic Equations

A p-th order linear predictor is a system of the form

$$\tilde{s}(n) = \sum_{k=1}^{p} \alpha_k\, s(n-k) \quad\Longleftrightarrow\quad P(z) = \frac{\tilde{S}(z)}{S(z)} = \sum_{k=1}^{p} \alpha_k z^{-k}$$

The prediction error, e(n), is of the form

$$e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} \alpha_k\, s(n-k)$$

(14)

LP Basic Equations

The prediction error is the output of a system with transfer function

$$A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} \alpha_k z^{-k}$$

If the speech signal obeys the production model exactly, and if $\alpha_k = a_k$ for $1 \le k \le p$, then $e(n) = G\,u(n)$ and A(z) is an inverse filter for H(z), i.e.

$$H(z) = \frac{G}{A(z)}$$
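A minimal sketch of this inverse-filter relation, assuming the predictor coefficients (here called alpha, chosen only for illustration) are already known: passing s(n) through A(z) yields the residual e(n), and driving G/A(z) with that residual reconstructs the signal.

```python
import numpy as np
from scipy.signal import lfilter

def prediction_error(s, alpha):
    """e(n) = s(n) - sum_k alpha_k s(n-k): pass s through A(z)."""
    b = np.concatenate(([1.0], -np.asarray(alpha)))  # A(z) coefficients
    return lfilter(b, [1.0], s)

def resynthesize(e, alpha, G=1.0):
    """H(z) = G / A(z): drive the all-pole filter with the residual."""
    b = np.concatenate(([1.0], -np.asarray(alpha)))
    return lfilter([G], b, e)

s = np.random.randn(200)            # stand-in for a speech segment
alpha = [1.3, -0.6]                 # hypothetical LP coefficients
e = prediction_error(s, alpha)
print(np.allclose(resynthesize(e, alpha), s))   # True: A(z) inverts H(z)
```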

(15)

LP Estimation Issues

Need to determine $\{\alpha_k\}$ directly from speech such that they give good estimates of the time-varying spectrum.

Need to estimate $\{\alpha_k\}$ from short segments of speech.

Need to minimize the mean-squared prediction error over short segments of speech.

Resulting $\{\alpha_k\}$ assumed to be the actual $\{a_k\}$ in the speech production model.

We intend to show that all of this can be done efficiently, reliably, and, for speech, accurately.

(16)

Solution for $\{\alpha_k\}$

The short-time average prediction error is defined as:

$$E_n = \sum_{m} e_n^2(m) = \sum_{m} \big( s_n(m) - \hat{s}_n(m) \big)^2 = \sum_{m} \Big( s_n(m) - \sum_{k=1}^{p} \alpha_k\, s_n(m-k) \Big)^2$$

Select a segment of speech $s_n(m) = s(m+n)$ in the vicinity of sample n.

The key issue to resolve is the range of m for the summation.

(17)

Solution for $\{\alpha_k\}$

Can find the values of $\alpha_k$ that minimize $E_n$ by setting:

$$\frac{\partial E_n}{\partial \alpha_i} = 0, \qquad i = 1, 2, \ldots, p$$

giving the set of equations

$$\sum_{m} s_n(m-i)\, s_n(m) = \sum_{k=1}^{p} \hat{\alpha}_k \sum_{m} s_n(m-i)\, s_n(m-k), \qquad 1 \le i \le p$$

$$\Longrightarrow\quad \sum_{m} s_n(m-i)\, e_n(m) = 0, \qquad 1 \le i \le p$$

where $\hat{\alpha}_k$ are the values of $\alpha_k$ that minimize $E_n$.

The prediction error $e_n(m)$ is orthogonal to the signal $s_n(m-i)$ for delays i of 1 to p.

(18)

Solution for $\{\alpha_k\}$

• Defining

$$\phi_n(i,k) = \sum_{m} s_n(m-i)\, s_n(m-k)$$

• We get

$$\sum_{k=1}^{p} \alpha_k\, \phi_n(i,k) = \phi_n(i,0), \qquad i = 1, 2, \ldots, p$$

leading to a set of p equations in p unknowns that can be solved in an efficient manner for the $\{\alpha_k\}$.

(19)

Solution for $\{\alpha_k\}$

The minimum mean-squared prediction error is of the form:

$$E_n = \sum_{m} s_n^2(m) - \sum_{k=1}^{p} \alpha_k \sum_{m} s_n(m)\, s_n(m-k)$$

which can be written in the form:

$$E_n = \phi_n(0,0) - \sum_{k=1}^{p} \alpha_k\, \phi_n(0,k)$$

Process:

1. Compute $\phi_n(i,k)$ for $1 \le i \le p$, $0 \le k \le p$.
2. Solve the matrix equation for $\alpha_k$.

• Need to specify the range of m used to compute $\phi_n(i,k)$.
• Need to specify $s_n(m)$.
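A sketch of this two-step process for the autocorrelation method, where $\phi_n(i,k)$ reduces to the autocorrelation $R(|i-k|)$ and the resulting Toeplitz system can be solved directly (function and parameter names are illustrative):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_autocorrelation(s_n, p=10):
    """Step 1: compute phi (here the autocorrelation R); step 2: solve for alpha."""
    w = s_n * np.hamming(len(s_n))                  # windowed segment s_n(m)
    R = np.correlate(w, w, mode='full')[len(w)-1:]  # R(0), R(1), ..., R(N-1)
    # Toeplitz system: sum_k alpha_k R(|i-k|) = R(i), i = 1..p
    alpha = solve_toeplitz(R[:p], R[1:p+1])
    E = R[0] - alpha @ R[1:p+1]                     # minimum prediction error
    return alpha, E

alpha, E = lpc_autocorrelation(np.random.randn(200), p=10)
```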

(20)
(21)
(22)

Formant Vocoder

(23)

The transfer function for voiced speech synthesis consists of a cascade of three (or, more generally, several) second-order all-pole resonators, where $F_i$ and $B_i$ denote the i-th formant frequency and bandwidth, respectively.

(24)

For unvoiced speech, the transfer function consists of a cascade of a second-order all-zero function and a second-order all-pole function.

The fixed spectral compensation function

accommodates the effects of the glottal pulse and the lip radiation.

The major difficulty in formant vocoders lies in the computation of the formants and their

bandwidths.

(25)

Homomorphic Vocoders

The basic idea in homomorphic vocoders is that

the vocal tract and the excitation log-magnitude

spectra can be combined additively to produce

the speech log-magnitude spectrum.

(26)

The inverse Fourier transform of the log-magnitude spectrum of speech produces the cepstral

sequence.

the ("quefrency") samples of the cepstrum that are near the origin are associated with the vocal tract and are extracted using a cepstral window.

The length of the cepstral window is shorter than the shortest possible pitch period.

It can be also shown that for voiced speech the

cepstral sequence has large samples at the pitch

period. Therefore the fundamental frequency can

be estimated from the cepstrum.

(27)

MBE Analysis

D. Griffin and J. Lim, "Multiband Excitation Vocoder," IEEE Trans. ASSP-36, no. 8, p. 1223, Aug. 1988.

(28)

MBE Synthesis

(29)

The Multiband Excitation (MBE) coder, proposed by Griffin and Lim, relies on a model that treats the short-time speech spectrum as the product of an excitation spectrum and a vocal tract envelope.

Uses classical two-state source-system model, the difference here is that the excitation spectrum is modeled by a

combination of harmonic and random-like contributions (i.e., voicing is frequency dependent).

This mixed modeling approach is based on the fact that the spectra of mixed sounds or noisy speech contain both voiced (harmonic) and unvoiced (random-like) regions. Consequently the spectrum is divided into sub-bands and each sub-band is declared voiced or unvoiced. The number of sub-bands is much higher than the traditional sub-band coders and can be chosen to be equal to the number of harmonics.

Multi-band Excitation

(30)

The analysis process consists of determining:

a) the pitch period,

b) the voiced and unvoiced envelope parameters,

c) the voicing information for each sub-band, and

d) selecting the voiced and unvoiced envelope parameters for each sub-band.

An integer pitch period is first estimated using an

autocorrelation-like method and a pitch tracker is used to smooth the estimate for inter-frame continuity. This is then followed by a frequency-domain pitch refinement process.

The spectral envelope is described by samples located at the harmonics of the fundamental.

For voiced harmonics the magnitude and phases of the envelope samples are determined using a least-squares process.

For unvoiced harmonics only the magnitudes are determined.

Analysis Process

(31)

Synthesis Process

The voiced portion of speech is synthesized in the time domain using a bank of harmonic sinusoids. The

amplitudes of the sinusoids are obtained from the voiced envelope samples.

The amplitudes of the sinusoids associated with unvoiced harmonics are set to zero.

The phases of the sinusoids (voiced bands) are determined using a phase prediction algorithm.

The unvoiced portion of speech segments is determined by applying the FFT on a windowed segment of white noise.

The normalized transform samples are then multiplied by the spectral envelope and unvoiced synthesis is performed using the weighted overlap/add method.
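A much-simplified sketch of the voiced branch of this synthesis (hypothetical fundamental, envelope magnitudes, and voicing decisions; phase prediction and the unvoiced overlap/add branch are omitted):

```python
import numpy as np

def synth_voiced(f0, amps, voiced, fs=8000, dur=0.02):
    """Sum of harmonic sinusoids; unvoiced harmonics get zero amplitude."""
    t = np.arange(int(fs * dur)) / fs
    s = np.zeros_like(t)
    for k, (A, v) in enumerate(zip(amps, voiced), start=1):
        if v:                                  # amplitude 0 for unvoiced bands
            s += A * np.cos(2 * np.pi * k * f0 * t)
    return s

# hypothetical frame: 100 Hz pitch, 10 harmonics, upper 3 declared unvoiced
s = synth_voiced(100.0, np.ones(10), [True] * 7 + [False] * 3)
```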

(32)

MBE parameters at 8 kbits/s

The frame rate in this implementation is 50 Hz.

The difference between the estimated and predicted phases is coded only for voiced harmonics.

When all the harmonics are unvoiced then no phase

information is coded and the extra bits are allotted to the magnitudes.

Voicing information is coded at one bit per sub-band.

The average DRT scores for this implementation were: 96.2 for noiseless speech and 58 for speech corrupted by wideband noise.

• D. Griffin and J. Lim, "Multiband Excitation Vocoder," IEEE Trans. ASSP-36, no. 8, p. 1223, Aug. 1988.

(33)

4.8 kbits/s MBE coder

Frame rate of 50 Hz.

The voiced/unvoiced decision was encoded in groups of three harmonics each and a total of 12

voiced/unvoiced decisions per frame were encoded at 1 bit per group.

For frames with more than 36 harmonics the rest of the harmonics are declared unvoiced. The pitch is encoded using a variable differential coding scheme with an average of six bits per frame.

Phase is quantized only for the first 12 voiced

harmonics using a phase prediction algorithm. The phase prediction residual is block quantized (in

groups of 3 phases) at 2 bits per phase component.

The rest of the phases, associated with voiced

harmonics, are not coded and are chosen randomly

using a uniform distribution.

(34)

Each block is then transformed using an 8-point DCT transform and the DCT coefficients are encoded using uniform quantizers. The 4.8 kbits/s coder was one of the candidates for the DOD FS1016 standard. The DRT/DAM scores reported for the 4.8 kbits/s MBE were 92.7/60.4 and its complexity was estimated at 7 MIPS.

An improved MBE (IMBE) was proposed by Hardwick and Lim.

This IMBE employs more efficient methods for quantizing the MBE model parameters.

An IMBE that operates at 6.4 kbits/sec became part of the Australian (AUSSAT) mobile satellite standard and the International Mobile Satellite (Inmarsat-M) standard.

MOS of 3.4.

DOD FS1016 standard

(35)

Historically, digital speech signals are sampled at a rate of 8000 samples/sec. Typically, each

sample is represented by 8 bits (using mu-law).

This corresponds to an uncompressed rate of 64 kbps (kbits/sec).

With current compression techniques (all of

which are lossy), it is possible to reduce the rate to 8 kbps with almost no perceptible loss in

quality.

Further compression is possible at a cost of lower quality.

All of the current low-rate speech coders are

based on the principle of linear predictive coding

(LPC).

(36)

LPC Parameter Quantization

We don’t use prediction coefficients (large dynamic range, can become unstable when quantized) => use LPC poles, PARCOR coefficients, etc.

Code LP parameters optimally using estimated pdf’s for each parameter

V/UV: 1 bit => 100 bps
Pitch period: 6 bits (uniform) => 600 bps
Gain: 5 bits (non-uniform) => 500 bps
LPC poles: 10 bits per pole (non-uniform; 5 bits for BW and 5 bits for CF of each of 6 poles) => 6000 bps
Total required bit rate: 7200 bps

No loss in quality relative to uncoded synthesis (but there is a loss relative to the original speech quality).

Quality is limited by the simple impulse/noise excitation model.

(37)

LPC Coding Refinements

Log coding of pitch period and gain

Use of PARCOR coefficients ($|k_i| < 1$) => log area ratios $g_i = \log(A_{i+1}/A_i)$, which have an almost uniform pdf with small spectral sensitivity => 5-6 bits for coding.

Can achieve 4800 bps with almost same quality as 7200 bps system above.

Can achieve 2400 bps with 20 msec frames => 50

frames/sec.

(38)

PARCOR

Definition: [PAR]tial auto [COR]relation coefficients

LPC coefficients are: a1, a2, … aP

PARCOR coefficients are: k1, k2, … kP

It is easy to compute PARCOR from LPC and vice versa.

Review

Rectangular tubes have reflection coefficients rk = (Ak+1 – Ak)/(Ak+1 + Ak)

With algebra the ratio of areas between tubes are:

Ak/Ak+1 = (1-rk)/(1+rk)

Importance

LPC is equivalent to the tube model of the vocal tract: log[Ak+1/Ak] = log[(1 - ki)/(1 + ki)]

We can adjust the LPC parameters based on PARCOR

(39)

LPC to PARCOR

Starting from the p-th order coefficients $a_j^{(p)} = a_j$, for $i = p, p-1, \ldots, 1$:

$$k_i = a_i^{(i)}, \qquad a_j^{(i-1)} = \frac{a_j^{(i)} + k_i\, a_{i-j}^{(i)}}{1 - k_i^2}, \quad 1 \le j \le i-1$$

PARCOR to LPC

For $i = 1, 2, \ldots, p$:

$$a_i^{(i)} = k_i, \qquad a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \quad 1 \le j \le i-1$$

with the final LPC coefficients given by $a_j = a_j^{(p)}$.
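A direct transcription of these two recursions into Python (a sketch; no stability checks, and the array a holds $a_1^{(i)} \ldots a_i^{(i)}$ at step i):

```python
import numpy as np

def lpc_to_parcor(a):
    """Step-down recursion: LPC a_1..a_p -> PARCOR k_1..k_p."""
    a = np.array(a, dtype=float)
    p = len(a)
    k = np.zeros(p)
    for i in range(p, 0, -1):
        k[i-1] = a[i-1]                  # k_i = a_i^(i)
        if i > 1:                        # step down to order i-1
            a = (a[:i-1] + k[i-1] * a[:i-1][::-1]) / (1 - k[i-1]**2)
    return k

def parcor_to_lpc(k):
    """Step-up recursion: PARCOR k_1..k_p -> LPC a_1..a_p."""
    a = np.zeros(0)
    for ki in k:                         # a_j^(i) = a_j^(i-1) - k_i a_{i-j}^(i-1)
        a = np.concatenate((a - ki * a[::-1], [ki]))
    return a

k = lpc_to_parcor(parcor_to_lpc([0.5, -0.3, 0.2]))
print(np.round(k, 6))                    # recovers [0.5, -0.3, 0.2]
```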

(40)

LPC-10 Vocoder

LPC 10 vocoder

US government standard

Covariance LP analysis

Bit rate

Frame rate = 44.44 frames/sec

param    k1-k4   k5-k8   k9   k10   pitch   amp   sync
# bits     5       4      3    2      7      5     1

(5 and 4 bits are per coefficient for k1-k4 and k5-k8.) Total = 54 bits/frame, bit rate = 54 × 44.44 ≈ 2400 bits/sec.

(41)

The Federal Standard FS1015: LPC-10

Transmitter

(42)

The Federal Standard FS1015: LPC-10

Receiver

(43)

In 1976 DoD recommended an LPC algorithm for secure communications at 2.4 kbps. The algorithm, known as the LPC-10, eventually became the Federal Standard FS-1015.

The LPC-10 uses a 10-th order predictor to estimate the vocal-tract parameters. The prediction parameters are estimated by solving the covariance matrix equations.

Pitch and voicing are smoothed using a dynamic programming algorithm and encoded at seven bits.

Gain information is transmitted by encoding a root mean square (RMS) parameter at five bits per frame.

The DRT and DAM score for the LPC-10 were found to be 90 and 48 respectively for noiseless speech.

Required 20 MIPS of processing power, 2 kilobytes of RAM

(44)

VQ-Based LPC Coder

Case 1: same quality as 2400 bps LPC vocoder

10-bit codebook of PARCOR vectors.

44.4 frames/sec.

8-bits for pitch, voicing, gain.

2-bit for frame synchronization.

total bit rate of 800 bps.

Case 2: same bit rate, higher quality

22 bit codebook => 4.2 million codewords to be searched.

Never achieved good quality due to computation and storage requirements.

(45)

VQ-Based LPC Coder

(46)

Mixed Excitation Models

Problems of 2 state model

voicing errors, which degrade speech quality and intelligibility; inadequacy in cases of voicing transitions (mixed voiced/unvoiced speech) or weakly voiced speech.

Pulse train (buzz) excites the low-frequency region of the LPC synthesis filter and the noise excites the high-

frequency region of the synthesis filter.

The excitation filters and gains are chosen such that the overall excitation spectrum is flat. The same time-varying cutoff frequency (fc) is used for both excitation shaping filters.

A 2400 bits/s mixed excitation LPC vocoder achieved a

DAM score of 58.9

(47)

Mixed Excitation Models

(48)
(49)

LPC-Based Speech Coders

The key problems with speech coders based on all-pole linear prediction models

inadequacy of the basic source/filter speech production model

idealization of source as either pulse train or random noise

lack of accounting for parameter correlation using a one-dimensional scalar quantization method => aided greatly by using VQ methods.

(50)
(51)

Residual Excited Linear Prediction

It is a waveform excited LP coder in which the synthesis filter is excited by the residual signal.

To reduce the bit rate the residual signal is down sampled.

Finally waveform coding is done (ADM) on the residual signal.

ADM signal and the LPC coeff. are transmitted to the receiver.

This RELP vocoder compresses the bandwidth of the residual to 800 Hz, thereby coding only the baseband of the residual at 5 kbits/s.

(52)

[Block diagram: RELP coder. s[n] passes through LPC analysis and the inverse filter 1-A(z); the residual c[n] is low-pass filtered, downsampled (R), and ADM-quantized. The decoder upsamples (R), regenerates the high frequencies, and resynthesizes with the LPC synthesis filter.]

(53)

• The first two blocks implement a PCM conversion with linear quantization that uses 13 bits/sample.

• The bit rate of the resulting stream is 8000 × 13 = 104 kbps.

The third block implements the GSM compression algorithm called Regular Pulse Excited - Long Term Prediction (RPE-LTP), which reduces the digital speech rate to 13 kbps.

(54)

• PCM coded speech at 64 kbps received from the MSC is converted into a 13-bit uniformly quantized signal having the already known 104 kbps rate.

• The RPE/LTP speech compression block reduces the

bit rate to 13 kbps.

(55)

• The Regular Pulse Excited - Long Term Prediction (RPE- LTP) speech encoder of the GSM is the result of intense development work.

• The GSM group studied several speech coding

algorithms on the basis of subjective speech quality and complexity (which is related to cost, processing delay, and power consumption once implemented) before

arriving at the choice of a Regular Pulse Excited Linear Predictive Coder (RPE-LPC) with a Long Term Predictor loop.

(56)

Speech is divided into 20-millisecond frames, each containing 104 × 20 = 2080 bits. Each frame is encoded as 260 bits, giving a total output bit rate of 260 bits / 20 ms = 13 kbps. The 260 output bits are divided into: 36 LPC bits, 36 LTP bits, and 188 RPE bits.
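The frame arithmetic spelled out as a sketch (the 13-bit PCM rate and the 260-bit split are from the text above):

```python
fs, frame_ms = 8000, 20
pcm_bits = 13
bits_in  = fs * frame_ms // 1000 * pcm_bits   # 160 samples x 13 = 2080 bits
bits_out = 36 + 36 + 188                      # LPC + LTP + RPE = 260 bits
rate_out = bits_out / (frame_ms / 1000)       # 13000 bps = 13 kbps
print(bits_in, bits_out, rate_out)
```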

(57)

GSM RPE LTP Coder

[Block diagram: GSM RPE-LTP coder. s(n) feeds LPC analysis and the LPC inverse filter; pitch analysis and a pitch inverse filter remove the long-term correlation; the residual passes through a weighting filter with RPE grid selection and an APCM quantiser. LPC parameters, pitch parameters, and the grid position are multiplexed.]

(58)

GSM RPE LTP Decoder

[Block diagram: GSM RPE-LTP decoder. The demultiplexed residual is decoded and up-sampled according to the grid position, then passed through the pitch synthesis filter and the LPC synthesis filter to produce s^(n).]

(59)

Agenda

• Generalised AbS Coding

• Multi-Pulse LPC (MPLPC)

• Code-Excited Linear Predictive

Coding (CELP)

(60)

AaS vs AbS

Analysis-and-synthesis

- Coded speech is not analysed.

- Errors accumulated from previous frames are not considered.

Analysis-by-synthesis

+ Far more successful at 4.8-9.6 kb/s.

(61)

Analysis-by-Synthesis (AbS)

1) Time-varying filter

2) Excitation signal

3) Perceptually based

minimisation

procedure

(62)

Time-varying filters

• LPC or short-term predictor (STP).

• Pitch or long-term predictor (LTP).

$$\frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}} \qquad \text{(STP)}$$

$$\frac{1}{P(z)} = \frac{1}{1 - \sum_{i=-I}^{I} b_i z^{-(D+i)}} \qquad \text{(LTP)}$$

(63)

Perceptually based minimisation procedure

The MSE is less meaningful at low bit rates.

We need an error criterion which is more in sympathy with human perception.

Use a weighting filter:

$$W(z) = \frac{A(z)}{A(z/\gamma)} = \frac{1 - \sum_{i=1}^{p} a_i z^{-i}}{1 - \sum_{i=1}^{p} a_i \gamma^i z^{-i}}, \qquad 0 \le \gamma \le 1$$

(64)

Weighting filter

$$W(z) = \frac{A(z)}{A(z/\gamma)}, \qquad A(z/\gamma) = 1 - \sum_{i=1}^{p} a_i \gamma^i z^{-i}$$
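A sketch of this weighting filter with SciPy, for hypothetical LP coefficients and a typical bandwidth-expansion factor around γ = 0.8:

```python
import numpy as np
from scipy.signal import lfilter

def perceptual_weighting(e, a, gamma=0.8):
    """W(z) = A(z) / A(z/gamma), with A(z) = 1 - sum_i a_i z^-i."""
    num = np.concatenate(([1.0], -np.asarray(a)))
    den = num * gamma ** np.arange(len(num))    # a_i -> a_i * gamma^i
    return lfilter(num, den, e)

e = np.random.randn(160)          # stand-in error signal, one subframe
w = perceptual_weighting(e, [1.3, -0.6])
```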

(65)

Weighting filter

(66)

Excitation signal

• Codebook excitation (CELP)

• Self-excitation (SELP)

• Multi-pulse LPC (MPLPC)

• Regular pulse excited LPC

(RPELPC)

(67)

Self-excitation (SELP)

CELP with an adaptive codebook

(68)

Multi Pulse LPC (MPLPC)

• One of the most important factors in generating natural- sounding speech is the excitation signal.

• Human ear is especially sensitive to pitch errors.

– Great deal of effort has been devoted to the development of accurate pitch detection algorithms.

– No matter how accurate the pitch is in a system using the LPC vocal tract filter, the use of a periodic pulse excitation that consists of a single pulse per pitch period leads to a “buzzy twang”.

• In 1982, Atal and Remde [234] introduced the idea of multipulse linear predictive coding (MP-LPC), in which several pulses

were used during each segment.

• The spacing of these pulses is determined by evaluating a number of different patterns from a codebook of patterns.

(69)

Multi Pulse LPC (MPLPC)

1. A codebook of excitation patterns is constructed.

2. Each entry in this codebook is an excitation sequence that consists of a few nonzero values separated by zeros.

3. For a segment from the speech sequence to be encoded, the encoder obtains the vocal tract filter using the LPC analysis described previously.

4. The encoder then excites the vocal tract filter with the entries of the codebook.

5. The difference between the original speech segment and the synthesized speech is fed to a perceptual weighting filter, which weights the error using a perceptual weighting criterion.

6. The codebook entry that generates the minimum average weighted error is declared to be the best match.

7. The index of the best-match entry is sent to the receiver along with the parameters for the vocal tract filter.

(70)

Multi-Pulse LPC Encoder

Pulse Positions + Amplitudes

+

-

Error Minimisation

LPC Inverse Filter

Long-term correlation

Analysis

Long-term Predictor Short-term

Predictor

Excitation Generator LPC

Analysis

+

M U X

s(n)

LPT Parameters

LPC Parameters

Local Decoder

(71)

Multi-Pulse LPC Decoder

Long-term Predictor

Short-term Predictor Excitation

Generator

D E M U X

LPT Parameters

LPC Parameters Pulse Positions

+ Amplitudes

s

^

(n)

(72)

Regular Pulse Excitation LPC (RPELPC)

• The pulse positions are predefined in a structured manner.

• Less computationally expensive.

• Some performance loss.

(73)

Search Methods

• Pulses are optimized one by one

• Improvements:

1. Reoptimize the amplitudes when the last pulse is determined.

2. Reoptimize the amplitudes after each

pulse determination.

(74)

Frame Size and Number of Pulses

• Large frames and more pulses for better performance.

• Small frames and fewer pulses for less computation.

• Around 40 samples per frame.

• 5 pulses per 4-5 ms.

(75)

MPLPC

[Figure: MPLPC waveforms: original speech, multipulse excitation, synthesized speech, and the error signal.]

(76)

LTP in MPLPC

• LTP improves performance at:

– Low bit rates.

– In voiced regions with high pitch.

• Without LTP, the dominant pitch pulses are not adequately modeled by the excitation signal.

(77)

Code-Excited Linear Predictive Coding (CELP)

(78)

FS1016 CELP

• Speech in the FS1016 CELP is sampled at 8 kHz and segmented in frames of 30ms.

• Each frame is segmented in subframes of 7.5ms.

• The excitation in this CELP is formed by combining vectors from an adaptive and a stochastic codebook with gains $g_a$ and $g_s$ respectively (gain-shape VQ).

• The excitation vectors are selected in every sub- frame by minimizing the perceptually weighted error measure.

• The codebooks are searched sequentially

starting with the adaptive codebook.

(79)

CELP Cont…

• The adaptive codebook contains the history of past excitation signals and the LTP lag search is carried over 128 integer and 128 non-integer delays.

• The stochastic codebook contains 512 sparse and overlapping codevectors.

• These entries are generated using a Gaussian random number generator, the output of which is quantized to −1, 0, or 1.

• If the input is less than −1.2, it is quantized to −1; if it is greater than 1.2, it is quantized to 1; and if it lies between −1.2 and 1.2, it is quantized to 0.

• Each codevector consists of sixty samples and each sample is ternary valued (1,0,-1) to allow for fast convolution.

• The codebook entries are adjusted so that each entry differs

from the preceding entry in only two places. This structure helps

reduce the search complexity.
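A sketch of how such a codebook can be constructed: Gaussian samples are center-clipped to ternary values at ±1.2, and entries overlap with a shift of two samples so that each codevector differs from the preceding one in only two places (the shift-by-two layout is an assumption consistent with the description above):

```python
import numpy as np

def fs1016_like_codebook(n_codes=512, dim=60, shift=2, seed=0):
    """Ternary, overlapped stochastic codebook (sketch of the FS1016 scheme)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(dim + shift * (n_codes - 1))
    tern = np.where(g < -1.2, -1, np.where(g > 1.2, 1, 0))  # center clip
    # codevector j is a length-60 slice; adjacent entries overlap heavily
    return np.stack([tern[j * shift : j * shift + dim] for j in range(n_codes)])

cb = fs1016_like_codebook()
print(cb.shape, np.unique(cb))   # (512, 60) [-1 0 1]
```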

(80)

CELP Cont…

• The adaptive codebook consists of the excitation vectors from the previous frame.

• Each time a new excitation vector is obtained, it is added to the codebook.

• Ten short-term prediction parameters are encoded as LSPs on a frame-by-frame basis.

• Subframe LSPs are obtained by applying linear

interpolation of frame LSPs.

(81)

CELP Cont…

• The zeros of P and Q lie on the unit circle in the complex plane.

• The zeros of P alternate with those of Q as we travel around the circle.

• As the coefficients of P and Q are real, the zeros occur in conjugate pairs

• A short-term pole-zero postfilter is also part of the standard.

• The computational complexity of the FS1016 CELP was

estimated at 16 MIPS (for partially searched codebooks)

and the DRT and MOS scores were reported to be 91.5

and 3.2 respectively.

(82)

4.8kbps DoD CELP Encoder FS 1016

[Block diagram: FS1016 4.8 kbps DoD CELP encoder. Adaptive-codebook and fixed-codebook vectors are scaled by their gains and summed to excite the LPC synthesis filter; the indices and gains giving minimum weighted error against the input are selected, and the quantized LPC and excitation parameters are multiplexed.]

(83)

4.8kbps DoD CELP Decoder

[Block diagram: FS1016 CELP decoder. The demultiplexed adaptive- and fixed-codebook contributions are scaled and summed, passed through the LPC synthesis filter (with LPC stability check and correction) and adaptive-gain nonlinear smoothing, and finally through a post-filter.]

(84)

CELP coders

• Speech quality in CELP coders can be enhanced by applying post-filtering on the speech sequence.

Post-filters are used to emphasize the formant and the pitch structure of speech and a typical post-filtering configuration consists of a cascade of long- and short-term weighting filters with appropriate gains and bandwidth expansion parameters.

• One of the disadvantages of the original CELP algorithm is the large computational effort required for the codebook search.

In fact, many of the CELP algorithms require processors

capable of executing more than 20 MIPS and codebook

storage of the order of 40 kbytes.

(85)

Vector Sum Excited Linear Prediction (VSELP)

(86)

VSELP

• The Vector Sum Excited Linear Prediction (VSELP) algorithm was proposed by Gerson and Jasiuk for use in digital cellular and mobile communications. An 8 kbits/s VSELP algorithm was adopted for the North American Digital Cellular System.

• The 8 kbits/s VSELP algorithm uses highly structured codebooks which are tailored for reduced computational complexity and increased robustness to channel errors.

• The VSELP excitation is derived by combining excitation

vectors from three codebooks, namely, a pitch-adaptive

codebook and two highly structured stochastic codebooks.

(87)

VSELP Cont…

• Speech sampled at 8 kHz is first pre-processed using a fourth-order high-pass Chebyshev filter.

• The frame in the VSELP algorithm is 20 ms long and each frame is divided into four 5ms sub-frames.

• A 10-th order short-term synthesis filter is used and its coefficients are encoded as reflection coefficients once per frame with bit allocation {6/5/5/4/4/3/3/3/3/2}.

• Sub-frame LPC parameters are obtained through linear

interpolation. The excitation parameters are updated

every 5ms. The excitation is coded using gain-shape

vector quantizers.

(88)

VSELP Cont…

• The codebooks are searched sequentially and the codevectors are determined using closed-loop perceptually weighted MSE minimization.

• The long-term prediction lag (adaptive codebook) is searched first, assuming no input from the stochastic codebooks. The adaptive codebook is fully searched for lags of 20 to 146 (127 codes) and the 128th code is used to indicate that the LTP is not used.

• The complexity of the 8 kbits/s VSELP was reported to be more than 13.5 MIPS (typical 20 MIPS) and the MOSs reported were 3.45 (low) and 3.9 (high).

• A 6.7 kbits/s VSELP algorithm was adopted for the Japanese

digital cellular standard and VSELP algorithms are candidates for

the half-rate North American and the GSM cellular standards.

(89)

Coding Delay

• One of the problems in speech compression methods which utilize delayed-decision coders is that coding gain is achieved at the expense of coding delay.

• The one-way delay is basically the time elapsed from the instant a speech sample arrived at the encoder to the instant that this sample appears at the output of the decoder.

• This definition of one-way delay does not include channel- or modem-related delays.

• The delay is basically attributed to data buffering, processing, and generation of coding bits (channel symbols).

• For example, a typical CELP algorithm with 20ms frames is

associated with a delay of about 60ms.

(90)

Low-Delay CELP (G.728) Encoder

(91)

Low-Delay CELP (G.728) Decoder

(92)

LD-CELP

• The low-delay CELP coder achieves low one-way delay by:

– a) using a backward-adaptive predictor, and
– b) using short excitation vectors (5 samples).

• In backward-adaptive prediction, the LP parameters are determined by operating on previously quantized speech samples which are also available at the decoder.

• The LD-CELP algorithm does not utilize LTP. Instead, the order of the short-term predictor is increased to fifty to

compensate for the lack of a pitch loop.

• The frame-size in LD-CELP is 2.5 ms and the sub-frames are 0.625ms long. The parameters of the 50-th order

predictor are updated every 2.5 ms. The LD-CELP uses

gain-shape VQ for the excitation.

(93)

LD-CELP Cont…

• The codebook consists of a 3-bit gain and 7-bit shape.

• A backward-adaptive excitation gain is also used.

The gain information is obtained from previously quantized excitation using a 10th-order predictor which operates on logarithmic gains.

• The gain is updated for every vector (0.625 ms) and

the coefficients of the log-gain predictor are updated

every 2.5 ms. The perceptual weighting filter is based

on 10th order LP operating directly on unquantized

speech and is updated every 2.5 ms.

(94)

LD-CELP Cont…

• The transfer function of the weighting filter is more general than the one introduced for conventional analysis-by-synthesis linear predictive coders.

• Improvements in the perceptual quality were realized for suitable choices of the weighting parameters, and in order to limit the buffering delay in LD-CELP only 0.625 ms of speech data are buffered at a time.

(95)

LD-CELP Cont…

• LD-CELP utilizes adaptive short- and long-term postfilters to emphasize the pitch and formant structures of speech. The single-tap long-term postfilter requires knowledge of the pitch which is estimated at the decoder from the decoded speech.

The short-term postfilter combines a 10-th order rational polynomial with a first-order all-zero term.

• The 10-th order rational polynomial emphasizes the formant structure of speech while the first-order all-zero term compensates for the spectral tilt. The parameters of the rational polynomial are obtained as a by-product of the 50-th order recursive analysis process and are updated every 2.5 ms.

(96)

LD-CELP Cont…

• The one-way delay of the LD-CELP is less than 2 ms and MOSs as high as 3.93 and 4.1 were obtained.

• The speech quality of the LD-CELP was judged to be equivalent or better than the G.721 standard even after three asynchronous tandem encodings.

• The coder was also shown to be capable of handling voiceband modem signals at rates as high as 2400 baud (provided that perceptual weighting is not used).

• The coder was implemented on the AT&T DSP32C

processor and the complexity and memory

requirements were found to be: 10.6 MIPS and 12.4

kbytes for the encoder and 8.06 MIPS and 13.8 kbytes

for the decoder.

(97)

Adaptive Transform Coding (ATC)

• It performs the block transformation of the windowed speech segment.

• Each segment is represented by a set of transformed coefficients which are separately quantized.

• At the receiver the quantized coefficients are inverse transformed to get the replica of the original segment.

• Adjacent segments are then joined together to form the synthesized speech.

• The N-point transform coding gain over PCM is given as:

$$G_{TC} = \frac{\frac{1}{N}\sum_{j=1}^{N} \sigma_j^2}{\Big( \prod_{j=1}^{N} \sigma_j^2 \Big)^{1/N}}$$

where $\sigma_j^2$ are the variances of the N transformed coefficients.
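A numerical sketch of this gain formula (the ratio of the arithmetic to the geometric mean of the coefficient variances), using hypothetical variances:

```python
import numpy as np

def transform_coding_gain(var):
    """G_TC = arithmetic mean / geometric mean of coefficient variances."""
    var = np.asarray(var, dtype=float)
    return var.mean() / np.exp(np.log(var).mean())

print(transform_coding_gain(np.ones(8)))                        # flat: gain 1
print(transform_coding_gain([8, 4, 2, 1, .5, .25, .12, .06]))   # non-flat: > 1
```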

(98)

ATC Encoder

[Block diagram: ATC encoder. Buffered speech is frequency-transformed, quantized, and encoded; spectral side information is computed and quantized, driving bit assignment, step-size computation, and interpolation; coded data and side information are multiplexed.]

(99)

ATC Decoder

[Block diagram: ATC decoder. The demultiplexed data are decoded using the bit assignment and step sizes derived from the side information, inverse transformed, and buffered to reconstruct the signal.]

(100)

DFT vs. DCT

[Figure: basis functions of the 1D-DFT (real and imaginary parts, u = 0 to 7) compared with the 1D-DCT basis functions (u = 0 to 7).]

(101)

1-D Discrete Cosine Transform (DCT)

• Transform matrix A:
  – $a(k,n) = \alpha(0)$ for $k = 0$
  – $a(k,n) = \alpha(k)\cos\frac{(2n+1)\pi k}{2N}$ for $k > 0$

• A is real and orthogonal:
  – rows of A form an orthonormal basis
  – A is not symmetric!
  – DCT is not the real part of the unitary DFT!

$$Z(k) = \alpha(k) \sum_{n=0}^{N-1} z(n) \cos\frac{(2n+1)\pi k}{2N}, \qquad z(n) = \sum_{k=0}^{N-1} \alpha(k)\, Z(k) \cos\frac{(2n+1)\pi k}{2N}$$

$$\alpha(0) = \sqrt{1/N}, \qquad \alpha(k) = \sqrt{2/N} \;\;(k \neq 0)$$

(102)

1-D DCT

[Figure: 1-D DCT example: the DCT basis vectors, the transform coefficients Z(k) of a sample signal z(n), and reconstructions of the signal from coefficients u = 0 to 1 through u = 0 to 7.]

(103)

DCT

The DCT used in ATC is very close to the optimum Karhunen-Loève Transform (KLT).

The main advantage of the DCT over the KLT is that it is signal independent.

It has even symmetry, which helps to minimise the edge effect.

$$S[k] = c[k] \sum_{n=0}^{N-1} s[n] \cos\frac{(2n+1)\pi k}{2N}, \qquad k = 0, 1, \ldots, N-1$$

$$s[n] = \frac{1}{N} \sum_{k=0}^{N-1} c[k]\, S[k] \cos\frac{(2n+1)\pi k}{2N}, \qquad n = 0, 1, \ldots, N-1$$

with $c[k] = 1$ for $k = 0$ and $c[k] = \sqrt{2}$ otherwise.
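These pairs correspond, up to normalization convention, to the DCT-II and its inverse as implemented in SciPy; a minimal sketch:

```python
import numpy as np
from scipy.fft import dct, idct

s = np.random.randn(256)                 # a windowed speech segment
S = dct(s, type=2, norm='ortho')         # forward DCT-II
s_rec = idct(S, type=2, norm='ortho')    # inverse DCT-II
print(np.allclose(s, s_rec))             # True: perfect reconstruction
```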

(104)

DCT cont.

• The number of bits assigned for coding the N transform coefficients is determined by the minimum mean square criterion.

• It tries to exploit the non-flat spectral characteristic of speech signal.

• The N transform coefficients are grouped into 8-16 groups (sub-bands).

• The bit allocation in each band may be fixed or adaptive.

• An adaptive or dynamic bit allocation scheme attempts to distribute the bits by assigning them to the sub-bands according to their energy over a short segment of 10-30 ms.
