
Lecture 25: Speech synthesis (Concluding lecture)



(1)

Instructor: Preethi Jyothi Nov 6, 2017


Automatic Speech Recognition (CS753)

Lecture 25: Speech synthesis (Concluding lecture)


(2)

Recall: SPSS framework

Training

Estimate the acoustic model λ̂ given speech utterances (O) and word sequences (W):

λ̂ = arg max_λ p(O | W, λ)

Synthesis

Find the most probable ô from λ̂ and a given word sequence w to be synthesised:

ô = arg max_o p(o | w, λ̂)

Synthesize speech from ô

[SPSS pipeline diagram: speech → Speech Analysis → O; text → Text Analysis → W; (O, W) → Train Model → λ̂; λ̂ plus text analysis of w → Parameter Generation → ô; ô → Speech Synthesis → waveform]

(3)

Synthesis using duration models

DURATION MODELING FOR HMM-BASED SPEECH SYNTHESIS

Takayoshi Yoshimura†, Keiichi Tokuda†, Takashi Masuko††, Takao Kobayashi†† and Tadashi Kitamura†

† Department of Computer Science

Nagoya Institute of Technology, Nagoya, 466-8555 JAPAN

†† Interdisciplinary Graduate School of Science and Engineering Tokyo Institute of Technology, Yokohama, 226-8502 JAPAN

ABSTRACT

This paper proposes a new approach to state duration modeling for HMM-based speech synthesis. A set of state durations of each phoneme HMM is modeled by a multi-dimensional Gaussian distribution, and duration models are clustered using a decision-tree-based context clustering technique. In the synthesis stage, state durations are determined by using the state duration models. In this paper, we take account of contextual factors such as stress-related factors and locational factors in addition to phone identity factors. Experimental results show that we can synthesize good quality speech with natural timing, and the speaking rate can be varied easily.

1. INTRODUCTION

For any text-to-speech synthesis system, controlling the timing of events in the speech signal is one of the difficult problems, since there are many contextual factors (e.g., phone identity factors, stress-related factors, locational factors) that affect timing. Furthermore, some factors affecting duration interact with one another. Recently, several approaches to controlling timing using statistical models have been proposed, such as linear regression [1], tree regression [2], MSR [3], which extends both linear and tree regressions, and the sums-of-products model [4]. By using these techniques, rhythm and tempo of speech were successfully controlled with a small number of free parameters.

On the other hand, we have proposed an HMM-based speech synthesis system in which the sequence of spectra is modeled by phoneme HMMs [5]. This synthesis system can synthesize speech with various voice characteristics by using a speaker adaptation technique [6], [7] or a speaker interpolation technique [8].

In this paper, we propose a new approach to controlling rhythm and tempo for the HMM-based speech synthesis system. In this approach, rhythm and tempo are controlled by state duration densities. The state durations of each phoneme HMM are modeled by a multi-dimensional Gaussian distribution. Duration models are clustered using a decision-tree-based context clustering technique [10]. In the synthesis stage, state durations which maximize the state duration probability are determined from the state duration models and the total length of speech.

Since state durations are modeled by continuous distributions, our approach has the following advantages:

• The speaking rate of synthetic speech can be varied easily.

• There is no need for label boundaries when appropriate initial models are available, since the state duration densities are estimated in the embedded training stage of the phoneme HMMs.

[Figure 1: Speech synthesis system — context-dependent HMMs (mel-cepstrum) and context-dependent state duration models are trained; at synthesis time, the input text together with T or ρ determines state durations, the sentence HMM generates mel-cepstra and pitch, and an MLSA filter produces the synthetic speech.]

• Speaker individuality of synthetic speech can be varied by applying a speaker adaptation technique or a speaker interpolation technique to the HMMs and their state duration models.

In the following, we summarize the HMM-based speech synthesis system and describe the technique for state duration modeling in Sections 2 and 3, respectively. Experimental results and discussions are also given in Section 4.

2. HMM-BASED SPEECH SYNTHESIS SYSTEM

The synthesis part of the HMM-based text-to-speech synthesis system is shown in Fig. 1.

HMMs and their duration models are context-dependent models, where contextual factors which affect both spectra and state durations are taken into account.

In the training part, first, mel-cepstral coefficients are obtained from the speech database using a mel-cepstral analysis technique [9], and delta coefficients are also calculated. Context-dependent HMMs are trained using the obtained coefficients. Using a decision-tree-based context clustering technique [10], states of the context-dependent HMMs are clustered, and the tied context-dependent HMMs are …

Use delta features for smooth trajectories

Image from Yoshimura et al., “Duration modelling for HMM-based speech synthesis”, ICSLP ‘98
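To make the synthesis-stage step above concrete (state durations chosen to maximize the duration probability given a total length), here is a minimal Python sketch. It assumes the widely used closed-form solution for Gaussian duration densities, d_k = m_k + ρ·σ_k², with ρ either fixed directly (speaking-rate control) or solved from a target total length; the function name and toy numbers are illustrative, not taken from the paper.

```python
import numpy as np

def state_durations(means, variances, total_frames=None, rho=None):
    """Pick state durations from Gaussian duration models (sketch).

    Either a target total length (total_frames) or a speaking-rate
    parameter rho is given; rho > 0 slows speech down, rho < 0 speeds it up.
    Durations maximise the Gaussian duration log-likelihood subject to the
    total-length constraint, which gives d_k = m_k + rho * var_k.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if rho is None:
        # Solve sum_k (m_k + rho * var_k) = total_frames for rho.
        rho = (total_frames - means.sum()) / variances.sum()
    durations = means + rho * variances
    # Durations must be whole, positive frame counts.
    return np.maximum(np.rint(durations).astype(int), 1)

# Example: three-state phoneme model, stretched to 40 frames in total.
print(state_durations(means=[8, 15, 10], variances=[4, 9, 4], total_frames=40))
```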

(4)

Transforming voice characteristics

We studied speaker adaptation techniques for ASR

Maximum a posteriori (MAP) estimation

Maximum Likelihood Linear Regression

(MLLR)

… and emotions. Although the combination of unit-selection and voice-conversion (VC) techniques (Stylianou et al., 1998) can alleviate this problem, high-quality voice conversion is still problematic. Furthermore, converting prosodic features is also difficult. However, we can easily change voice characteristics, speaking styles, and emotions in statistical parametric synthesis by transforming its model parameters. There have been four major techniques to accomplish this, i.e., adaptation, interpolation, eigenvoice, and multiple regression.

Adaptation (mimicking voices)

Techniques of adaptation were originally developed in speech recognition to adjust general acoustic models to a specific speaker or environment to improve the recognition accuracy (Leggetter and Woodland, 1995; Gauvain and Lee, 1994).

These techniques have also been applied to HMM-based speech synthesis to obtain speaker-specific synthesis systems with a small amount of speech data (Masuko et al., 1997; Tamura et al., 2001). Two major techniques in adaptation are maximum a posteriori (MAP) estimation (Gauvain and Lee, 1994) and maximum likelihood linear regression (MLLR) (Leggetter and Woodland, 1995).

MAP estimation involves the use of prior knowledge about the distributions of model parameters. Hence, if we know what the parameters of the model are likely to be (before observing any adaptation data) using prior knowledge, we might well be able to make good use of the limited amount of adaptation data.

The MAP estimate of an HMM, λ, is defined as the mode of the posterior distribution of λ, i.e.,

λ̂ = arg max_λ p(λ | O, W)   (20)
  = arg max_λ p(O, λ | W)   (21)
  = arg max_λ p(O | W, λ) · p(λ),   (22)

where p(λ) is the prior distribution of λ. A major drawback of MAP estimation is that every Gaussian distribution is individually updated. If the adaptation data are sparse, then many of the model parameters will not be updated. This causes the speaker characteristics of synthesized speech to often switch between the general and target speakers within an utterance. Various attempts have been made to overcome this, such as vector field smoothing (VFS) (Takahashi and Sagayama, 1995) and structured MAP estimation (Shinoda and Lee, 2001).
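As a rough illustration of how Eq. (22) plays out for a single Gaussian mean, here is a minimal sketch of the usual MAP mean update with a conjugate prior. The helper name, the fixed prior weight tau, and the toy data are assumptions for illustration, not the exact formulation used in the papers cited above.

```python
import numpy as np

def map_adapt_mean(prior_mean, occupancies, frames, tau=10.0):
    """MAP update of one Gaussian mean (a minimal sketch).

    prior_mean  : mean of the general (prior) model, shape (D,)
    occupancies : per-frame state occupancy probabilities gamma_t, shape (T,)
    frames      : adaptation observations o_t, shape (T, D)
    tau         : prior weight; larger tau keeps the mean closer to the prior.

    With little adaptation data (sum of gamma_t small) the update stays near
    the prior mean, which is why Gaussians unseen in the adaptation data are
    left essentially unchanged.
    """
    gamma_sum = occupancies.sum()
    weighted_obs = occupancies @ frames          # sum_t gamma_t * o_t
    return (tau * prior_mean + weighted_obs) / (tau + gamma_sum)

# Toy example: 5 frames of 3-dimensional adaptation features.
rng = np.random.default_rng(0)
o = rng.normal(size=(5, 3))
gamma = np.ones(5)
print(map_adapt_mean(prior_mean=np.zeros(3), occupancies=gamma, frames=o))
```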

Adaptation can also be accomplished by using MLLR, and Fig. 6 gives an overview of this. In MLLR, a set of linear transforms is used to map an existing model set into a new adapted model set such that the likelihood of the adaptation data is maximized. The state-output distributions⁸ of the adapted model set are obtained as

b_j(o_t) = N(o_t ; μ̂_j, Σ̂_j),   (23)
μ̂_j = A_{r(j)} μ_j + b_{r(j)},   (24)
Σ̂_j = H_{r(j)} Σ_j H_{r(j)}^⊤,   (25)

⁸ The state-duration distributions can also be adapted in the same manner (Yamagishi and Kobayashi, 2007).

[Figure 6: Overview of the linear-transformation-based adaptation technique — a general model is mapped to a transformed model by linear transforms shared within regression classes.]

where μ̂_j and Σ̂_j correspond to the linearly transformed mean vector and covariance matrix of the j-th state-output distribution, and A_{r(j)}, H_{r(j)}, and b_{r(j)} correspond to the mean linear-transformation matrix, the covariance linear-transformation matrix, and the mean bias vector for the r(j)-th regression class. The state-output distributions are usually clustered by a regression-class tree, and transformation matrices and bias vectors are shared among state-output distributions clustered into the same regression class (Gales, 1996). By changing the size of the regression-class tree according to the amount of adaptation data, we can control the complexity and generalization abilities of adaptation. There are two main variants of MLLR.

If the same transforms are trained for A and H, this is called constrained MLLR (or feature-space MLLR); otherwise, it is called unconstrained MLLR (Gales, 1998). For cases where adaptation data are limited, MLLR is currently a more effective form of adaptation than MAP estimation. Furthermore, MLLR offers adaptive training (Anastasakos et al., 1996; Gales, 1998), which can be used to estimate “canonical” models for training general models. For each training speaker, a set of MLLR transforms is estimated, and then the canonical model is estimated given all these speaker transforms. Yamagishi applied these MLLR-based adaptive training and adaptation techniques to HMM-based speech synthesis (Yamagishi, 2006). This approach is called average voice-based speech synthesis (AVSS).

It could be used to synthesize high-quality speech with the target speaker's voice characteristics using only a few minutes of the target speaker's speech data (Yamagishi et al., 2008b). Furthermore, even if hours of the target speaker's speech data were used, AVSS could still synthesize speech of equal or better quality than speaker-dependent systems (Yamagishi et al., 2008c). Estimating linear-transformation matrices based on the MAP criterion (Yamagishi et al., 2009) and combining MAP estimation and MLLR have also been proposed (Ogata et al., 2006).

The use of the adaptation technique to create new voices makes statistical parametric speech synthesis more attractive.

Usually, supervised adaptation is undertaken in speech synthesis, i.e., correct context-dependent labels that are transcribed manually or annotated automatically from texts and audio files are used for adaptation. As described in Section 3.1, phonetic, prosodic and linguistic contexts are used in speech synthesis.

Can also be applied to speech synthesis

MLLR: estimate a set of linear transforms that map an existing model into an adapted model s.t. the likelihood of the adaptation data is maximized

For limited adaptation data, MLLR is more effective than MAP

Image from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009
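The slide's summary of MLLR corresponds to Eqs. (23)–(25): each regression class shares one affine transform of the means and one transform of the covariances. Below is a minimal sketch of applying such a transform to a single Gaussian; estimating A, b, and H from the adaptation data (the ML/EM step) is not shown, and all names and numbers are illustrative.

```python
import numpy as np

def mllr_transform(mean, cov, A, b, H):
    """Apply an (unconstrained) MLLR transform to one Gaussian (sketch).

    mean, cov : parameters of the general model's state-output Gaussian
    A, b      : mean transform matrix and bias for this regression class
    H         : covariance transform matrix for this regression class
    Returns the adapted mean and covariance as in Eqs. (24)-(25).
    """
    adapted_mean = A @ mean + b
    adapted_cov = H @ cov @ H.T
    return adapted_mean, adapted_cov

# Toy example with 2-dimensional features.
mu = np.array([1.0, -0.5])
sigma = np.diag([0.2, 0.3])
A = np.array([[1.1, 0.0], [0.0, 0.9]])
b = np.array([0.05, -0.02])
H = np.eye(2)
print(mllr_transform(mu, sigma, A, b, H))
```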

(5)

Transforming voice characteristics

What if no adaptation data is available?

HMM parameters can be interpolated

Synthesize speech with varying voice characteristics not encountered 
 during training

[Figure 7: Space of speaker individuality modeled by HMM sets {λ_i}; I(λ, λ_i) denotes the interpolation ratio between the interpolated model λ and each representative set λ_i.]

The use of such rich contexts makes unsupervised adaptation very difficult, because generating context-dependent labels through speech recognition is computationally infeasible and likely to produce very inaccurate labels. King et al. proposed a simple but interesting solution to this problem by only using phonetic labels for adaptation (King et al., 2008). They evaluated the performance of this approach and reported that unsupervised adaptation degraded intelligibility, but similarity to the target speaker and naturalness of the synthesized speech were less severely impacted.

Interpolation (mixing voices)

The interpolation technique enables us to synthesize speech with untrained voice characteristics. The idea of using interpolation was first applied to voice conversion, where pre-stored spectral patterns were interpolated among multiple speakers (Iwahashi and Sagisaka, 1995). It was also applied to HMM-based speech synthesis, where HMM parameters were interpolated among some representative HMM sets (Yoshimura et al., 1997). The main difference between Iwahashi and Sagisaka's technique and Yoshimura et al.'s one was that, as each speech unit was modeled by an HMM, mathematically well-defined statistical measures could be used to interpolate the HMMs. Figure 7 illustrates the idea underlying the interpolation technique, whereby we can synthesize speech with various voice characteristics (Yoshimura et al., 1997), speaking styles (Tachibana et al., 2005), and emotions not included in the training data.
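A minimal sketch of the mixing idea: interpolate corresponding Gaussians of several representative HMM sets with ratios I(λ, λ_i). Yoshimura et al. consider several interpolation criteria; the simple weighted average of means and covariances below is just one possible choice, and the names and toy values are illustrative.

```python
import numpy as np

def interpolate_gaussians(means, covs, weights):
    """Interpolate corresponding Gaussians from several HMM sets (sketch).

    means   : list of mean vectors, one per representative HMM set lambda_i
    covs    : list of covariance matrices, one per set
    weights : interpolation ratios a_i, assumed non-negative

    Taking the weighted average of the means and covariances of the
    corresponding distributions, repeated for every state, yields an
    interpolated HMM set with "in-between" voice characteristics.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mean = sum(wi * np.asarray(mi) for wi, mi in zip(w, means))
    cov = sum(wi * np.asarray(ci) for wi, ci in zip(w, covs))
    return mean, cov

# Mix two "speakers" 70/30.
m, c = interpolate_gaussians(
    means=[np.array([1.0, 0.0]), np.array([-1.0, 0.5])],
    covs=[np.eye(2), 2 * np.eye(2)],
    weights=[0.7, 0.3],
)
print(m, c)
```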

Eigenvoice (producing voices)

Although we can mimic voice characteristics, speaking styles, or emotions using only a few utterances with the adaptation technique, we cannot obtain adapted models if no adaptation data are available. The use of the interpolation technique enables us to obtain various new voices by changing the interpolation ratio between representative HMM sets even if no adaptation data are available. However, if we increase the number of representative HMM sets to enhance the capabilities of representation, it is difficult to determine the interpolation ratio to obtain the required voice. To address this problem,

Figure 8: Space of speaker individuality represented by super-vectors of HMM sets.

Shichiri et al. applied the eigenvoice technique (Kuhn et al., 2000) to HMM-based speech synthesis (Shichiri et al., 2002).

A speaker-specific “super-vector” was composed by concatenating the mean vectors of all state-output distributions in the model set for each of S speaker-dependent HMM sets. By applying principal component analysis (PCA) to the S super-vectors {s_1, . . . , s_S}, we obtain eigen-vectors and eigen-values. By retaining lower-order eigen-vectors (larger eigen-values) and ignoring higher-order ones (smaller eigen-values), we can efficiently reduce the dimensionality of the speaker space, because low-order eigen-vectors often contain the dominant aspects of the given data. Using the first K eigen-vectors with arbitrary weights, we can obtain a new super-vector that represents a new voice as

s = μ̄ + Σ_{i=1}^{K} ν_i e_i,   K < S,   (26)

where s is a new super-vector, μ̄ is the mean of the super-vectors, e_i is the i-th eigen-vector, and ν_i is the weight for the i-th eigen-vector. Then, a new HMM set can be reconstructed from s. Figure 8 gives an overview of the eigenvoice technique, which can reduce the number of parameters to be controlled; this enables us to manually control the voice characteristics of synthesized speech by setting the weights. However, it introduces another problem in that it is difficult to control the voice characteristics intuitively, because none of the eigen-vectors usually represents a specific physical meaning.
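A minimal sketch of the eigenvoice construction (PCA via SVD) together with Eq. (26) for building a new voice from weights ν_i; the helper names and toy dimensions are assumptions for illustration.

```python
import numpy as np

def eigenvoice_space(supervectors, k):
    """Build an eigenvoice space from speaker super-vectors via PCA (sketch).

    supervectors : array of shape (S, D), one concatenated mean vector per
                   speaker-dependent HMM set
    k            : number of eigen-vectors to keep (k < S)
    Returns the mean super-vector and the top-k eigen-vectors (as rows).
    """
    mean = supervectors.mean(axis=0)
    centered = supervectors - mean
    # SVD of the centered data gives the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def new_voice(mean, eigvecs, weights):
    """Eq. (26): s = mean + sum_i nu_i * e_i."""
    return mean + np.asarray(weights) @ eigvecs

# Toy example: 6 "speakers", 10-dimensional super-vectors, 2 eigenvoices.
rng = np.random.default_rng(1)
S = rng.normal(size=(6, 10))
mu_bar, E = eigenvoice_space(S, k=2)
print(new_voice(mu_bar, E, weights=[0.8, -0.3]).shape)
```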

Multiple regression (controlling voices)

To solve this problem, Miyanaga et al. applied a multiple-regression approach (Fujinaga et al., 2001) to HMM-based speech synthesis to control voice characteristics intuitively (Miyanaga et al., 2004; Nose et al., 2007b), where the mean vectors of the state-output distributions⁹ were controlled with an L-dimensional control vector, z = [z_1, . . . , z_L], as

μ_j = M_j ξ,   ξ = [1, z^⊤]^⊤,   (27)

⁹ The state-duration distributions can also be controlled in the same manner (Nose et al., 2007b).


Image from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009
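Eq. (27) itself is just an affine map from a low-dimensional, human-interpretable control vector to a distribution mean; a small sketch with made-up shapes is below.

```python
import numpy as np

def regression_mean(M, z):
    """Eq. (27): mean of a state-output distribution from a control vector.

    M : regression matrix for this distribution, shape (D, L + 1)
    z : L-dimensional control vector (e.g. intuitive voice-quality scores)
    """
    xi = np.concatenate(([1.0], np.asarray(z, dtype=float)))  # xi = [1, z^T]^T
    return M @ xi

# Toy example: a 3-dimensional mean controlled by a 2-dimensional vector.
M = np.arange(9, dtype=float).reshape(3, 3)
print(regression_mean(M, z=[0.5, -1.0]))
```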

(6)

GMM-based voice conversion


… keeping the linguistic information unchanged. Different from the linguistic features, which are used as inputs for speech synthesis, the input features for voice conversion are typically continuous acoustic representations of a source voice. Many statistical approaches to voice conversion have been studied since the late 1980s, such as codebook mapping [55], GMM [2], [56], frequency warping [57], neural networks [58], partial least squares regression [59], the noisy channel model [60], etc. Among them, GMM-based voice conversion is the most popular [2], [56]. Figure 4 is a diagram of a typical GMM-based voice conversion system with parallel training data, which means that the training database contains the speech waveforms uttered by the source and target voices for the same texts. At the training stage, the acoustic features of the source and target speech in the training database are extracted by a vocoder and are aligned frame by frame by dynamic time warping. Then, the aligned pairs of the source acoustic feature vector x_t and the target acoustic feature vector y_t are concatenated to construct a joint feature vector z_t = [x_t^⊤, y_t^⊤]^⊤. Similar to HMM-based speech synthesis, the acoustic features x_t and y_t consist of static and dynamic components. Therefore, the acoustic feature sequences x = [x_1^⊤, x_2^⊤, …, x_T^⊤]^⊤ and y = [y_1^⊤, y_2^⊤, …, y_T^⊤]^⊤ can also be written as linear transforms of the static feature sequences x_s = [x_{s1}^⊤, …, x_{sT}^⊤]^⊤ and y_s = [y_{s1}^⊤, …, y_{sT}^⊤]^⊤ as x = M_x x_s and y = M_y y_s, where M_x and M_y are determined by the velocity and acceleration calculation functions [2]. Then, a joint distribution GMM (JD-GMM) λ with a set of parameters {α_m, μ_m^(z), Σ_m^(z)}_{m=1}^{M} is estimated to model the joint PDF between the source and target acoustic features, where M denotes the total number of mixture components in the JD-GMM, and α_m, μ_m^(z), and Σ_m^(z) correspond to the mixture weight, mean vector, and covariance matrix associated with the m-th Gaussian component. The mean vector and covariance matrix are structured as

μ_m^(z) = [μ_m^(x) ; μ_m^(y)],   Σ_m^(z) = [Σ_m^(xx), Σ_m^(xy) ; Σ_m^(yx), Σ_m^(yy)].   (10)

To reduce the number of model parameters and computational cost, Σ_m^(xx), Σ_m^(yy), Σ_m^(xy), and Σ_m^(yx) are commonly set to be diagonal [2]. These model parameters are typically estimated by the ML criterion as

λ* = arg max_λ p(x, y | λ)   (11)
   = arg max_λ ∏_{t=1}^{T} p(z_t | λ).   (12)

The conditional PDF given an input source acoustic feature sequence x̃ can be further derived from the trained JD-GMM λ* as

p(y | x̃, λ*) = Σ_{all m} p(y, m | x̃, λ*)   (13)
             = Σ_{all m} P(m | x̃, λ*) p(y | x̃, m, λ*),   (14)

where m = {m_1, …, m_T} denotes the sequence of mixture components, and P(m | x̃, λ*) = ∏_{t=1}^{T} P(m_t | x̃_t, λ*) …

[FIG4] A block diagram of a typical GMM-based voice conversion system: at training time, source and target speech pass through vocoder analysis, are aligned by DTW, and are used for JD-GMM training; at conversion time, the source speech goes through vocoder analysis, GMM mixture decision, acoustic parameter conversion with the conversion JD-GMM, and vocoder synthesis to produce the converted speech.

Parallel training data: align source and target speech frame-by-frame

Estimate a joint distribution GMM to model the joint PDF between source/target features

At conversion time, predict the most likely converted acoustic features given a source acoustic feature sequence

Image from Ling et al., “Deep Learning for Acoustic Modeling in Parametric Speech Generation”, 2015
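A minimal sketch of the training/conversion loop described above, using scikit-learn's GaussianMixture on joint [source; target] frames and converting each source frame with the standard conditional-Gaussian regression. It works frame-by-frame on static features only, so it omits the dynamic features, MLPG smoothing, and diagonal-block covariance constraints discussed in the text; the data and sizes are synthetic stand-ins.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
T, D = 2000, 4                      # toy data: T aligned frames, D dims each
x = rng.normal(size=(T, D))         # source features (already DTW-aligned)
y = 0.8 * x + 0.1 * rng.normal(size=(T, D))   # target features

gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(np.hstack([x, y]))          # joint vectors z_t = [x_t, y_t]

def convert(frame):
    """E[y | x] under the JD-GMM: mixture-weighted conditional means."""
    out = np.zeros(D)
    resp = np.zeros(len(gmm.weights_))
    for m in range(len(gmm.weights_)):
        mu_x, mu_y = gmm.means_[m, :D], gmm.means_[m, D:]
        S = gmm.covariances_[m]
        Sxx, Sxy = S[:D, :D], S[:D, D:]
        diff = frame - mu_x
        # Unnormalised responsibility of mixture m for this source frame.
        resp[m] = gmm.weights_[m] * np.exp(
            -0.5 * diff @ np.linalg.solve(Sxx, diff)
        ) / np.sqrt(np.linalg.det(2 * np.pi * Sxx))
        out += resp[m] * (mu_y + Sxy.T @ np.linalg.solve(Sxx, diff))
    return out / resp.sum()

print(convert(x[0]), y[0])
```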

(7)

Neural approaches to speech generation

(8)

Recall: DNN-based speech synthesis

2. DEEP NEURAL NETWORK

Here the depth of architecture refers to the number of levels of composition of non-linear operations in the function learned. It is known that most conventional learning algorithms correspond to shallow architectures (≤ 3 levels) [20]. For example, both a decision tree and a neural network with 1 hidden layer can be seen as having 2 levels.¹ Boosting [25], tree intersections [19, 26, 27], or a product of decision-tree-clustered experts [28] add one level to the base learner (i.e. 3 levels). A DNN, which is a neural network with multiple hidden layers, is a typical implementation of a deep architecture. We can have a deep architecture by adding multiple hidden layers to a neural network (adding one layer results in having one more level).

The properties of the DNN are contrasted with those of the decision tree as follows:

" Decision trees are inefficient to express complicated functions

of input features, such as XOR, d-bit parity function, or mul- tiplex problems [18]. To represent such cases, decision trees will be prohibitively large. On the other hand, they can be compactly represented by DNNs [20].

" Decision trees rely on a partition of the input space and using

a separate set of parameters for each region associated with a terminal node. This results in reduction of the amount of the data per region and poor generalization. Yuet al. showed that

“weak” input features such as word-level emphasis in reading speech were thrown away while building decision trees [29].

DNNs provide better generalization as weights are trained from all training data. They also offer incorporation of high- dimensional, disparate features as inputs.

" Training a DNN by back-propagation usually requires a much

larger amount of computation than building decision trees. At the prediction stage, DNNs require a matrix multiplication at each layer but decision trees just need traversing trees from their root to terminal nodes using a subset of input features.

" The decision trees induction can produce interpretable rules

while weights in a DNN are harder to interpret.

3. DNN-BASED SPEECH SYNTHESIS

Inspired by the human speech production system, which is believed to have layered hierarchical structures in transforming the information from the linguistic level to the waveform level [30], this paper applies a deep architecture to solve the speech synthesis problem.

Figure 1 illustrates a speech synthesis framework based on a DNN. A given text to be synthesized is first converted to a sequence of input features {x_t^n}, where x_t^n denotes the n-th input feature at frame t. The input features include binary answers to questions about linguistic contexts (e.g. is-current-phoneme-aa?) and numeric values (e.g. the number of words in the phrase, the relative position of the current frame in the current phoneme, and durations of the current phoneme).

Then the input features are mapped to output features {y_t^m} by a trained DNN using forward propagation, where y_t^m denotes the m-th output feature at frame t. The output features include spectral and excitation parameters and their time derivatives (dynamic features) [31]. The weights of the DNN can be trained using pairs of input and output features extracted from training data.

¹ Partition of an input feature space by a decision tree can be represented by a composition of OR and AND operation layers.

[Fig. 1: A speech synthesis framework based on a DNN — text analysis and input feature extraction produce frame-level input features (binary and numeric) for frames 1 … T; a DNN with several hidden layers maps them to statistics (mean and variance) of the speech parameter vector sequence; parameter generation and waveform synthesis then produce the speech.]

In the same fashion as the HMM-based approach, it is possible to generate speech parameters: by setting the predicted output features from the DNN as mean vectors and pre-computed variances of output features from all training data as covariance matrices, the speech parameter generation algorithm [32] can generate smooth trajectories of speech parameter features which satisfy the statistics of both static and dynamic features. Finally, a waveform synthesis module outputs a synthesized waveform given the speech parameters.

Note that the text analysis, speech parameter generation, and waveform synthesis modules of the DNN-based system can be shared with the HMM-based one, i.e. only the mapping module from context-dependent labels to statistics needs to be replaced.

4. EXPERIMENTS

4.1. Experimental conditions

Speech data in US English from a female professional speaker was used for training speaker-dependent HMM-based and DNN-based statistical parametric speech synthesizers. The training data consisted of about 33,000 utterances. The speech analysis conditions and model topologies were similar to those used for the Nitech-HTS 2005 [33] system. The speech data was downsampled from 48 kHz to 16 kHz sampling, then 40 Mel-cepstral coefficients [34], logarithmic fundamental frequency (log F0) values, and 5-band aperiodicities (0–1, 1–2, 2–4, 4–6, 6–8 kHz) [33] were extracted every 5 ms. Each observation vector consisted of 40 Mel-cepstral coefficients, log F0, and 5 band aperiodicities, and their delta and delta-delta features (3 × (40 + 1 + 5) = 138). Five-state, left-to-right, no-skip hidden semi-Markov models (HSMMs) [35] were used. To model log F0 sequences consisting of voiced and unvoiced observations …

Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014

Input features: linguistic contexts and numeric values (# of words, duration of the phoneme, etc.)

Output features: spectral and excitation parameters and their delta values
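A minimal sketch of the mapping step only (frame-level linguistic features → frame-level acoustic statistics), using scikit-learn's MLPRegressor in place of the paper's DNN; random arrays stand in for real aligned features, and the parameter-generation (MLPG) and waveform-synthesis stages are left out.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
T_train, n_in, n_out = 5000, 300, 138          # e.g. 138 = 3 * (40 + 1 + 5)
X = rng.random(size=(T_train, n_in))           # frame-level linguistic features
Y = rng.normal(size=(T_train, n_out))          # frame-level acoustic features

# Feed-forward network with a few hidden layers, trained frame by frame.
dnn = MLPRegressor(hidden_layer_sizes=(512, 512, 512),
                   activation="relu", max_iter=50)
dnn.fit(X, Y)

# At synthesis time the DNN outputs per-frame means; together with global
# variances they feed the parameter generation (MLPG) step, not shown here.
frame_means = dnn.predict(rng.random(size=(100, n_in)))
print(frame_means.shape)
```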

(9)

Recall: RNN-based speech synthesis

2. Deep Bidirectional LSTM (DBLSTM) Recurrent Neural Network

A recurrent neural network (RNN) computes the hidden state vector sequence h = (h_1, …, h_T) and the output vector sequence y = (y_1, …, y_T) for a given input vector sequence x = (x_1, …, x_T) by iterating the following equations from t = 1 to T:

h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h)   (1)
y_t = W_{hy} h_t + b_y   (2)

where the W are weight matrices (e.g. W_{xh} is the weight matrix between input and hidden vectors), the b are bias vectors (e.g. b_h is the bias vector for the hidden state vectors), and H is the nonlinear activation function for hidden nodes.
H is usually a sigmoid or hyperbolic tangent function in conventional RNNs, but the vanishing-gradient problem caused by these activation functions prevents the RNN from modeling long-span relations in sequential features. The long short-term memory (LSTM) network [11], shown in Fig. 1, which builds a memory cell inside, can overcome the problems of the conventional RNN and can model signals that have a mixture of low and high frequency components. For LSTM, H is implemented with the following functions [12]:

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)   (3)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)   (4)
c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)   (5)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)   (6)
h_t = o_t tanh(c_t)   (7)

where σ is the sigmoid function, and i, f, o and c are the input gate, forget gate, output gate and cell memory, respectively.

[Fig. 1: Long Short Term Memory cell — the input gate i_t, forget gate f_t and output gate o_t control how the input x_t and the cell memory c_t produce the hidden output h_t.]
A bidirectional RNN [13], as shown in Fig. 2, can access both the preceding and succeeding contexts. It separates the hidden layer into two parts, a forward state sequence h⃗ and a backward state sequence h⃖. The iterative process is:

h⃗_t = H(W_{x h⃗} x_t + W_{h⃗ h⃗} h⃗_{t-1} + b_{h⃗})   (8)
h⃖_t = H(W_{x h⃖} x_t + W_{h⃖ h⃖} h⃖_{t+1} + b_{h⃖})   (9)
y_t = W_{h⃗ y} h⃗_t + W_{h⃖ y} h⃖_t + b_y   (10)

A deep bidirectional RNN can be established by stacking multiple RNN hidden layers on top of each other. Each hidden state sequence h^n is replaced by a forward and a backward sequence, h⃗^n and h⃖^n, and the iterative process is:

h⃗^n_t = H(W_{h⃗^{n-1} h⃗^n} h⃗^{n-1}_t + W_{h⃗^n h⃗^n} h⃗^n_{t-1} + b_{h⃗^n})   (11)
h⃖^n_t = H(W_{h⃖^{n-1} h⃖^n} h⃖^{n-1}_t + W_{h⃖^n h⃖^n} h⃖^n_{t+1} + b_{h⃖^n})   (12)
y_t = W_{h⃗^N y} h⃗^N_t + W_{h⃖^N y} h⃖^N_t + b_y   (13)

Deep bidirectional LSTM (DBLSTM) is the integration of the deep bidirectional RNN and LSTM. By taking advantage of both DNNs and LSTM, it can model deep representations of long-span features.

[Fig. 2: Bidirectional RNN — forward and backward hidden layers connect the input sequence to the output sequence.]

3. DBLSTM-RNN based TTS Synthesis

Speech production can be seen as a process that selects spoken words, formulates their phonetics, and finally articulates output speech with the articulators; it is thus a continuous physical dynamic process. A DBLSTM-RNN can simulate human speech production through a layered, hierarchical structure that is wide in time scale, transforming linguistic text information into the final speech output. In a TTS synthesis system, where usually a whole sentence is given as input, there is no reason not to access long-range context in both forward and backward directions. We propose to use DBLSTM-RNN for TTS synthesis. The schematic diagram of DBLSTM-RNN based TTS synthesis is shown in Fig. 3.

[Fig. 3: DBLSTM-RNN based TTS synthesis — text analysis and input feature extraction produce input features, the DBLSTM-RNN maps them to output acoustic features, and a vocoder generates the waveform.]

In DBLSTM-RNN based TTS synthesis, rich contexts are also used as input features, which contain binary features for categorical contexts, e.g. phone labels, POS labels of the current word, and TOBI labels, and numerical features for the numerical contexts, e.g. the number of words in a phrase or the position of the current frame within the current phone. The output features are acoustic features like the spectral envelope and …


Access long-range context in both forward and backward directions using biLSTMs

Inference is expensive; biLSTMs inherently have large latency

Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
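A minimal numpy sketch of one forward step of the LSTM cell in Eqs. (3)–(7) above, run over a toy sequence with untrained random weights. The peephole terms W_ci, W_cf, W_co are applied element-wise (i.e. treated as diagonal), which is the common convention but an assumption here; no training is shown.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step implementing Eqs. (3)-(7) above (minimal sketch)."""
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])  # (3)
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])  # (4)
    c = f * c_prev + i * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])    # (5)
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c + b['o'])       # (6)
    h = o * np.tanh(c)                                                         # (7)
    return h, c

# Toy example: 4-dimensional input, 8 hidden units, 10 time steps.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in ('xi', 'xf', 'xc', 'xo')}
W.update({k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in ('hi', 'hf', 'hc', 'ho')})
W.update({k: rng.normal(scale=0.1, size=n_hid) for k in ('ci', 'cf', 'co')})
b = {k: np.zeros(n_hid) for k in ('i', 'f', 'c', 'o')}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(10):
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h)
```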

(10)

Frame-synchronous streaming 
 speech synthesis

[Fig. 1: Overview of the proposed streaming synthesis architecture using unidirectional LSTM-RNNs with a recurrent output layer — text analysis and linguistic feature extraction produce phoneme-level features x(1) … x(N); a duration LSTM-RNN predicts phoneme durations d̂(i); frame-level linguistic features feed an acoustic LSTM-RNN with a recurrent output layer, whose acoustic features drive a vocoder that emits the waveform.]

2. STREAMING SYNTHESIS USING UNIDIRECTIONAL LSTM-RNNS WITH RECURRENT OUTPUT LAYER

Figure 1 illustrates the proposed speech synthesis architecture using unidirectional LSTM-RNNs with a recurrent output layer. Here, duration prediction, acoustic feature prediction, and vocoding are executed in a streaming manner. The synthesis process can be outlined as follows:

1: Perform text analysis over the input text
2: Extract {x(i)}_{i=1}^{N}
3: for i = 1, …, N do            ▷ Loop over phonemes
4:     Predict d̂(i) given x(i) by Λ_d
5:     for τ = 1, …, d̂(i) do      ▷ Loop over frames
6:         Compose x_τ(i) from x(i), τ, and d̂(i)
7:         Predict ŷ_τ(i) given x_τ(i) by Λ_a
8:         Synthesize waveform given ŷ_τ(i), then stream the result
9:     end for
10: end for

where N is the total number of phonemes in the input utterance, and Λ_d and Λ_a are the duration and acoustic LSTM-RNNs, respectively. x(i) and d̂(i) correspond to the phoneme-level linguistic feature vector and the predicted phoneme duration of the i-th phoneme. x_τ(i) and ŷ_τ(i) are the frame-level linguistic feature vector and the predicted acoustic feature vector at the τ-th frame of the i-th phoneme, respectively. Note that the first two steps are sentence-level batch processing, whereas the remaining steps are streaming processing, as the first two steps are usually significantly faster than the remaining ones. The details of the LSTM-RNN and recurrent output layer are described in the next section.

2.1. LSTM-RNN

The LSTM-RNN architecture is designed to model temporal sequences and their long-term dependencies [18]. It has special units called memory blocks. The memory blocks contain memory cells with self-connections storing the temporal state of the network, in addition to special multiplicative units called gates that control the flow of information. It has been successfully applied to various applications, such as speech recognition [26, 27], handwriting recognition [28], and speech synthesis [19–22].

Typically, feedback loops at the hidden layers of an RNN are unidirectional; the input is processed from left to right, i.e. the flow of information is in the forward direction only. To use both past and future inputs for prediction, Schuster proposed the bidirectional RNN architecture [15]. It has forward and backward feedback loops that flow the information in both directions. This architecture enables the network to predict outputs using the inputs of the entire sequence. Bidirectional versions of LSTM-RNNs have been proposed [28] and applied to acoustic modeling for TTS [19, 22].

However, as inference using bidirectional LSTM-RNNs involves the propagation of inputs over time in both the forward and backward directions, bidirectional LSTM-RNNs inherently have large latency; to predict the first frame of a sequence, the inputs for the last frame need to be propagated through the network over time.

This prohibits using bidirectional LSTM-RNNs in commercial TTS services; if a user enters a very long text as input for TTS, the latency can be prohibitively large.

Unidirectional LSTM-RNNs do not have this issue, as the forward propagation can be done in a frame-synchronous, streaming manner. They can still access future inputs by windowing, looking ahead, or delaying outputs, with a reasonable increase in the number of parameters. This paper investigates unidirectional LSTM-RNNs as the acoustic model for TTS.

2.2. Recurrent Output Layer

A single-hidden-layer, forward-directional RNN¹ computes hidden activations {h_t}_{t=1}^{T} and output features {y_t}_{t=1}^{T} given input features {x_t}_{t=1}^{T} by iterating the following recursion:

h_t = f(W_{hx} x_t + W_{hh} h_{t-1} + b_h),   (1)
y_t = W_{yh} h_t + b_y,   (2)

where h_0 = 0; W_{hx}, W_{yh}, and W_{hh} correspond to the weight matrices for the input/hidden connection, the hidden/output connection, and the feedback loop at the hidden layer; b_h and b_y are the bias vectors for the hidden and output layers; and f(·) and g(·) are the activation functions for the hidden and output layers, respectively. The feedback mechanism in Eq. (1), i.e. activations at the previous time step being fed back into the network along with the inputs, allows the network to propagate information across frames (time) and learn sequences.

The recurrent output layer is a simple extension of the conventional RNN: a recurrent connection is used at the output layer as well. Equation (2) is extended to have a recurrent term as

y_t = W_{yh} h_t + W_{yy} y_{t-1} + b_y.   (3)

¹ For notational simplicity, the activation function definitions for a simple RNN are given here to describe the recurrent output layer. In the actual implementation, h_t is computed with an LSTM layer.

Image from Zen & Sak, Unidirectional LSTM RNNs for low-latency speech synthesis, 2015

(11)

Deep generative models

Code (Gaussian, Uniform, etc.)

Deep generative model

Real data
 (images, sounds, etc.)


Image from https://blog.openai.com/generative-models/

Example: Autoregressive models (Wavenet)

(12)

Wavenet

Speech synthesis using an auto-regressive generative model

Generates the waveform sample-by-sample: 16 kHz sampling rate

Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/

(13)

Wavenet

Wavenet uses “dilated convolutions”

Main limitation: Very slow generation rate [Oct 2017: Wavenet deployed in Google Assistant¹]

Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/

¹ https://techcrunch.com/2017/10/04/googles-wavenet-machine-learning-based-speech-synthesis-comes-to-assistant/
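A minimal numpy sketch of the dilated causal convolutions mentioned above: each layer only looks backwards in time, and stacking layers with dilations 1, 2, 4, 8 grows the receptive field exponentially while keeping each layer cheap. This is a single-channel toy with fixed weights, not the gated residual blocks of the real Wavenet.

```python
import numpy as np

def causal_dilated_conv(x, weights, dilation):
    """1-D causal dilated convolution with a length-2 filter (sketch).

    y[t] = w0 * x[t - dilation] + w1 * x[t]; positions before the start of
    the signal are treated as zeros, so no future sample is ever used.
    """
    w0, w1 = weights
    padded = np.concatenate([np.zeros(dilation), x])
    return w0 * padded[:len(x)] + w1 * x

# Stack layers with dilations 1, 2, 4, 8: the receptive field grows to
# 1 + (2 - 1) * (1 + 2 + 4 + 8) = 16 samples while each layer stays cheap.
signal = np.random.default_rng(0).normal(size=32)
h = signal
for d in (1, 2, 4, 8):
    h = np.tanh(causal_dilated_conv(h, weights=(0.5, 0.5), dilation=d))
print(h.shape)
```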

(14)

Wavenet

Reduced the gap between the state-of-the-art and human performance by > 50%

Recording 1

Recording 2

Recording 3

Which of the three recordings sounded most natural?

(15)

Deep generative models

Code (Gaussian, Uniform, etc.)

Deep generative model

True data
 (images, sounds, etc.)


Image from https://blog.openai.com/generative-models/

Example: Generative Adversarial Networks (GANs)

(16)

GANs

Training process is formulated as a game between a generator network and a discriminative network

Objective of the generator: Create samples that seem to be from the same distribution as the training data

Objective of the discriminator: Examine a sample and distinguish between fake and real samples

Solution to this game is an equilibrium between the generator and the discriminator

Refer to [Goodfellow16] for a detailed tutorial on GANs

[Goodfellow16]: https://arxiv.org/pdf/1701.00160.pdf
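A minimal sketch of the two losses in that game, with toy callables standing in for the generator and discriminator networks; real implementations alternate gradient steps on these losses (see the tutorial above), and the non-saturating generator loss used here is one common choice among several.

```python
import numpy as np

def gan_losses(D, G, real_batch, noise_batch, eps=1e-8):
    """Losses for the two-player GAN game (a conceptual sketch).

    D maps samples to probabilities of being real; G maps noise codes to
    samples. The discriminator minimizes its loss (classifying real vs. fake
    correctly); the generator minimizes -log D(G(z)), trying to make its
    samples look real to the discriminator.
    """
    fake_batch = G(noise_batch)
    d_loss = -np.mean(np.log(D(real_batch) + eps)
                      + np.log(1.0 - D(fake_batch) + eps))
    g_loss = -np.mean(np.log(D(fake_batch) + eps))
    return d_loss, g_loss

# Dummy generator/discriminator just to show the bookkeeping.
rng = np.random.default_rng(0)
D = lambda x: 1.0 / (1.0 + np.exp(-x.sum(axis=1)))   # toy "discriminator"
G = lambda z: 2.0 * z                                # toy "generator"
print(gan_losses(D, G, real_batch=rng.normal(size=(8, 3)),
                 noise_batch=rng.normal(size=(8, 3))))
```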
