Mandarin Speech Synthesis Using Spectrum-Progression Model and HNM
HNM: Harmonic-plus-Noise Model
Hung-Yan Gu and Chang-Yi Wu
e-mail: guhy@mail.ntust.edu.tw


ABSTRACT
In this paper, an ANN-based spectrum-progression model is proposed to improve the fluency level of synthetic Mandarin speech. First, each target syllable (uttered in a sentence) is matched with its corresponding reference syllable (uttered in isolation) by dynamic time warping. Then, each warped path, i.e., the spectrum-progression path, is time-normalized to fixed dimensions and used to train an ANN-based spectrum-progression model (SPM). After training, the SPM is used together with other modules, such as text analysis, prosody parameter generation, and signal sample generation, to synthesize Mandarin speech. The synthetic speech is then used to conduct perception tests. The test results show that the proposed SPM can indeed improve the fluency level noticeably.
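The abstract describes this procedure only in prose; the following is a minimal sketch (not the authors' implementation) of how a spectrum-progression path could be obtained by DTW and time-normalized into a fixed-length training target for the ANN. The choice of spectral features (e.g., one MFCC vector per frame), the number of normalized points, and all function names are assumptions made for illustration.

    import numpy as np

    def dtw_path(target_feats, ref_feats):
        """Align the frames of a target (in-sentence) syllable to those of its
        reference (isolated) syllable with dynamic time warping.

        target_feats : (T, D) array of spectral feature vectors, one per frame
        ref_feats    : (R, D) array of spectral feature vectors, one per frame
        Returns the warping path as an array of (target_frame, ref_frame) pairs.
        """
        T, R = len(target_feats), len(ref_feats)
        # Pairwise Euclidean distances between target and reference frames.
        dist = np.linalg.norm(target_feats[:, None, :] - ref_feats[None, :, :], axis=2)
        # Accumulated-cost matrix with a sentinel row and column.
        cost = np.full((T + 1, R + 1), np.inf)
        cost[0, 0] = 0.0
        for t in range(1, T + 1):
            for r in range(1, R + 1):
                cost[t, r] = dist[t - 1, r - 1] + min(
                    cost[t - 1, r - 1], cost[t - 1, r], cost[t, r - 1])
        # Backtrack from the end of both syllables to the start.
        path = [(T - 1, R - 1)]
        t, r = T, R
        while (t, r) != (1, 1):
            t, r = min([(t - 1, r - 1), (t - 1, r), (t, r - 1)],
                       key=lambda cell: cost[cell])
            path.append((t - 1, r - 1))
        return np.array(path[::-1])

    def spectrum_progression_target(path, ref_len, num_points=20):
        """Time-normalize a warping path into a fixed-length vector: the relative
        position reached in the reference syllable at num_points equally spaced
        relative positions of the target syllable."""
        # Keep one reference frame per target frame (the last one on the path).
        ref_per_target = dict(path)
        t_idx = np.array(sorted(ref_per_target))
        r_idx = np.array([ref_per_target[t] for t in t_idx])
        t_rel = t_idx / max(t_idx[-1], 1)
        r_rel = r_idx / max(ref_len - 1, 1)
        grid = np.linspace(0.0, 1.0, num_points)
        return np.interp(grid, t_rel, r_rel)

    # Example usage (hypothetical feature matrices):
    # path = dtw_path(target_mfcc, ref_mfcc)
    # y = spectrum_progression_target(path, ref_len=len(ref_mfcc))  # shape (num_points,)

The fixed-length vector returned by spectrum_progression_target would then serve as the ANN's training target for that syllable; this interpretation follows the abstract's wording but the exact normalization used by the authors is not specified here.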



(a) Each Mandarin syllable has only one recorded utterance, spoken by a female speaker; therefore, unit selection is not possible.
(b) The HNM parameters analyzed from a source syllable are used to synthesize syllables with diverse prosodic characteristics.
(c) Speaking-rate control and timbre transformation are implemented.
(d) Computation platform: a notebook computer with an Intel Core 2 T5600 CPU at 1.83 GHz, running Linux on top of Windows XP.
(e) CPU rate means the consumed CPU time divided by the time length of the synthesized voice file (a worked example follows this list).
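As a worked illustration of note (e), with hypothetical durations chosen only to reproduce the 13.2% figure reported for the first example below: if synthesizing a 10-second voice file consumes 1.32 seconds of CPU time, then CPU rate = 1.32 s / 10 s = 13.2%.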


Synthetic Speech: A white-flower tree (白花樹)
Text file: text_s5;  Avg. syllable duration: 300 ms;  Pitch: 220 Hz.
Recording of program execution.
Using the spectrum-progression model and HNM.
Timbre transformation, female => male (VTL: 100/80%, Pitch: 140 Hz).
Timbre transformation, female => child (VTL: 100/115%, Pitch: 140 Hz or 280 Hz).
(A sketch of such a VTL-based timbre transformation follows this example.)
CPU rate: 13.2%.
Using linear time mapping and HNM.
Using linear time mapping and PSOLA.
Direct concatenation of recorded syllables.
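The timbre transformations above are specified only by a vocal-tract-length (VTL) ratio and a target pitch; the page does not describe how they are applied. Below is a minimal, hypothetical sketch of one common way such a transformation can be realized with HNM parameters: warp the frame's spectral envelope along the frequency axis according to the VTL ratio and re-sample it at harmonics of the new pitch. The function name, the assumption that the envelope is available on a uniform frequency grid, and the reading of "VTL: 100/80%" as a lengthening factor of 1.25 are all illustrative, not taken from the paper.

    import numpy as np

    def transform_timbre(env_freqs, env_mags, vtl_ratio, new_f0):
        """Hypothetical VTL-based timbre transformation of one HNM analysis frame.

        env_freqs : frequency grid (Hz) on which the spectral envelope is sampled
        env_mags  : spectral-envelope magnitudes on that grid
        vtl_ratio : target VTL / source VTL, e.g. 100/80% read as 1.25 (female => male)
        new_f0    : target pitch in Hz, e.g. 140 Hz
        Returns harmonic frequencies and amplitudes for resynthesis of the frame.
        """
        # Formant frequencies scale roughly with the inverse of vocal-tract length,
        # so a longer tract (ratio > 1) shifts the envelope toward lower frequencies.
        warped_freqs = env_freqs / vtl_ratio

        # Place new harmonics at multiples of the target pitch, up to the band edge
        # of the warped envelope.
        n_harm = int(warped_freqs[-1] // new_f0)
        harm_freqs = new_f0 * np.arange(1, n_harm + 1)

        # Read the new harmonic amplitudes off the warped envelope.
        harm_amps = np.interp(harm_freqs, warped_freqs, env_mags)
        return harm_freqs, harm_amps

Under this reading, a ratio of 1.25 together with a 140 Hz pitch lowers both formants and F0, the usual direction for a female-to-male conversion, while the inverse ratio 100/115% raises the formants for the child voice.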




Synthetic Speech: The most-liked season (最喜歡的季節)
Text file: text_s3;  Avg. syllable duration: 330 ms;  Pitch: 220 Hz.
Recording of program execution.
Using the spectrum-progression model and HNM.
CPU rate: 12.1%.
Using linear time mapping and HNM.
Using linear time mapping and PSOLA.
Direct concatenation of recorded syllables.




Synthetic Speech: News reporting (新聞播報)
Text file: text_sn;  Avg. syllable duration: 220 ms;  Pitch: 220 Hz.
Recording of program execution.
Using the spectrum-progression model and HNM.
Timbre transformation, female => male (VTL: 100/85%, Pitch: 120 Hz).
CPU rate: 17.1%.
Using linear time mapping and HNM.
Using linear time mapping and PSOLA.