Mandarin Speech Synthesis Using Spectrum-Progression Model and HNM
HNM: Harmonic-plus-Noise Model
Hung-Yan Gu and Chang-Yi Wu
e-mail: guhy@mail.ntust.edu.tw


ABSTRACT
In this paper, an ANN-based spectrum-progression model is proposed to improve the fluency level of synthetic Mandarin speech. First, each target syllable (uttered in a sentence) is matched with its corresponding reference syllable (uttered in isolation) by dynamic time warping. Then, each warped path, i.e., the spectrum-progression path, is time-normalized to fixed dimensions and used to train an ANN-based spectrum-progression model (SPM). After training, the SPM is used together with other modules, such as text analysis, prosody parameter generation, and signal sample generation, to synthesize Mandarin speech. The synthetic speech is then used to conduct perception tests. The test results show that the proposed SPM can indeed improve the fluency level noticeably.
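The abstract describes this procedure only in prose; the following is a minimal sketch (not the authors' implementation) of how a spectrum-progression path could be obtained by DTW and time-normalized into a fixed-length training target for the ANN. The choice of spectral features (e.g., one MFCC vector per frame), the number of normalized points, and all function names are assumptions made for illustration.

    import numpy as np

    def dtw_path(target_feats, ref_feats):
        """Align the frames of a target (in-sentence) syllable to those of its
        reference (isolated) syllable with dynamic time warping.

        target_feats : (T, D) array of spectral feature vectors, one per frame
        ref_feats    : (R, D) array of spectral feature vectors, one per frame
        Returns the warping path as an array of (target_frame, ref_frame) pairs.
        """
        T, R = len(target_feats), len(ref_feats)
        # Pairwise Euclidean distances between target and reference frames.
        dist = np.linalg.norm(target_feats[:, None, :] - ref_feats[None, :, :], axis=2)
        # Accumulated-cost matrix with a sentinel row and column.
        cost = np.full((T + 1, R + 1), np.inf)
        cost[0, 0] = 0.0
        for t in range(1, T + 1):
            for r in range(1, R + 1):
                cost[t, r] = dist[t - 1, r - 1] + min(
                    cost[t - 1, r - 1], cost[t - 1, r], cost[t, r - 1])
        # Backtrack from the end of both syllables to the start.
        path = [(T - 1, R - 1)]
        t, r = T, R
        while (t, r) != (1, 1):
            t, r = min([(t - 1, r - 1), (t - 1, r), (t, r - 1)],
                       key=lambda cell: cost[cell])
            path.append((t - 1, r - 1))
        return np.array(path[::-1])

    def spectrum_progression_target(path, ref_len, num_points=20):
        """Time-normalize a warping path into a fixed-length vector: the relative
        position reached in the reference syllable at num_points equally spaced
        relative positions of the target syllable."""
        # Keep one reference frame per target frame (the last one on the path).
        ref_per_target = dict(path)
        t_idx = np.array(sorted(ref_per_target))
        r_idx = np.array([ref_per_target[t] for t in t_idx])
        t_rel = t_idx / max(t_idx[-1], 1)
        r_rel = r_idx / max(ref_len - 1, 1)
        grid = np.linspace(0.0, 1.0, num_points)
        return np.interp(grid, t_rel, r_rel)

    # Example usage (hypothetical feature matrices):
    # path = dtw_path(target_mfcc, ref_mfcc)
    # y = spectrum_progression_target(path, ref_len=len(ref_mfcc))  # shape (num_points,)

The fixed-length vector returned by spectrum_progression_target would then serve as the ANN's training target for that syllable; this interpretation follows the abstract's wording but the exact normalization used by the authors is not specified here.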



(a) Each Mandarin syllable has only one recorded utterance, spoken by a female speaker; therefore, unit selection is not possible.
(b) The HNM parameters analyzed from a source syllable are used to synthesize syllables with diverse prosodic characteristics.
(c) Speaking-rate control and timbre transformation are implemented.
(d) Computation platform: a notebook computer with an Intel Core 2 T5600 CPU at 1.83 GHz, running Linux on top of Windows XP.
(e) CPU rate means the consumed CPU time divided by the time length of the synthesized voice file (a worked example follows this list).
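As a worked illustration of note (e), with hypothetical durations chosen only to reproduce the 13.2% figure reported for the first example below: if synthesizing a 10-second voice file consumes 1.32 seconds of CPU time, then CPU rate = 1.32 s / 10 s = 13.2%.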


Synthetic Speech: A white-flower tree (白花樹)
Text file: text_s5;  Avg. syllable duration: 300 ms;  Pitch: 220 Hz.
Recording of program execution.
Using the spectrum-progression model and HNM.
Timbre transformation, female => male (VTL: 100/80%, Pitch: 140 Hz).
Timbre transformation, female => child (VTL: 100/115%, Pitch: 140 Hz or 280 Hz).
(A sketch of such a VTL-based timbre transformation follows this example.)
CPU rate: 13.2%.
Using linear time mapping and HNM.
Using linear time mapping and PSOLA.
Direct concatenation of recorded syllables.
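The timbre transformations above are specified only by a vocal-tract-length (VTL) ratio and a target pitch; the page does not describe how they are applied. Below is a minimal, hypothetical sketch of one common way such a transformation can be realized with HNM parameters: warp the frame's spectral envelope along the frequency axis according to the VTL ratio and re-sample it at harmonics of the new pitch. The function name, the assumption that the envelope is available on a uniform frequency grid, and the reading of "VTL: 100/80%" as a lengthening factor of 1.25 are all illustrative, not taken from the paper.

    import numpy as np

    def transform_timbre(env_freqs, env_mags, vtl_ratio, new_f0):
        """Hypothetical VTL-based timbre transformation of one HNM analysis frame.

        env_freqs : frequency grid (Hz) on which the spectral envelope is sampled
        env_mags  : spectral-envelope magnitudes on that grid
        vtl_ratio : target VTL / source VTL, e.g. 100/80% read as 1.25 (female => male)
        new_f0    : target pitch in Hz, e.g. 140 Hz
        Returns harmonic frequencies and amplitudes for resynthesis of the frame.
        """
        # Formant frequencies scale roughly with the inverse of vocal-tract length,
        # so a longer tract (ratio > 1) shifts the envelope toward lower frequencies.
        warped_freqs = env_freqs / vtl_ratio

        # Place new harmonics at multiples of the target pitch, up to the band edge
        # of the warped envelope.
        n_harm = int(warped_freqs[-1] // new_f0)
        harm_freqs = new_f0 * np.arange(1, n_harm + 1)

        # Read the new harmonic amplitudes off the warped envelope.
        harm_amps = np.interp(harm_freqs, warped_freqs, env_mags)
        return harm_freqs, harm_amps

Under this reading, a ratio of 1.25 together with a 140 Hz pitch lowers both formants and F0, the usual direction for a female-to-male conversion, while the inverse ratio 100/115% raises the formants for the child voice.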




Synthetic Speech: The most-liked season (最喜歡的季節)
Text file: text_s3;  Avg. syllable duration: 330 ms;  Pitch: 220 Hz.
Recording of program execution.
Using the spectrum-progression model and HNM.
CPU rate: 12.1%.
Using linear time mapping and HNM.
Using linear time mapping and PSOLA.
Direct concatenation of recorded syllables.




Synthetic Speech: News reporting (新聞播報)
Text file: text_sn;  Avg. syllable duration: 220 ms;  Pitch: 220 Hz.
Recording of program execution.
Using the spectrum-progression model and HNM.
Timbre transformation, female => male (VTL: 100/85%, Pitch: 120 Hz).
CPU rate: 17.1%.
Using linear time mapping and HNM.
Using linear time mapping and PSOLA.