Nayak, Shekhar; Kumar, C Shiva; Kodukula, Sri Rama Murty; et al.
(2020)
Virtual phone discovery for speech synthesis without text.
In: 7th IEEE Global Conference on Signal and Information Processing (GlobalSIP), 11-14 November 2019, Ottawa, Canada.
Full text not available from this repository.
Abstract
The objective of this work is to re-synthesize speech directly from speech signals, without using any text, in a different speaker's voice. The speech signals are transformed into a sequence of acoustic subword units, or virtual phones, which are discovered automatically from the given speech signals in an unsupervised manner. Each speech signal is first segmented into acoustically homogeneous segments through kernel-Gram segmentation using MFCC and autoencoder bottleneck features. These segments are then clustered using different clustering techniques, and the resulting cluster labels are treated as virtual phone units with which the speech signals are transcribed. The virtual phones for the utterances to be re-synthesized are encoded as one-hot vector sequences, on which a deep neural network based duration model and acoustic model are trained for synthesis. A vocoder then synthesizes speech in the target speaker's voice from the features estimated by the acoustic model. Performance is evaluated on the ZeroSpeech 2019 challenge for English and Indonesian. The bitrate and speaker similarity were found to be better than the challenge baseline, with slightly lower intelligibility due to the compact encoding.
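As an illustration of the virtual-phone discovery stage described in the abstract, the following minimal Python sketch segments an utterance using a Foote-style checkerboard kernel over the frame-level Gram (self-similarity) matrix and then clusters segment-level feature averages with k-means, taking the cluster IDs as virtual phone labels. It assumes librosa and scikit-learn, uses MFCCs only (the paper additionally uses autoencoder bottleneck features), and every function name and parameter here is a hypothetical stand-in, not the authors' implementation.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def segment_boundaries(feats, kernel_size=16, threshold=0.5):
    """Crude kernel-Gram segmentation: build a cosine self-similarity
    (Gram) matrix over frames and pick novelty peaks along its diagonal.
    A stand-in for the paper's kernel-Gram method."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    gram = f @ f.T                               # frame-by-frame similarity
    k = kernel_size
    # Checkerboard kernel: +1 within each side of a candidate boundary,
    # -1 across it, so the correlation peaks at acoustic change points.
    checker = np.outer(np.r_[np.ones(k), -np.ones(k)],
                       np.r_[np.ones(k), -np.ones(k)])
    novelty = np.array([np.sum(gram[i - k:i + k, i - k:i + k] * checker)
                        for i in range(k, len(f) - k)])
    novelty = (novelty - novelty.min()) / (np.ptp(novelty) + 1e-8)
    peaks = [i + k for i in range(1, len(novelty) - 1)
             if novelty[i] > threshold
             and novelty[i] >= novelty[i - 1]
             and novelty[i] >= novelty[i + 1]]
    return [0] + peaks + [len(feats)]

def discover_virtual_phones(wav_paths, n_units=64):
    """Label acoustically homogeneous segments with cluster IDs
    ('virtual phones') via k-means over segment-mean MFCCs."""
    segments, spans = [], []
    for path in wav_paths:
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (frames, 13)
        bounds = segment_boundaries(mfcc)
        for s, e in zip(bounds[:-1], bounds[1:]):
            segments.append(mfcc[s:e].mean(axis=0))   # one vector per segment
            spans.append((path, s, e))
    labels = KMeans(n_clusters=n_units, n_init=10).fit_predict(np.stack(segments))
    # Each utterance is now transcribed as a sequence of virtual phone IDs.
    return [(path, s, e, int(lbl)) for (path, s, e), lbl in zip(spans, labels)]
```

In the full pipeline, the label sequences produced this way would be one-hot encoded to train the duration and acoustic models, with a vocoder rendering the acoustic model's output features as speech in the target speaker's voice.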