Sreekanth, S and Rafi, B Shaik Mohammad and Kodukula, Sri Rama Murty, et al.
(2019)
Speaker embedding extraction with virtual phonetic information.
In: 7th IEEE Global Conference on Signal and Information Processing, GlobalSIP, 11-14 November 2019, Ottawa, Canada.
Full text not available from this repository.
Abstract
In the recent past, deep neural networks have been successfully employed to extract fixed-dimensional speaker embeddings from the speech signal. The commonly used x-vectors are extracted by projecting the magnitude spectral features extracted from the speech signal onto a speaker-discriminative space. As the x-vectors do not explicitly capture the speaker-specific phonological pronunciation variability, phonetic vectors extracted from an automatic speech recognition (ASR) engine were supplied as auxiliary information to improve the performance of the x-vector system. However, the development of an ASR engine requires a huge amount of manually transcribed speech data. In this paper, we propose to transcribe the speech signal in an unsupervised manner with the cluster labels obtained from a mixture of autoencoders (MoA) trained on a large amount of speech data. The unsupervised labels, referred to as virtual phonetic transcriptions, are used to extract the phonetic vectors. The virtual phonetic vectors extracted using MoA are supplied as auxiliary information to the x-vector system. The performance of the proposed system is compared with the state-of-the-art x-vector system on NIST SRE-2010 data. The proposed unsupervised auxiliary information provides a relative improvement of 12.08%, 3.61% and 16.66% over the x-vector system on core-core, core-10sec and 10sec-10sec conditions, respectively.
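As an illustration of the mixture-of-autoencoders labelling described in the abstract, the following is a minimal sketch, assuming 40-dimensional spectral feature frames, 32 component autoencoders, and PyTorch; the layer sizes and the soft-min training loss are illustrative assumptions, not the exact configuration used in the paper.

# Minimal, illustrative sketch (not the authors' exact model) of a mixture of
# autoencoders (MoA) producing unsupervised frame-level cluster labels
# ("virtual phonetic transcriptions") from spectral features.
# Assumptions: 40-dim features, 32 components, PyTorch.

import torch
import torch.nn as nn


class ComponentAE(nn.Module):
    """One small autoencoder in the mixture (sizes are assumptions)."""
    def __init__(self, feat_dim=40, bottleneck=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))


class MixtureOfAutoencoders(nn.Module):
    """K autoencoders; each frame is labelled by the component that
    reconstructs it best (lowest squared error)."""
    def __init__(self, num_components=32, feat_dim=40):
        super().__init__()
        self.components = nn.ModuleList(
            ComponentAE(feat_dim) for _ in range(num_components))

    def reconstruction_errors(self, frames):
        # frames: (num_frames, feat_dim) -> errors: (num_frames, K)
        errs = [((ae(frames) - frames) ** 2).mean(dim=1)
                for ae in self.components]
        return torch.stack(errs, dim=1)

    def virtual_labels(self, frames):
        # Hard cluster label per frame = index of best-reconstructing component.
        return self.reconstruction_errors(frames).argmin(dim=1)


if __name__ == "__main__":
    moa = MixtureOfAutoencoders()
    optim = torch.optim.Adam(moa.parameters(), lr=1e-3)

    frames = torch.randn(512, 40)      # stand-in for a batch of feature frames
    for _ in range(5):                 # a few illustrative training steps
        errs = moa.reconstruction_errors(frames)
        # Soft-min over components keeps every autoencoder trainable
        # (an assumed training objective, used here only for illustration).
        loss = -torch.logsumexp(-errs, dim=1).mean()
        optim.zero_grad()
        loss.backward()
        optim.step()

    labels = moa.virtual_labels(frames)  # unsupervised "virtual phonetic" labels
    print(labels[:10])

In a full system, these frame-level labels would then be used in place of ASR phone labels to derive the phonetic vectors supplied as auxiliary input to the x-vector extractor.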