Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Processing

Nayak, Shekhar and Kodukula, Sri Rama Murty (2019) Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Processing. PhD thesis, Indian Institute of Technology Hyderabad.

Full text not available from this repository.

Abstract

Zero resource speech processing refers to techniques that do not require manually transcribed speech data. The inspiration for zero resource processing is drawn from language acquisition in infants, which is completely self-driven. Infants learn different levels of abstraction, i.e. phones, words and some syntactic aspects of the language they are exposed to, without any supervision or feedback. This has motivated research in the speech community towards completely unsupervised algorithms that can discover subword and word units from the speech signal alone. Applications include spoken term discovery, language identification and keyword spotting. Zero resource techniques can be effective in solving problems associated with developing speech systems for low resource languages. Low resource languages have little transcribed data and/or few native speakers. Several languages of the world have become endangered, with almost negligible resources. The lack of transcribed data for low resource languages has inspired several approaches, such as data augmentation and cross-lingual and multilingual techniques, with limited success. In this thesis, we explore better feature representations for low resource speech recognition and then build unsupervised algorithms for zero resource speech processing, which could point towards effective solutions to the low resource problem. Traditional speech recognition systems employ magnitude-based features for building acoustic models. The phase of the speech signal is generally ignored, as the human ear was traditionally considered to be insensitive to phase. Recent perceptual studies have shown the importance of phase in human speech recognition. Motivated by this, and in order to extract the maximum information from the limited transcribed data available in low resource settings, we propose to extract features from the analytic phase of speech signals for speech recognition.
To avoid the phase wrapping problem, instantaneous frequency is extracted from the speech signal without explicit phase computation. Different instantaneous frequency estimation methods are studied for providing effective features for speech recognition. Magnitude-based and phase-based features are used to train separate phone recognition systems. Combining the magnitude-based and phase-based systems improves speech recognition in low resource settings and noisy conditions. Inspired by the recent interest in zero resource processing in the speech community, we address the scarcity of transcribed data at a more fundamental level by producing artificial or virtual transcriptions from speech signals alone. Motivated by infant learning, zero resource speech processing aims at discovering acoustic word units from the speech signal alone, without using any manual transcriptions or linguistic knowledge. We propose an unsupervised speech signal to symbol transformation approach to obtain virtual phones/labels from given speech signals. Syllable-like units, obtained by combining multiple sources of evidence for vowel endpoint detection in speech signals, are presented as alternative units to virtual phones for signal to symbol transformation. Several speech applications are presented which employ these virtual phones or syllable-like units to automatically transcribe speech data in zero resource settings. Spoken term discovery and speaking rate estimation are achieved in zero resource settings using the proposed methods. A completely unsupervised language identification approach is proposed and is shown to perform close to the supervised approach. Further, a virtual phone recognition/synthesis approach based on signal to symbol transformation is proposed for ultra low bitrate coding. Future directions are provided for improving low resource speech processing by employing automatic labeling to obtain performance closer to that of supervised techniques.
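
As an illustration of how instantaneous frequency can be obtained without computing (and unwrapping) the absolute phase, the sketch below uses one common textbook estimator: the angle of the product of each analytic-signal sample with the conjugate of the previous one. This is not necessarily one of the estimation methods studied in the thesis; the function names `analytic_signal` and `instantaneous_frequency`, the naive DFT, and the test tone are assumptions made for this example only.

```python
import cmath
import math


def analytic_signal(x):
    """Return the analytic signal of a real sequence x (illustrative helper).

    Uses a naive O(n^2) DFT, which is adequate for a short frame:
    negative-frequency bins are zeroed and positive ones doubled.
    """
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # One-sided spectral weights: keep DC (and Nyquist for even n) once,
    # double the positive frequencies, drop the negative ones.
    weights = [0.0] * n
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        upper = n // 2
    else:
        upper = (n + 1) // 2
    for k in range(1, upper):
        weights[k] = 2.0
    shaped = [s * w for s, w in zip(spectrum, weights)]
    return [
        sum(shaped[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def instantaneous_frequency(x, sample_rate):
    """Estimate instantaneous frequency in Hz between successive samples.

    The angle of z[t+1] * conj(z[t]) is the phase increment, which lies in
    (-pi, pi], so the absolute phase is never formed and never unwrapped.
    """
    z = analytic_signal(x)
    return [
        cmath.phase(z[t + 1] * z[t].conjugate()) * sample_rate / (2 * math.pi)
        for t in range(len(z) - 1)
    ]


if __name__ == "__main__":
    fs = 8000.0   # assumed sampling rate for the example
    f0 = 1000.0   # assumed tone frequency (an exact number of cycles per frame)
    frame = [math.cos(2 * math.pi * f0 * t / fs) for t in range(64)]
    freqs = instantaneous_frequency(frame, fs)
    print(min(freqs), max(freqs))  # both values should be close to 1000 Hz
```

For a pure tone that completes a whole number of cycles in the frame, every estimate should sit very close to the tone frequency. Real speech frames would typically be band-pass filtered first, since instantaneous frequency is only meaningful for narrowband components.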


IITH Creators: