Nayak, Shekhar and Kodukula, Sri Rama Murty
(2019)
Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Processing.
PhD thesis, Indian Institute of Technology Hyderabad.
Full text not available from this repository.
Abstract
Zero resource speech processing refers to techniques that do not require manually
transcribed speech data. The inspiration for zero resource processing is drawn from language
acquisition in infants, which is completely self-driven. Infants learn different abstraction
levels, i.e. phones, words and some syntactic aspects of the language they are exposed
to, without any supervision or feedback. This has motivated research in the speech community
towards the development of completely unsupervised speech algorithms that
can discover subword/word units from the speech signal alone. Applications include
spoken term discovery, language identification and keyword spotting. Zero resource
techniques can be effective in solving problems associated with the development of
speech systems for low resource languages.
Low resource languages have a small amount of transcribed data and/or a small number
of native speakers. Several languages of the world have become endangered,
with almost negligible resources. The lack of transcribed data for low resource
languages has inspired many approaches to this problem, such as data augmentation
and cross-lingual and multilingual techniques, with limited success. In this thesis,
we explore better feature representations for low resource speech recognition and later
build unsupervised algorithms for zero resource speech processing, which could point
towards effective solutions to the low resource problem.
Traditional speech recognition systems employed magnitude based features for
building acoustic models. The phase of the speech signal is generally ignored, as the human
ear was traditionally considered to be indifferent to phase. Recent perceptual studies
have shown the importance of phase in human speech recognition. Motivated by this
fact, and in order to extract the maximum information from the limited transcribed data
available in low resource settings, we propose to extract features from the analytic
phase of speech signals for speech recognition. To avoid the phase wrapping
problem, instantaneous frequency is extracted from the speech signal without explicit
phase computation. Different instantaneous frequency estimation methods are studied
for providing effective features for speech recognition. Magnitude and phase based
features are used to train separate phone recognition systems. Combining magnitude
and phase based systems improves speech recognition in low resource settings and
noisy conditions.
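As a rough illustration of instantaneous frequency estimation without phase unwrapping, the sketch below builds the analytic signal with a naive DFT and then takes the angle of the product of each sample with the conjugate of its predecessor. This is one textbook estimator, and it is not necessarily among the methods studied in the thesis. The function names, the 800 Hz sampling rate and the 50 Hz test tone are all assumptions made for the example.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence, using a naive O(N^2) DFT.

    Illustrative only: a real system would use an FFT-based routine.
    """
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue                 # keep DC (and Nyquist for even n) unchanged
        elif k < (n + 1) // 2:
            spectrum[k] *= 2         # double the positive frequencies
        else:
            spectrum[k] = 0          # zero the negative frequencies
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def instantaneous_frequency(x, fs):
    """Per-sample frequency estimate in Hz (hypothetical helper).

    The angle of z[i+1] * conj(z[i]) is the phase increment between
    neighbouring samples, so the absolute phase is never computed or
    unwrapped. The estimate is valid while the increment stays below pi.
    """
    z = analytic_signal(x)
    return [
        fs * cmath.phase(z[i + 1] * z[i].conjugate()) / (2 * math.pi)
        for i in range(len(z) - 1)
    ]


# Assumed test signal: a 50 Hz cosine sampled at 800 Hz for 64 samples.
fs = 800.0
tone = [math.cos(2 * math.pi * 50.0 * t / fs) for t in range(64)]
freqs = instantaneous_frequency(tone, fs)
```

For this pure tone every entry of `freqs` should sit at about 50 Hz. For speech, such estimates would typically be computed per sub-band and summarised per frame before being used as features, which is outside the scope of this sketch.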
Inspired by the recent zero resource phenomenon in the speech community, the problem
of scarcity of transcribed data is addressed at a more fundamental level by producing
artificial or virtual transcriptions from speech signals alone. Motivated by infant
learning, zero resource speech processing aims at discovering acoustic word units from
the speech signal alone, without using any manual transcriptions or linguistic knowledge.
We propose an unsupervised speech signal to symbol transformation approach to obtain
virtual phones/labels from given speech signals. Syllable-like units, obtained from
multiple sources of evidence for vowel endpoint detection in speech signals, are presented as
alternative units to virtual phones for signal to symbol transformation.
Several speech applications are presented that employ these virtual phones or
syllable-like units for automatically transcribing speech data in zero resource
settings. Spoken term discovery and speaking rate estimation are achieved in zero
resource settings using the proposed methods. A completely unsupervised language
identification approach is proposed and is shown to perform close to the supervised
approach. Further, a virtual phone recognition/synthesis approach based on signal
to symbol transformation is proposed for ultra low bitrate coding. Future directions
are provided for improving low resource speech processing by employing automatic
labeling to obtain performance closer to that of supervised techniques.
IITH Creators: Kodukula, Sri Rama Murty (ORCiD: https://orcid.org/0000-0002-6355-5287)
Item Type: Thesis (PhD)
Uncontrolled Keywords: Acoustic segment modeling, Low resource, Speech recognition, Speech segmentation, Virtual phones, Zero resource, TD1574
Subjects: Electrical Engineering
Divisions: Department of Electrical Engineering
Depositing User: Team Library
Date Deposited: 21 Oct 2019 09:42
Last Modified: 21 Oct 2019 09:42
URI: http://raiithold.iith.ac.in/id/eprint/6699