Self-Supervised Phonotactic Representations for Language Identification

Ramesh, G. and Kumar, C. Shiva and Kodukula, Sri Rama Murty (2021) Self-Supervised Phonotactic Representations for Language Identification. In: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, 30 August 2021 through 3 September 2021, Brno.


Abstract

Phonotactic constraints characterize the sequence of permissible phoneme structures in a language and hence form an important cue for the language identification (LID) task. As phonotactic constraints span multiple phonemes, short-term spectral analysis (20-30 ms) alone is not sufficient to capture them. The speech signal has to be analyzed over longer contexts (hundreds of milliseconds) in order to extract features representing the phonotactic constraints. Supervised senone classifiers, aimed at modeling triphone context, have been used for extracting language-specific features for the LID task. However, it is difficult to get large amounts of manually labeled data to train the supervised models. In this work, we explore a self-supervised approach to extract long-term contextual features for the LID task. We use the wav2vec architecture to extract contextualized representations from multiple frames of the speech signal. The contextualized representations extracted from the pre-trained wav2vec model are used for the LID task. The performance of the proposed features is evaluated on a dataset containing 7 Indian languages. The proposed self-supervised embeddings achieved a 23% absolute improvement over the acoustic features and a 3% absolute improvement over their supervised counterparts. Copyright © 2021 ISCA.
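
The contrast between short-term analysis (20-30 ms) and longer contexts (hundreds of milliseconds) can be illustrated with a receptive-field calculation for a stack of 1-D convolutions. The sketch below is only an illustration: the layer settings (five encoder layers with kernel sizes 10, 8, 4, 4, 4 and strides 5, 4, 2, 2, 2, followed by nine context layers with kernel size 3 and stride 1) and the 16 kHz sampling rate are assumptions based on the configuration described for the original wav2vec model, not values stated in this abstract, and the authors' setup may differ.

```python
SAMPLE_RATE_HZ = 16_000  # assumed sampling rate, not given in the abstract

# (kernel_size, stride) per convolutional layer.
# Assumed values for illustration; the paper's configuration may differ.
ENCODER_LAYERS = [(10, 5), (8, 4), (4, 2), (4, 2), (4, 2)]
CONTEXT_LAYERS = [(3, 1)] * 9


def receptive_field(layers, rf=1, jump=1):
    """Return (receptive field, output stride), both in input samples.

    rf and jump describe the input to the first layer, so stacks can be
    chained by passing one stack's result into the next call.
    """
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra tap widens the span by one jump
        jump *= stride             # striding spreads outputs further apart
    return rf, jump


def to_ms(samples):
    """Convert a duration in samples to milliseconds."""
    return 1000.0 * samples / SAMPLE_RATE_HZ


enc_rf, enc_jump = receptive_field(ENCODER_LAYERS)
ctx_rf, ctx_jump = receptive_field(CONTEXT_LAYERS, enc_rf, enc_jump)

print(f"encoder frame: {to_ms(enc_rf):.1f} ms every {to_ms(enc_jump):.1f} ms")
print(f"context span:  {to_ms(ctx_rf):.1f} ms")
```

Under these assumed settings, each encoder output covers roughly one short-term analysis window, while each context-network output spans about 200 ms, which is the scale at which multi-phoneme patterns become visible.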


IITH Creators: