Unsupervised Acoustic Segmentation and Clustering Using Siamese Network Embeddings

Bhati, Saurabhchand and Nayak, Shekhar and Kodukula, Sri Rama Murty and Dehak, Najim (2019) Unsupervised Acoustic Segmentation and Clustering Using Siamese Network Embeddings. In: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH, 15-19 September 2019, Graz, Austria.

Full text not available from this repository. (Request a copy)

Abstract

Unsupervised discovery of acoustic units from the raw speech signal forms the core objective of zero-resource speech processing. It involves identifying the acoustic segment boundaries and consistently assigning unique labels to acoustically similar segments. In this work, the possible candidates for segment boundaries are identified in an unsupervised manner from the kernel Gram matrix computed from the Mel-frequency cepstral coefficients (MFCC). These segment boundary candidates are used to train a siamese network, that is intended to learn embeddings that minimize intrasegment distances and maximize the intersegment distances. The siamese embeddings capture phonetic information from longer contexts of the speech signal and enhance the intersegment discriminability. These properties make the siamese embeddings better suited for acoustic segmentation and clustering than the raw MFCC features. The Gram matrix computed from the siamese embeddings provides unambiguous evidence for boundary locations. The initial candidate boundaries are refined using this evidence, and siamese embeddings are extracted for the new acoustic segments. A graph growing approach is used to cluster the siamese embeddings, and a unique label is assigned to acoustically similar segments. The performance of the proposed method for acoustic segmentation and clustering is evaluated on Zero Resource 2017 database.

[error in script]

IITH Creators: