Rout, K
(2014)
Spoken Term Detection in Continuous Speech.
Masters thesis, Indian Institute of Technology Hyderabad.
Full text not available from this repository.
Abstract
This thesis addresses speaker-independent spoken term detection (STD) using a supervised technique.
The goal of STD is to retrieve occurrences of a user-spoken term from a given speech database.
A multi-layer perceptron (MLP) is trained in a supervised manner using labeled speech data from a large number of speakers.
The trained MLP is used to generate phoneme posterior features, i.e., the
conditional probability of each phoneme for every frame in the speech utterance. The dimension
of the posterior feature depends on the number of phoneme classes considered during training.
The sequence of posterior features obtained from the test utterance is matched with that obtained
from the query word using subsequence dynamic time warping (subDTW). The distance along the best-aligned
path is used to decide on the presence or absence of the query word in the given test
utterance. The performance of the proposed method is evaluated on a Telugu broadcast news database
collected from several television channels. It is observed that the performance of posterior features is
significantly better than that of conventional mel-frequency cepstral coefficient (MFCC) features.
A comparative study of supervised and unsupervised techniques is carried out. Supervised
methods such as the MLP significantly improve STD accuracy over unsupervised methods such as the
Gaussian mixture model (GMM). Effects of two kinds of query words are
analyzed: those recorded in isolation and those cut out from continuous speech. As the duration
of the phonemes in the query word varies greatly between these two modes, the sequence matching
technique subDTW plays an important role in finding the true hits. This can be achieved by using
different local weights in subDTW for the different recording modes. Experiments are conducted with
query words recorded in isolation and with words cut out from continuous speech.
It is found that detection of isolated queries performs worse than detection of queries cut out of
continuous speech, owing to channel mismatch and lack of disparities in terms of number of
frames.
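
The matching step described in the abstract can be illustrated with a minimal sketch of subsequence DTW over posterior features. This is not the thesis implementation: the local distance (negative log of the inner product of two posterior vectors), the three-step path constraint, the `weights` tuple standing in for the "local weights", the normalization of the score by query length, and all function names are assumptions made for illustration.

```python
import numpy as np


def local_distance(query, test, eps=1e-10):
    """Pairwise -log(inner product) between posterior vectors (an assumed choice)."""
    return -np.log(np.clip(query @ test.T, eps, None))


def subsequence_dtw(query, test, weights=(1.0, 1.0, 1.0)):
    """Find the subsequence of `test` that best aligns with `query`.

    query: (m, d) array of phoneme posteriors for the query word.
    test:  (n, d) array of phoneme posteriors for the test utterance.
    weights: hypothetical local weights for the (diagonal, query-advance,
             test-advance) steps; different values could be tried for
             isolated and cut-out queries.
    Returns (score, start, end), where score is the accumulated distance
    along the best path divided by the query length, and start/end are
    frame indices in `test`.
    """
    w_diag, w_query, w_test = weights
    cost = local_distance(query, test)
    m, n = cost.shape

    acc = np.full((m, n), np.inf)        # accumulated distance
    start = np.zeros((m, n), dtype=int)  # start frame of the path reaching each cell

    # The path may begin at any test frame without penalty.
    acc[0, :] = cost[0, :]
    start[0, :] = np.arange(n)

    for i in range(1, m):
        # First column: only the query-advance step is possible.
        acc[i, 0] = acc[i - 1, 0] + w_query * cost[i, 0]
        start[i, 0] = start[i - 1, 0]
        for j in range(1, n):
            candidates = (
                acc[i - 1, j - 1] + w_diag * cost[i, j],
                acc[i - 1, j] + w_query * cost[i, j],
                acc[i, j - 1] + w_test * cost[i, j],
            )
            origins = (start[i - 1, j - 1], start[i - 1, j], start[i, j - 1])
            k = int(np.argmin(candidates))
            acc[i, j] = candidates[k]
            start[i, j] = origins[k]

    # The path may end at any test frame; pick the cheapest ending.
    end = int(np.argmin(acc[-1, :]))
    score = float(acc[-1, end] / m)
    return score, int(start[-1, end]), end
```

Under these assumptions, a detection decision would compare the returned score with a threshold tuned on held-out data: a low score suggests that the query occurs between the `start` and `end` frames of the test utterance.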