Publication Details

Phoneme Recognition from a Long Temporal Context

SCHWARZ Petr, MATĚJKA Pavel and ČERNOCKÝ Jan. Phoneme Recognition from a Long Temporal Context. In: poster at JOINT AMI/PASCAL/IM2/M4 Workshop on Multimodal Interaction and Related Machine Learning Algorithms. Martigny: Institute for Perceptual Artificial Intelligence, 2004, pp. 1-1.
Czech title
Fonémové Rozpoznávání z Mluvené Řeči
Type
conference paper
Language
english
Authors
Schwarz Petr, Ing., Ph.D. (DCGM FIT BUT)
Matějka Pavel, Ing. (UREL FEEC BUT)
Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT)
Keywords

phoneme recognition, feature extraction, speech recognition

Abstract

Phoneme Recognition from a Long Temporal Context

Annotation

We investigate techniques for acoustic modeling in automatic recognition of context-independent phoneme strings. The recognizer was evaluated on the TIMIT database.
The baseline phoneme recognizer is based on TempoRAl Patterns (TRAP).
It is an HMM-Neural Network (HMM/NN) hybrid.
Critical-band energies are obtained in the conventional way: the speech signal is divided into 25 ms frames with a 10 ms shift, and the Mel filter bank is emulated by triangular weighting of the FFT-derived short-term spectrum to obtain short-term critical-band logarithmic spectral densities.
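The front-end described above (25 ms frames, 10 ms shift, triangular Mel-spaced weighting of the FFT spectrum) can be sketched as follows. This is an illustrative implementation, not the authors' code; the sample rate, FFT size, and number of bands are assumptions for the example.

```python
import numpy as np

def critical_band_log_energies(signal, fs=8000, n_bands=15, n_fft=256):
    """Short-term critical-band log spectral densities (illustrative sketch).

    25 ms frames with 10 ms shift; the Mel filter bank is emulated by
    triangular weighting of the FFT-derived power spectrum, as described
    in the annotation. fs, n_bands and n_fft are assumed values.
    """
    frame_len = int(0.025 * fs)          # 25 ms frame
    shift = int(0.010 * fs)              # 10 ms shift

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Mel-spaced triangular filters over the FFT bins
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_bands + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)
    fbank = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(n_bands):
        lo, mid, hi = bin_edges[b], bin_edges[b + 1], bin_edges[b + 2]
        for k in range(lo, mid):
            fbank[b, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[b, k] = (hi - k) / max(hi - mid, 1)

    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        frames.append(np.log(fbank @ power + 1e-10))
    return np.array(frames)              # shape: (n_frames, n_bands)
```

The output is a matrix of log energies, one row per 10 ms frame and one column per critical band, which is the representation the TRAP extraction below operates on.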
The TRAP feature vector describes a segment of the temporal evolution of the critical-band spectral density within a single critical band. The central point is the current frame, with an equal number of frames in the past and in the future; the length can vary. This vector forms the input to a classifier.
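The per-band temporal trajectories just described can be sketched as below. The context length of 50 frames on each side (101 points, roughly one second at a 10 ms shift) is an illustrative assumption; the annotation notes only that the length can vary.

```python
import numpy as np

def trap_vectors(band_energies, context=50):
    """Extract TRAP vectors from a (n_frames, n_bands) log-energy matrix.

    For each frame and each critical band, take the trajectory of that
    band's log energy over `context` frames on each side of the center
    frame (2*context + 1 points total). Edge frames are handled by
    repeating the first/last frame (an assumed border strategy).
    """
    n_frames, n_bands = band_energies.shape
    vecs = np.zeros((n_frames, n_bands, 2 * context + 1))
    padded = np.pad(band_energies, ((context, context), (0, 0)), mode='edge')
    for t in range(n_frames):
        # one temporal trajectory per band, centered on frame t
        vecs[t] = padded[t:t + 2 * context + 1].T
    return vecs
```

Each `vecs[t, b]` is the input vector for band classifier `b` at frame `t`.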
Outputs of the classifier are posterior probabilities of the sub-word classes we want to distinguish among. In our case, such classes are context-independent phonemes or their parts (states). Such a classifier is applied in each critical band. The merger is another classifier whose function is to combine the band-classifier outputs into one. Both the band classifiers and the merger are neural nets.
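The two-stage architecture (one neural-net classifier per critical band, plus a merger net over their concatenated outputs) can be sketched as follows. The weights here are random placeholders, so the outputs are meaningless; the layer sizes (15 bands, 101-point TRAPs, 39 phoneme classes) are assumptions for illustration, and a real system would train both stages on phoneme-labelled data.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    """One hidden layer with softmax output: the classifier form used for
    both the band classifiers and the merger in this sketch."""
    h = np.tanh(x @ W1 + b1)
    z = h @ W2 + b2
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_bands, trap_len, n_hidden, n_phonemes = 15, 101, 20, 39

# one band classifier per critical band (random placeholder weights)
band_params = [
    (rng.normal(size=(trap_len, n_hidden)), np.zeros(n_hidden),
     rng.normal(size=(n_hidden, n_phonemes)), np.zeros(n_phonemes))
    for _ in range(n_bands)
]
# merger: concatenated band posteriors -> final phoneme posteriors
merger = (rng.normal(size=(n_bands * n_phonemes, n_hidden)), np.zeros(n_hidden),
          rng.normal(size=(n_hidden, n_phonemes)), np.zeros(n_phonemes))

traps = rng.normal(size=(n_bands, trap_len))      # TRAP vectors for one frame
band_out = np.concatenate([mlp(traps[b], *band_params[b])
                           for b in range(n_bands)])
posteriors = mlp(band_out, *merger)               # one distribution per frame
```

The merger's output is a single posterior distribution over phoneme classes for the center frame.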
The described techniques yield phoneme probabilities for the center frame. These phoneme probabilities are then fed into a Viterbi decoder, which produces phoneme strings.
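The decoding step can be sketched as a standard Viterbi search over the per-frame log posteriors. This is a minimal illustration, not the authors' decoder; in hybrid HMM/NN systems the posteriors are typically divided by class priors to obtain scaled likelihoods first, which this sketch omits.

```python
import numpy as np

def viterbi_decode(log_post, log_trans):
    """Most likely state sequence from per-frame log posteriors.

    log_post: (n_frames, n_states) log probabilities from the merger.
    log_trans: (n_states, n_states) log transition probabilities.
    Returns the decoded path with repeated states collapsed, i.e. a
    phoneme string as a list of class indices.
    """
    n_frames, n_states = log_post.shape
    delta = np.full((n_frames, n_states), -np.inf)
    back = np.zeros((n_frames, n_states), dtype=int)
    delta[0] = log_post[0]
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + log_trans     # (from, to)
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(n_states)] + log_post[t]
    # backtrace from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    # collapse runs of identical states into a phoneme string
    return [path[0]] + [s for a, s in zip(path, path[1:]) if s != a]
```

A bigram language model, as evaluated in the paper, would enter this search through the transition scores between phonemes.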
The recognizer was further simplified and optimized to shorten processing times and reduce computational requirements. This simplification and optimization reduced the phoneme error rate (PER) by about 1.8% absolute.
More precise modeling was achieved by splitting phonemes into three parts (states), which improved the system by 0.9% absolute. Separate modeling of the left and right phoneme contexts gave a further 0.38% with one-state models; finer modeling of these left and right contexts with three states led to a 3.76% improvement. Bigram language models were also incorporated into the system and evaluated.
All modifications led to a faster system with about 23.6% relative, or 6.84% absolute, improvement in phoneme error rate over the baseline.
Work is in progress on porting this recognizer to the meeting-data domain. The recognizer will serve as one of the front-ends for acoustic event spotting (the Brno task within AMI).

Published
2004
Pages
1-1
Proceedings
poster at JOINT AMI/PASCAL/IM2/M4 Workshop on Multimodal Interaction and Related Machine Learning Algorithms
Conference
Joint AMI/PASCAL/IM2/M4 workshop, Martigny, CH
Publisher
Institute for Perceptual Artificial Intelligence
Place
Martigny, CH
BibTeX
@INPROCEEDINGS{FITPUB7648,
   author = "Petr Schwarz and Pavel Mat\v{e}jka and Jan \v{C}ernock\'{y}",
   title = "Phoneme Recognition from a Long Temporal Context",
   pages = "1--1",
   booktitle = "poster at JOINT AMI/PASCAL/IM2/M4 Workshop on Multimodal Interaction and Related Machine Learning Algorithms",
   year = 2004,
   location = "Martigny, CH",
   publisher = "Institute for Perceptual Artificial Intelligence",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/7648"
}