Publication Details

Phoneme Recognition from a Long Temporal Context

SCHWARZ Petr, MATĚJKA Pavel and ČERNOCKÝ Jan. Phoneme Recognition from a Long Temporal Context. In: poster at JOINT AMI/PASCAL/IM2/M4 Workshop on Multimodal Interaction and Related Machine Learning Algorithms. Martigny: Institute for Perceptual Artificial Intelligence, 2004, pp. 1-1.

Czech title

Fonémové Rozpoznávání z Mluvené Řeči

Type

conference paper

Language

English

Authors

Schwarz Petr, Ing., Ph.D. (DCGM FIT BUT)
Matějka Pavel, Ing. (UREL FEEC BUT)
Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT)

URL

http://www.fit.vutbr.cz/~matejkap/publi/2004/ami2004.pdf PDF

Keywords

phoneme recognition, feature extraction, speech recognition

Abstract

Phoneme Recognition from a Long Temporal Context

Annotation

We investigate techniques for acoustic modeling in automatic recognition of context-independent phoneme strings. The recognizer was evaluated on the TIMIT database.
The baseline phoneme recognizer is based on TempoRAl Patterns (TRAP).
It is an HMM/neural network (HMM/NN) hybrid.
Critical-band energies are obtained in the conventional way. The speech signal is divided into 25 ms frames with a 10 ms shift. The Mel filter bank is emulated by triangular weighting of the FFT-derived short-term spectrum to obtain short-term critical-band logarithmic spectral densities.
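The front-end described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the 25 ms frame length and 10 ms shift come from the text. The 16 kHz sampling rate matches TIMIT, while the Hamming window, the 512-point FFT, the 23 bands and the particular Mel formula are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # Common Mel-scale formula (an assumption; the text does not name one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def frame_signal(x, sr, frame_ms=25.0, shift_ms=10.0):
    # Cut the signal into overlapping frames (25 ms long, 10 ms shift).
    frame_len = int(round(sr * frame_ms / 1000.0))
    shift = int(round(sr * shift_ms / 1000.0))
    n_frames = 1 + (len(x) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return x[idx]

def mel_filterbank(n_bands, n_fft, sr):
    # Triangular weights over the FFT bins, with band edges equally spaced in Mel.
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(n_bands):
        lo, mid, hi = edges_hz[b], edges_hz[b + 1], edges_hz[b + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_band_energies(x, sr=16000, n_bands=23, n_fft=512):
    # Returns an array of shape (n_frames, n_bands) with log band energies.
    frames = frame_signal(x, sr)
    frames = frames * np.hamming(frames.shape[1])  # window choice is an assumption
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power @ mel_filterbank(n_bands, n_fft, sr).T + 1e-10)
```

With these settings, one second of 16 kHz audio gives 98 frames of 23 log energies each.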
A TRAP feature vector describes a segment of the temporal evolution of critical-band spectral densities within a single critical band. The central point is the current frame, with an equal number of frames in the past and in the future.
The length of the segment can vary. This vector forms the input to a classifier.
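Building such vectors from a matrix of log band energies could look like the sketch below. The function name `trap_vectors`, the default context of 15 frames on each side, and the edge padding at utterance boundaries are all assumptions for illustration; the text only states that the segment is symmetric around the current frame and that its length can vary.

```python
import numpy as np

def trap_vectors(log_energies, context=15):
    """Build one temporal vector per frame and per band.

    log_energies: array of shape (n_frames, n_bands).
    Returns an array of shape (n_frames, n_bands, 2 * context + 1), where
    element [t, b, context] is the current frame t in band b.
    """
    n_frames, _ = log_energies.shape
    # Repeat the first and last frames so that every frame has a full context
    # (the boundary handling is an assumption).
    padded = np.pad(log_energies, ((context, context), (0, 0)), mode="edge")
    length = 2 * context + 1
    idx = np.arange(n_frames)[:, None] + np.arange(length)[None, :]
    traps = padded[idx]               # (n_frames, length, n_bands)
    return traps.transpose(0, 2, 1)   # (n_frames, n_bands, length)
```

Each slice `traps[:, b, :]` then holds the inputs for the classifier of band `b`.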
The outputs of the classifier are posterior probabilities of the sub-word classes we want to distinguish. In our case, these classes are context-independent phonemes or their parts (states). One such classifier is applied in each critical band. The merger is another classifier whose function is to combine the band classifier outputs into one.
Both the band classifiers and the merger are neural networks.
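The data flow through the band classifiers and the merger can be sketched as below. This shows only the structure: the weights are random placeholders rather than trained networks, and the `MLP` class, its hidden-layer size, the tanh nonlinearity and the choice to feed concatenated band posteriors to the merger are assumptions, not details given in the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class MLP:
    """Tiny one-hidden-layer network with random (untrained) weights."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        self.w1 = rng.standard_normal((n_in, n_hidden)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.standard_normal((n_hidden, n_out)) * 0.1
        self.b2 = np.zeros(n_out)

    def __call__(self, x):
        h = np.tanh(x @ self.w1 + self.b1)
        return softmax(h @ self.w2 + self.b2)  # class posteriors per frame

def trap_posteriors(traps, band_nets, merger):
    # traps: (n_frames, n_bands, length); one classifier per critical band.
    band_out = [net(traps[:, b, :]) for b, net in enumerate(band_nets)]
    # The merger sees the outputs of all band classifiers for the same frame.
    return merger(np.concatenate(band_out, axis=1))
```

The result has one row of class posteriors per frame, which is the form a decoder expects.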
The described techniques yield phoneme probabilities for the center frame. These phoneme probabilities are then fed into a Viterbi decoder, which produces phoneme strings.
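A minimal sketch of the decoding step is given below, assuming one HMM state per phoneme and a fixed log transition matrix. The helper names `viterbi` and `collapse` are hypothetical, and a real hybrid decoder would typically also scale posteriors by class priors and apply the language model, which this sketch leaves out.

```python
import numpy as np

def viterbi(log_post, log_trans):
    """Most likely state sequence.

    log_post: (n_frames, n_states) log posteriors per frame.
    log_trans: (n_states, n_states) log transition scores, from row to column.
    """
    n_frames, n_states = log_post.shape
    score = log_post[0].copy()
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans      # cand[i, j]: arrive in j from i
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_post[t]
    # Trace the best path backwards from the best final state.
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def collapse(path, labels):
    # Merge runs of the same state into a single phoneme label.
    out = []
    for s in path:
        if not out or out[-1] != labels[s]:
            out.append(labels[s])
    return out
```

Because staying in a state is cheaper than switching, a single frame with a misleading posterior does not by itself create a spurious phoneme in the output string.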
This recognizer is further simplified and optimized to shorten processing times and reduce computational requirements. This simplification and optimization reduces the phoneme error rate (PER) by about 1.8% absolute.
More precise modeling was achieved by splitting phonemes into 3 parts (states). This improved the system by 0.9% absolute. Separate modeling of the left and right phoneme contexts gave 0.38% in the case of one-state models. Finer modeling of these left and right contexts with three states led to an improvement of 3.76%. Bi-gram language models are also incorporated into the system and evaluated.
All modifications lead to a faster system with an improvement of about 23.6% relative, or 6.84% absolute, in phoneme error rate over the baseline.
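The reported figures can be cross-checked with a few lines of arithmetic. The four individual gains sum to the stated 6.84% absolute. The baseline and final error rates below are not stated in the annotation; they are only what the rounded relative and absolute figures imply, so treat them as approximate.

```python
# Absolute PER reductions quoted in the annotation (percentage points).
gains = {
    "simplification and optimization": 1.8,
    "three states per phoneme": 0.9,
    "left/right context, one state": 0.38,
    "left/right context, three states": 3.76,
}

total_absolute = sum(gains.values())          # about 6.84
relative = 0.236                              # 23.6% relative improvement

# Implied values (derived, not reported): relative = absolute / baseline.
implied_baseline = total_absolute / relative  # roughly 29% PER
implied_final = implied_baseline - total_absolute

print(f"total absolute gain: {total_absolute:.2f}")
print(f"implied baseline PER: {implied_baseline:.1f}")
print(f"implied final PER: {implied_final:.1f}")
```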
Work is in progress on porting this recognizer to the meeting data domain. The recognizer will serve as one of the front-ends for acoustic event spotting (the task of Brno within AMI).

Published

2004

Pages

1-1

Proceedings

poster at JOINT AMI/PASCAL/IM2/M4 Workshop on Multimodal Interaction and Related Machine Learning Algorithms

Conference

Joint AMI/PASCAL/IM2/M4 workshop, Martigny, CH

Publisher

Institute for Perceptual Artificial Intelligence

Place

Martigny, CH

BibTeX

@INPROCEEDINGS{FITPUB7648,
   author = "Petr Schwarz and Pavel Mat\v{e}jka and Jan \v{C}ernock\'{y}",
   title = "Phoneme Recognition from a Long Temporal Context",
   pages = "1--1",
   booktitle = "poster at JOINT AMI/PASCAL/IM2/M4 Workshop on Multimodal Interaction and Related Machine Learning Algorithms",
   year = 2004,
   location = "Martigny, CH",
   publisher = "Institute for Perceptual Artificial Intelligence",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/7648"
}