Title:Digital Speech Processing

Code:CZR
Ac.Year:ended 2005/2006
Sem:Summer
Language of Instruction:Czech
Credits:5
Completion:examination
Type of instruction (hours per semester):
  Lectures: 26
  Seminar exercises: 2
  Laboratory exercises: 0
  Computer exercises: 12
  Other: 12
Points:
  Exams: 50
  Tests: 25
  Exercises: 0
  Laboratories: 12
  Other: 13
Guarantor:Černocký Jan, doc. Dr. Ing. (DCGM)
Lecturer:Černocký Jan, doc. Dr. Ing. (DCGM)
Faculty:Faculty of Information Technology BUT
Department:Department of Computer Graphics and Multimedia FIT BUT
 
Learning objectives:
  To provide students with knowledge of the basic characteristics of the speech signal in relation to human speech production and hearing. To describe basic speech analysis algorithms common to many applications. To give an overview of applications (recognition, synthesis, coding) and to cover practical aspects of implementing speech algorithms.
Description:
  Applications of speech processing, digital processing of speech signals, production and perception of speech, introduction to phonetics, pre-processing and basic parameters of speech, linear-predictive model, cepstrum, fundamental frequency estimation, coding (time domain and vocoders), recognition (DTW and HMM), synthesis. Software and libraries for speech processing.
Knowledge and skills required for the course:
  Basic knowledge of signal processing.
Subject specific learning outcomes and competencies:
  Students will become familiar with the principal methods and algorithms of speech signal processing. They will be able to design a simple speech processing system (a speech activity detector, a recognizer of a limited number of isolated words), including its implementation in application programs.
Generic learning outcomes and competencies:
  The students will deepen their knowledge of signal processing. They will acquire new skills in the mathematical and visualization software Matlab and in the practical use of C/C++. During the projects, they will become acquainted with independent development work.
Syllabus of lectures:
 
  1. Organization of the course, applications, sciences related to the topic, information carried by speech, demonstrations.
  2. Digital processing of speech signals: recording - sampling, quantization. Speech spectra - continuous Fourier transform; what do we get when we sample. Discrete Fourier transform. Random signals, power spectral density. Modification of speech - linear filters. Frequency response of a filter.
  3. Pre-processing of speech: dc removal, preemphasis, frames, basic parameters. Spectrogram. Speech production: articulatory organs - vocal cords and vocal tract vs. excitation and filter. Characteristics in time and frequency, influence of excitation and filter. What can be seen on long- and short-term spectrograms. How to separate excitation and filter: cepstrum, MFCC.
  4. Linear-predictive model: what is it good for? Separation of vocal tract characteristics from excitation - applications in coding and recognition. Prediction of a sample from past samples - linear prediction (LP). Error of LP. Obtaining the error using a single filter. Determination of vocal tract characteristics using LP analysis. Spectrum estimated by LP. Features derived from LP - LAR and LSF. LPC-cepstrum.
  5. Determination of fundamental frequency (F0). Terminology. Characteristics of F0 of males, females and children. Use in speech processing systems. Methods based on the autocorrelation function. NCCF. Long-term predictor and cepstral analysis for F0 determination. Reliability and problems of F0 detectors.
  6. Coding I.: Aims of coding. Bit-rate, objective and subjective measurements of quality. Classification of coders according to bit-rate. Waveform coders. Vocoders - LPC. Vector quantization in speech coding.
  7. Coding II. - CELP, Coding in GSM networks: GSM, GSM-EFR, GSM-HR, Voice over IP. Introduction to speech recognition - the task, classification of recognizers: isolated words - connected words - continuous speech, speaker dependent - speaker independent. Basic function blocks. Voice activity detection (VAD) for isolated words.
  8. Recognition using DTW. Recognition based on distance of speech frames - various definitions of distance. Timing: linear modification, dynamic programming (Dynamic Time Warping, DTW). Hidden Markov models (HMM I.): Introduction, motivations and relation to DTW. Structure of the model, Gaussian distributions, state sequences.
  9. HMM II.: Probability of a sequence of states, Baum-Welch and Viterbi probabilities. Training of models: Baum-Welch, recognition: Viterbi. Token passing. Connected words.
  10. HMM III. Continuous speech with large vocabulary: recognition of small units - phonemes... Phonetics: vowels and consonants, characteristics, classification of phonemes. International phoneme alphabets: IPA, SAMPA, TIMIT. Co-articulation. Applications in recognition: context-dependent triphones. Large vocabulary, Language modeling, lattice rescoring, forced alignment [Martin Karafiát].
  11. Features for recognition [Lukáš Burget, Petr Schwarz, Pavel Matějka]. What do we need: suppression of pitch, de-correlation, link with spectral envelope. How do we reach it: LPCC, MFCC, de-correlation: PCA, LDA, HLDA, channel robustness: normalization. Further tricks with features - delta, delta-delta. "Hot topics" in feature extraction: TRAPs and FeatureNet, neural nets. Tools for speech processing.
  12. Speech synthesis: structure of the synthesizer. Conversion of written text to speech: text-to-speech. Text normalization. Prosody (melody, accents, timing) in synthesis. Units for synthesis - manual and automatic selection, corpus-based synthesis. Generation of signal in time and frequency domains: PSOLA and HNM. Applications, SW for synthesis: EPOS, MBROLA, Festival.
  13. Further topics in speech processing:
    • speaker identification/verification (principles, false acceptance, false rejection, cost function, optimal operating point, EER) [Černocký]
    • Phoneme recognition [Petr Schwarz, Petr Jenderka]
    • LVCSR [Martin Karafiát]
    • Recognizer merging [Lukáš Burget]
    • Very Low Bit Rate coding [Petr Motlíček, Černocký]
    • audio-video recognition [Petr Motlíček]
    • speech databases [Černocký].
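As an illustration of the linear-prediction topic in lecture 4, the following is a minimal sketch of the autocorrelation method solved by the Levinson-Durbin recursion. It is not taken from the course materials: the function names `autocorrelation` and `levinson_durbin` are hypothetical, and Python is used here only for brevity (the course labs themselves use Matlab and C).

```python
def autocorrelation(frame, order):
    """Return autocorrelation lags r[0..order] of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations by the Levinson-Durbin recursion.

    Returns (a, err), where a[0] == 1 and the prediction-error filter is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order, and err is the
    energy of the prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            # Silent or perfectly predictable frame: stop early
            # (assumption made for this sketch to avoid dividing by zero).
            break
        # Reflection coefficient for stage i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        # Update the predictor coefficients of order i.
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err
```

For a frame already cut and windowed, `levinson_durbin(autocorrelation(frame, 10), 10)` would give a 10th-order model; the order 10 is only an example value.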
Syllabus of numerical exercises:
 Numerical exercise 3hrs: digital filter, LPC, DTW, HMM, spectrogram reading.
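A DTW computation of the kind practised in the numerical exercise can be sketched as follows. This is only an assumed, basic variant (Euclidean frame distance, three allowed predecessor moves with equal weights, no path-length normalization); the course may use different local constraints. The name `dtw_distance` is hypothetical.

```python
import math


def dtw_distance(ref, test):
    """Accumulated DTW distance between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(ref), len(test)
    # d[i][j] = cost of the best path aligning ref[:i] with test[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames.
            local = math.sqrt(sum((x - y) ** 2
                                  for x, y in zip(ref[i - 1], test[j - 1])))
            # Best predecessor: vertical, horizontal or diagonal move.
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

In an isolated-word recognizer, the test utterance would be compared with every stored reference and the word with the smallest distance chosen.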
Syllabus of laboratory exercises:
 
  1. Speech processing in Matlab: reading/writing of speech files, basic operations, recording of speech.
  2. Signal processing in Matlab: design of filter, poles, zeros, frequency response, filtering, spectral analysis: FT, PSD.
  3. Speech in C - class for input of speech. PROJECT 1: Simple frequency analyzer using FFT (will be supplied), output using ASCII characters, height of a column corresponds to energy in a frequency band.
  4. LPC in C: Correlation, Levinson and Durbin, short-term energy. Check with Matlab on a speech file. Preparation for coding - storing to well-defined structure.
  5. NCCF and fundamental frequency detection, Matlab, C. Threshold determination. Storing to the structure. Advanced: median smoothing of estimates.
  6. PROJECT 2: full LPC coder and decoder in C (without quantization of parameters). Advanced: on-line speech output using OSS (self-study).
  7. Preparation for recognition: LPCC, voice activity detection, storing of speech files (samples and features for training of HMM and as references for DTW), preparation for the calling of recognizer.
  8. PROJECT 3: full on-line recognizer based on DTW.
  9. HTK - creation of a small database of numerals, work with HMMs in HTK: prototypes, training, recognition, evaluation. Models must be stored (they will be needed for project No. 4).
  10. Preparation for HMM recognition: experience with the decoder written by Lukáš Burget - reading of models. A function for MFCC computation will be supplied, check with HTK.
  11. PROJECT 4: HMM recognizer: writing of code for computation of output probability and Viterbi decoder using token-passing. Interfacing with voice-activity detector. Advanced: multi-threading (first thread records, second extracts features, third performs VAD, fourth recognizes).
  12. Synthesis: database with phone-labels available, synthesis from text using concatenation. Advanced: use of HNM synthesis.
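The NCCF-based fundamental frequency detection from lab 5 might look roughly like the sketch below. The function names, the default search range of 50 to 400 Hz and the voicing threshold of 0.7 are assumptions chosen for illustration, not values prescribed by the course; Python is used only for brevity.

```python
import math


def nccf(signal, start, frame_len, lag):
    """Normalized cross-correlation of a frame with its copy shifted by lag."""
    a = signal[start:start + frame_len]
    b = signal[start + lag:start + lag + frame_len]
    if len(a) < frame_len or len(b) < frame_len:
        raise ValueError("signal too short for this frame and lag")
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0


def estimate_f0(signal, fs, start=0, frame_len=240,
                f_min=50.0, f_max=400.0, threshold=0.7):
    """Return F0 in Hz for one frame, or 0.0 if the frame looks unvoiced."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    # Pick the lag with the highest NCCF value in the allowed range.
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: nccf(signal, start, frame_len, lag))
    if nccf(signal, start, frame_len, best_lag) < threshold:
        return 0.0  # below the (assumed) voicing threshold
    return fs / best_lag
```

Because the NCCF of a periodic signal peaks at every multiple of the period, a wide lag range can produce halving errors; this is one reason for the median smoothing mentioned as the advanced part of the lab.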
Syllabus of computer exercises:
 
  1. Speech processing in Matlab: reading/writing of speech files, basic operations, recording of speech.
  2. Signal processing in Matlab: design of filter, poles, zeros, frequency response, filtering, spectral analysis: FT, PSD.
  3. Speech in C - class for input of speech. PROJECT 1: Simple frequency analyzer using FFT (will be supplied), output using ASCII characters, height of a column corresponds to energy in a frequency band.
  4. LPC in C: Correlation, Levinson and Durbin, short-term energy. Check with Matlab on a speech file. Preparation for coding - storing to well-defined structure.
  5. NCCF and fundamental frequency detection, Matlab, C. Threshold determination. Storing to the structure. Advanced: median smoothing of estimates.
  6. PROJECT 2: full LPC coder and decoder in C (without quantization of parameters). Advanced: on-line speech output using OSS (self-study).
  7. Preparation for recognition: LPCC, voice activity detection, storing of speech files (samples and features for training of HMM and as references for DTW), preparation for the calling of recognizer.
  8. PROJECT 3: full on-line recognizer based on DTW.
  9. HTK - creation of a small database of numerals, work with HMMs in HTK: prototypes, training, recognition, evaluation. Models must be stored (they will be needed for project No. 4).
  10. Preparation for HMM recognition: experience with the decoder written by Lukáš Burget - reading of models. A function for MFCC computation will be supplied, check with HTK.
  11. PROJECT 4: HMM recognizer: writing of code for computation of output probability and Viterbi decoder using token-passing. Interfacing with voice-activity detector. Advanced: multi-threading (first thread records, second extracts features, third performs VAD, fourth recognizes).
  12. Synthesis: database with phone-labels available, synthesis from text using concatenation. Advanced: use of HNM synthesis.
Syllabus - others, projects and individual work of students:
 see the program of computer labs.
Fundamental literature:
 
  1. Psutka, J.: Komunikace s počítačem mluvenou řečí. Academia, Praha, 1995. (in Czech, available in FIT library).
  2. Gold, B., Morgan, N.: Speech and audio signal processing, John Wiley and Sons, 2000. (available in FIT library).
  3. Young, S., Jansen, J., Odell, J., Ollason, D., Woodland, P.: The HTK Book, Entropic Cambridge Research Laboratory, 1996, Cambridge, UK. Excellent introduction to HMMs, free download at http://htk.eng.cam.ac.uk/
  4. http://www.fit.vutbr.cz/~cernocky/speech/ - lecture notes, labs, functions. This page's going to grow...
  5. http://www.fit.vutbr.cz/~cernocky/oldspeech/ - lecture notes, labs, functions. Old version, but especially some labs (everything in Matlab) might be interesting.
Study literature:
 
  • Krčmová, N.: Fonetika a fonologie: zvuková stavba současné češtiny. ISBN 80-210-0137-2. Masarykova univerzita, Brno, 1990
  • Rabiner, L., Juang, B.H.: Fundamentals of speech recognition, Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1993
  • Rabiner, L.R., Schafer, R.W.: Digital processing of speech signals, Prentice Hall, 1978
Links:
 http://www.fit.vutbr.cz/~cernocky/speech/
Progress assessment:
  
  1. 4 projects at 8 pts. each - 32
  2. mid-semestral exam - theoretical questions only - 18
  3. semestral exam - theory and numerical examples - 50
  • All materials are permitted in both exams.
  • Projects: for each project, software and short documentation (how to compile, how to run, which algorithms are used) should be supplied.
Passing boundary for ECTS assessment - 50 points
 
