Conference paper

RATH Shakti P., POVEY Daniel, VESELÝ Karel and ČERNOCKÝ Jan. Improved Feature Processing for Deep Neural Networks. In: Proceedings of Interspeech 2013. Lyon: International Speech Communication Association, 2013, pp. 109-113. ISBN 978-1-62993-443-3. ISSN 2308-457X.
Publication language:English
Original title:Improved Feature Processing for Deep Neural Networks
Title (cs):Zlepšené zpracování příznaků pro hluboké neuronové sítě
Pages:109-113
Proceedings:Proceedings of Interspeech 2013
Conference:Interspeech 2013
Place:Lyon, FR
Year:2013
ISBN:978-1-62993-443-3
Journal:Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), No. 8, Lyon, FR
ISSN:2308-457X
Publisher:International Speech Communication Association
URL:http://www.fit.vutbr.cz/research/groups/speech/publi/2013/rath2_interspeech2013_IS130300.pdf [PDF]
Keywords
speech recognition, speaker recognition, neural networks, speaker adaptation
Annotation
In this paper, we explore various methods of providing higher-dimensional features to DNNs, while still applying speaker adaptation with fMLLR of low dimensionality.
Abstract
In this paper, we investigate alternative ways of processing MFCC-based features for use as the input to Deep Neural Networks (DNNs). Our baseline is a conventional feature pipeline that splices the 13-dimensional front-end MFCCs across 9 frames, applies LDA to reduce the dimension to 40, and then decorrelates the result further using MLLT. Confirming the results of other groups, we show that speaker adaptation applied on top of these features using feature-space MLLR (fMLLR) is helpful. The fact that the number of parameters of a DNN is not strongly sensitive to the input feature dimension (unlike GMM-based systems) motivated us to investigate ways to increase the dimension of the features. We investigate several approaches to deriving higher-dimensional features and verify their performance with DNNs. Our best result is obtained by splicing our baseline 40-dimensional speaker-adapted features again across 9 frames and then reducing the dimension to 200 or 300 using another LDA. Our final result is about 3% absolute better than our best GMM system, which is a discriminatively trained model.
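The dimension bookkeeping in the abstract can be followed with a minimal sketch. This is not the authors' implementation: the random matrices are placeholders standing in for the estimated LDA, MLLT and fMLLR transforms, the function names (splice, project, random_matrix) are hypothetical, and the 9-frame window is assumed to be symmetric (4 frames on each side) with edge frames repeated at utterance boundaries.

```python
import random


def splice(frames, context=4):
    """Concatenate each frame with `context` neighbours on each side.

    Assumption: frames beyond the utterance edges are replaced by the
    first/last frame, so the number of frames is unchanged.
    """
    n = len(frames)
    out = []
    for t in range(n):
        row = []
        for k in range(-context, context + 1):
            idx = min(max(t + k, 0), n - 1)
            row.extend(frames[idx])
        out.append(row)
    return out


def project(frames, matrix, offset=None):
    """Apply y = A x (+ b) to every frame; rows of `matrix` are output dims."""
    result = []
    for f in frames:
        y = [sum(w * x for w, x in zip(row, f)) for row in matrix]
        if offset is not None:
            y = [v + b for v, b in zip(y, offset)]
        result.append(y)
    return result


def random_matrix(rows, cols, rng):
    """Placeholder for a trained transform (NOT an estimated LDA/MLLT/fMLLR)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_frames = 20
mfcc = [[rng.gauss(0.0, 1.0) for _ in range(13)] for _ in range(num_frames)]

# Baseline: splice 13-dim MFCCs across 9 frames -> 13 * 9 = 117 dims.
spliced1 = splice(mfcc, context=4)
# LDA reduces 117 -> 40; MLLT is a 40 x 40 decorrelating transform.
baseline = project(spliced1, random_matrix(40, 117, rng))
baseline = project(baseline, random_matrix(40, 40, rng))
# fMLLR is a per-speaker affine transform in the 40-dim space.
fmllr_b = [rng.gauss(0.0, 1.0) for _ in range(40)]
adapted = project(baseline, random_matrix(40, 40, rng), offset=fmllr_b)

# Best configuration in the abstract: splice the adapted features again
# across 9 frames -> 40 * 9 = 360 dims, then a second LDA down to 200.
spliced2 = splice(adapted, context=4)
dnn_input = project(spliced2, random_matrix(200, 360, rng))

print(len(spliced1[0]), len(adapted[0]), len(spliced2[0]), len(dnn_input[0]))
```

The point of the sketch is that speaker adaptation stays in the low-dimensional (40-dim) space, while the DNN input grows to 200 dimensions (or 300, by changing the row count of the last matrix) only after the second splice.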
BibTeX:
@INPROCEEDINGS{
   author = {Shakti P. Rath and Daniel Povey and Karel Vesel{\'{y}} and
	Jan {\v{C}}ernock{\'{y}}},
   title = {Improved Feature Processing for Deep Neural Networks},
   pages = {109--113},
   booktitle = {Proceedings of Interspeech 2013},
   journal = {Proceedings of the 14th Annual Conference of the
	International Speech Communication Association (Interspeech
	2013).},
   number = {8},
   year = {2013},
   location = {Lyon, FR},
   publisher = {International Speech Communication Association},
   ISBN = {978-1-62993-443-3},
   ISSN = {2308-457X},
   language = {english},
   url = {http://www.fit.vutbr.cz/research/view_pub.php?id=10432}
}
