Conference proceedings paper

RATH Shakti P., POVEY Daniel, VESELÝ Karel and ČERNOCKÝ Jan. Improved Feature Processing for Deep Neural Networks. In: Proceedings of Interspeech 2013. Lyon: International Speech Communication Association, 2013, pp. 109-113. ISBN 978-1-62993-443-3. ISSN 2308-457X.
Publication language: English
Publication title: Improved Feature Processing for Deep Neural Networks
Title (cs): Zlepšené zpracování příznaků pro hluboké neuronové sítě
Pages: 109-113
Proceedings: Proceedings of Interspeech 2013
Conference: Interspeech 2013
Place of publication: Lyon, FR
Year: 2013
ISBN: 978-1-62993-443-3
Journal: Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), No. 8, Lyon, FR
ISSN: 2308-457X
Publisher: International Speech Communication Association
URL: http://www.fit.vutbr.cz/research/groups/speech/publi/2013/rath2_interspeech2013_IS130300.pdf [PDF]
Keywords
speech recognition, speaker recognition, neural networks, speaker adaptation
Annotation
In this paper, we investigated alternative ways of processing MFCC-based features for use as input to Deep Neural Networks (DNNs). The paper discusses improved feature processing for deep neural networks.
Abstract
In this paper, we investigate alternative ways of processing MFCC-based features to use as the input to Deep Neural Networks (DNNs). Our baseline is a conventional feature pipeline that involves splicing the 13-dimensional front-end MFCCs across 9 frames, followed by applying LDA to reduce the dimension to 40 and then further decorrelation using MLLT. Confirming the results of other groups, we show that speaker adaptation applied on the top of these features using feature-space MLLR is helpful. The fact that the number of parameters of a DNN is not strongly sensitive to the input feature dimension (unlike GMM-based systems) motivated us to investigate ways to increase the dimension of the features. In this paper, we investigate several approaches to derive higher-dimensional features and verify their performance with DNN. Our best result is obtained from splicing our baseline 40-dimensional speaker adapted features again across 9 frames, followed by reducing the dimension to 200 or 300 using another LDA. Our final result is about 3% absolute better than our best GMM system, which is a discriminatively trained model.
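To make the feature pipeline described in the abstract concrete, the following is a minimal Python/NumPy sketch of frame splicing followed by an LDA-style linear projection. The function name splice_frames and the random matrices standing in for learned LDA/MLLT transforms are illustrative assumptions, not the authors' actual implementation.

# Sketch of the spliced-features pipeline from the abstract: splice 13-dim
# MFCC frames across a +/-4 frame context (9 frames total), then project to a
# lower dimension with a learned linear transform (here a random stand-in).
import numpy as np

def splice_frames(feats: np.ndarray, context: int = 4) -> np.ndarray:
    """Stack each frame with `context` frames on either side.

    feats: (num_frames, feat_dim) array, e.g. 13-dim MFCCs.
    Returns: (num_frames, (2*context + 1) * feat_dim) array.
    Edge frames are handled by repeating the first/last frame.
    """
    num_frames, feat_dim = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    spliced = np.zeros((num_frames, (2 * context + 1) * feat_dim))
    for t in range(num_frames):
        spliced[t] = padded[t:t + 2 * context + 1].reshape(-1)
    return spliced

# Random data standing in for 13-dim MFCCs of one utterance.
mfcc = np.random.randn(200, 13)            # 200 frames, 13 coefficients
spliced = splice_frames(mfcc, context=4)   # -> (200, 117): 9 x 13
# An LDA (plus MLLT) transform estimated on training data would map
# 117 dims to 40; a random matrix stands in for that learned transform.
lda_matrix = np.random.randn(40, 117)
features_40 = spliced @ lda_matrix.T       # -> (200, 40)
# The paper's best setup splices the 40-dim speaker-adapted features again
# across 9 frames (-> 360 dims) and reduces to 200-300 with another LDA.
spliced_again = splice_frames(features_40, context=4)    # -> (200, 360)
second_lda = np.random.randn(300, 360)
features_300 = spliced_again @ second_lda.T              # -> (200, 300)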
BibTeX:
@INPROCEEDINGS{rath2013improved,
   author = {Shakti P. Rath and Daniel Povey and Karel Vesel{\'{y}} and Jan {\v{C}}ernock{\'{y}}},
   title = {Improved Feature Processing for Deep Neural Networks},
   pages = {109--113},
   booktitle = {Proceedings of Interspeech 2013},
   journal = {Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013)},
   number = {8},
   year = {2013},
   location = {Lyon, FR},
   publisher = {International Speech Communication Association},
   ISBN = {978-1-62993-443-3},
   ISSN = {2308-457X},
   language = {english},
   url = {http://www.fit.vutbr.cz/research/view_pub.php.cs?id=10432}
}
