
Kernel approximation methods for speech recognition. (English) Zbl 1489.68244

Summary: We study the performance of kernel methods on the acoustic modeling task for automatic speech recognition, and compare their performance to deep neural networks (DNNs). To scale the kernel methods to large data sets, we use the random Fourier feature method of A. Rahimi and B. Recht [“Random features for large-scale kernel machines”, in: J. Platt (ed.) et al., Advances in neural information processing systems 20. Red Hook, NY: Curran Associates, Inc. 8 p. (2007)]. We propose two novel techniques for improving the performance of kernel acoustic models. First, we propose a simple but effective feature selection method which reduces the number of random features required to attain a fixed level of performance. Second, we present a number of metrics which correlate strongly with speech recognition performance when computed on the heldout set; we attain improved performance by using these metrics to decide when to stop training. Additionally, we show that the linear bottleneck method of T. N. Sainath et al. [“Low-rank matrix factorization for deep neural network training with high-dimensional output targets”, in: Proceedings of the 2013 international IEEE conference on acoustics, speech and signal processing, ICASSP’13. Los Alamitos, CA: IEEE Computer Society. 6655–6659 (2013; doi:10.1109/ICASSP.2013.6638949)] improves the performance of our kernel models significantly, in addition to speeding up training and making the models more compact. Leveraging these three methods, the kernel methods attain token error rates between 0.5% better and 0.1% worse than fully-connected DNNs across four speech recognition data sets, including the TIMIT and Broadcast News benchmark tasks.
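The two core building blocks mentioned in the summary can be illustrated with a short sketch. The snippet below is a minimal NumPy illustration of (i) the random Fourier feature approximation to a Gaussian kernel in the style of Rahimi and Recht, and (ii) a linear bottleneck, i.e. a low-rank factorization of the output weight matrix in the style of Sainath et al.; the dimensions, kernel bandwidth, and variable names are illustrative assumptions, not the configurations used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_map(X, n_features=2000, gamma=0.5):
    """Random Fourier features approximating the Gaussian kernel
    k(x, y) = exp(-gamma * ||x - y||^2)."""
    d = X.shape[1]
    # Frequencies sampled from the kernel's spectral density (a Gaussian),
    # plus uniform phase shifts.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Illustrative sizes (not the paper's settings): 360-dimensional acoustic
# input, 2000 random features, 1000 output states, rank-100 bottleneck.
X = rng.standard_normal((32, 360))   # a mini-batch of acoustic frames
Z = rff_map(X)                       # Z @ Z.T approximates the kernel Gram matrix

n_states, r = 1000, 100
U = rng.standard_normal((Z.shape[1], r)) * 0.01   # bottleneck projection
V = rng.standard_normal((r, n_states)) * 0.01     # output layer
logits = Z @ U @ V   # low-rank factorization of the D x n_states output
                     # matrix: D*r + r*n_states parameters instead of D*n_states
```

In this sketch a linear (multinomial logistic) model trained on Z plays the role of the kernel acoustic model, and the factorization U @ V is what makes the model compact when the number of output states is large.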

MSC:

68T10 Pattern recognition, speech recognition
62J12 Generalized linear models (logistic models)
68T07 Artificial neural networks and deep learning
Full Text: arXiv Link

References:

[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In STOC, 2017. · Zbl 1369.68290
[2] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Awni Y. Hannun, Billy Jun, Tony Han, Patrick LeGresley, Xiangang Li, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Sheng Qian, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Chong Wang, Yi Wang, Zhiqian Wang, Bo Xiao, Yan Xie, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In ICML, 2016.
[3] Animashree Anandkumar and Rong Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In COLT, 2016.
[4] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. Globally normalized transition-based neural networks. In ACL, 2016.
[5] Devansh Arpit, Stanislaw K. Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, 2017.
[6] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, 2014.
[7] Lalit R Bahl, Peter F Brown, Peter V De Souza, and Robert L Mercer. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In ICASSP, 1986.
[8] Peter L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In NIPS, 1996.
[9] Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 34(5):1-41, 2007.
[10] Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Trans. Neural Netw. Learning Syst., 25(8):1553-1565, 2014.
[11] Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston. Large-Scale Kernel Machines. MIT Press, 2007.
[12] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP, pages 4960-4964. IEEE, 2016.
[13] Jie Chen, Lingfei Wu, Kartik Audhkhasi, Brian Kingsbury, and Bhuvana Ramabhadran. Efficient one-vs-one kernel ridge regression for speech recognition. In ICASSP, 2016.
[14] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani. State-of-the-art speech recognition with sequence-to-sequence models. In ICASSP, 2018.
[15] Anna Choromanska, Mikael Henaff, Michaël Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
[16] Kenneth L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Trans. Algorithms, 6(4):63:1-63:30, 2010. · Zbl 1300.90026
[17] George Cybenko. Approximation by superpositions of a sigmoidal function. MCSS, 2(4):303-314, 1989. · Zbl 0679.94019
[18] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio, Speech & Language Processing, 20(1):30-42, 2012.
[19] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In NIPS, 2014.
[20] Yann N. Dauphin, Razvan Pascanu, Çaglar Gülçehre, KyungHyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, 2014.
[21] Dennis DeCoste and Bernhard Schölkopf. Training invariant support vector machines. Machine Learning, 46(1-3):161-190, 2002. · Zbl 0998.68102
[22] Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verification. IEEE Trans. Audio, Speech & Language Processing, 19(4):788-798, 2011.
[23] John C. Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899-2934, 2009. · Zbl 1235.62151
[24] Jonathan Fiscus, George Doddington, Audrey Le, Greg Sanders, Mark Przybocki, and David Pallett. 2003 NIST Rich Transcription evaluation data. Linguistic Data Consortium, 2003. URL https://catalog.ldc.upenn.edu/LDC2007S10.
[25] Mark J. F. Gales. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 12(2):75-98, 1998.
[26] Mark J. F. Gales and Steve J. Young. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3):195-304, 2007. · Zbl 1145.68045
[27] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT acoustic phonetic continuous speech corpus. Linguistic Data Consortium, 1993. URL https://catalog.ldc.upenn.edu/LDC93S1.
[29] Matthew Gibson and Thomas Hain. Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition. In INTERSPEECH, 2006.
[30] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[31] Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. Deep Learning. Adaptive computation and machine learning. MIT Press, 2016. · Zbl 1373.68009
[32] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.
[33] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. In ICML, 2014.
[34] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
[35] Wolfgang Karl Härdle, Marlene Müller, Stefan Sperlich, and Axel Werwatz. Nonparametric and semiparametric models. Springer Science & Business Media, 2004. · Zbl 1059.62032
[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[37] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29, 2012.
[38] Kurt Hornik, Maxwell B. Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989. · Zbl 1383.92015
[39] Po-Sen Huang, Haim Avron, Tara N. Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on TIMIT. In ICASSP, 2014.
[40] G. J. O. Jameson. A simple proof of Stirling's formula for the gamma function. The Mathematical Gazette, 99(544):68-74, 2015. · Zbl 1384.33004
[41] Janez Kaiser, Bogomir Horvat, and Zdravko Kacic. A novel loss function for the overall risk criterion based discriminative training of HMM models. In INTERSPEECH, 2000. · Zbl 1005.68819
[42] Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. In AISTATS, 2012.
[43] Brian Kingsbury. Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In ICASSP, 2009.
[44] Brian Kingsbury, Jia Cui, Xiaodong Cui, Mark J. F. Gales, Kate Knill, Jonathan Mamou, Lidia Mangu, David Nolden, Michael Picheny, Bhuvana Ramabhadran, Ralf Schlüter, Abhinav Sethy, and Philip C. Woodland. A high-performance Cantonese keyword search system. In ICASSP, 2013.
[45] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. · Zbl 1318.68153
[46] Quoc V. Le, Tamás Sarlós, and Alexander J. Smola. Fastfood - computing Hilbert space expansions in loglinear time. In ICML, 2013.
[47] Zhiyun Lu, Dong Guo, Alireza Bagheri Garakani, Kuan Liu, Avner May, Aurélien Bellet, Linxi Fan, Michael Collins, Brian Kingsbury, Michael Picheny, and Fei Sha. A comparison between deep neural nets and kernel acoustic models for speech recognition. In ICASSP, 2016. · Zbl 1489.68244
[48] Avner May, Michael Collins, Daniel J. Hsu, and Brian Kingsbury. Compact kernel models for acoustic modeling via random feature selection. In ICASSP, 2016.
[49] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7:2651-2667, 2006. · Zbl 1222.68266
[50] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010.
[51] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR Workshop, 2013.
[52] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton. Acoustic modeling using deep belief networks. IEEE Trans. Audio, Speech & Language Processing, 20(1):14-22, 2012.
[53] Guido F. Montúfar, Razvan Pascanu, KyungHyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.
[54] N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NIPS, 1990.
[55] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In ICLR (Workshop), 2015.
[56] Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random matrix theory. In ICML, 2017.
[57] Jeffrey Pennington, Felix X. Yu, and Sanjiv Kumar. Spherical random features for polynomial kernels. In NIPS, 2015.
[58] John C. Platt. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.
[59] Daniel Povey and Brian Kingsbury. Evaluation of proposed modifications to MPE for large scale discriminative training. In ICASSP, 2007.
[60] Daniel Povey and Philip C. Woodland. Minimum phone error and I-smoothing for improved discriminative training. In ICASSP, 2002.
[62] Daniel Povey, Dimitri Kanevsky, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, and Karthik Visweswariah. Boosted MMI for model and feature-space discriminative training. In ICASSP, 2008.
[63] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. Purely sequence-trained neural networks for ASR based on lattice-free MMI. In INTERSPEECH, 2016.
[64] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[65] Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran, Petr Fousek, Petr Novák, and Abdel-rahman Mohamed. Making deep belief networks effective for large vocabulary continuous speech recognition. In ASRU, 2011.
[66] Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In ICASSP, 2013a.
[67] Tara N. Sainath, Brian Kingsbury, Hagen Soltau, and Bhuvana Ramabhadran. Optimization techniques to improve training speed of deep neural networks for large speech tasks. IEEE Trans. Audio, Speech & Language Processing, 21(11):2267-2276, 2013b.
[68] Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. Deep convolutional neural networks for LVCSR. In ICASSP, 2013c.
[69] Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH, 2014.
[70] George Saon, Tom Sercu, Steven J. Rennie, and Hong-Kwang Jeff Kuo. The IBM 2016 English conversational telephone speech recognition system. In INTERSPEECH, 2016.
[71] George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, and Phil Hall. English conversational telephone speech recognition by humans and machines. In INTERSPEECH, 2017.
[72] B. Schölkopf and A. Smola. Learning with kernels. MIT Press, 2002.
[73] Frank Seide, Gang Li, Xie Chen, and Dong Yu. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In ASRU, 2011a.
[74] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using context-dependent deep neural networks. In INTERSPEECH, 2011b.
[75] Tom Sercu and Vaibhava Goel. Advances in very deep convolutional neural networks for LVCSR. In INTERSPEECH, 2016.
[76] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[77] Hagen Soltau, George Saon, and Brian Kingsbury. The IBM Attila speech recognition toolkit. In SLT, 2010.
[78] Hagen Soltau, George Saon, and Tara N. Sainath. Joint training of convolutional and non-convolutional neural networks. In ICASSP, 2014.
[79] Sören Sonnenburg and Vojtech Franc. COFFIN: a computational framework for linear SVMs. In ICML, 2010.
[80] Ingo Steinwart. Sparseness of support vector machines—some asymptotically sharp bounds. In NIPS, 2003. · Zbl 1094.68082
[81] Nikko Ström. Sparse connection and pruning in large dynamic artificial neural networks. In EUROSPEECH, 1997.
[82] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In INTERSPEECH, 2012.
[83] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[84] Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363-392, 2005. · Zbl 1222.68320
[85] V. Valtchev, J. J. Odell, Philip C. Woodland, and Steve J. Young. MMIE training of large vocabulary recognition systems. Speech Communication, 22(4):303-314, 1997.
[86] Ewout van den Berg, Bhuvana Ramabhadran, and Michael Picheny. Training variance and performance evaluation of neural networks in speech. In ICASSP, 2017.
[87] Andrea Vedaldi and Andrew Zisserman. Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell., 34(3):480-492, 2012.
[88] Karel Veselý, Arnab Ghoshal, Lukás Burget, and Daniel Povey. Sequence-discriminative training of deep neural networks. In INTERSPEECH, 2013.
[89] Christopher K. I. Williams and Matthias W. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2000.
[90] Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. In AISTATS, 2017.
[91] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Michael L. Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. Toward human parity in conversational speech recognition. IEEE/ACM Trans. Audio, Speech & Language Processing, 25(12):2410-2423, 2017.
[92] Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In INTERSPEECH, 2013.
[94] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alexander J. Smola, Le Song, and Ziyu Wang. Deep fried convnets. In ICCV, 2015.
[95] Ian En-Hsu Yen, Ting-Wei Lin, Shou-De Lin, Pradeep Ravikumar, and Inderjit S. Dhillon. Sparse random feature algorithm as coordinate descent in Hilbert space. In NIPS, 2014.
[96] Felix X. Yu, Sanjiv Kumar, Henry A. Rowley, and Shih-Fu Chang. Compact nonlinear maps and circulant extensions. arXiv preprint arXiv:1503.03893, 2015.
[97] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or perfect matching.