zbMATH — the first resource for mathematics

On kernel method-based connectionist models and supervised deep learning without backpropagation. (English) Zbl 07268864
Summary: We propose a novel family of connectionist models based on kernel machines and consider the problem of learning layer by layer a compositional hypothesis class (i.e., a feedforward, multilayer architecture) in a supervised setting. In terms of the models, we present a principled method to “kernelize” (partly or completely) any neural network (NN). With this method, we obtain a counterpart of any given NN that is powered by kernel machines instead of neurons. In terms of learning, when learning a feedforward deep architecture in a supervised setting, one needs to train all the components simultaneously using backpropagation (BP) since there are no explicit targets for the hidden layers (Rumelhart, Hinton, & Williams, 1986). We consider without loss of generality the two-layer case and present a general framework that explicitly characterizes a target for the hidden layer that is optimal for minimizing the objective function of the network. This characterization then makes possible a purely greedy training scheme that learns one layer at a time, starting from the input layer. We provide instantiations of the abstract framework under certain architectures and objective functions. Based on these instantiations, we present a layer-wise training algorithm for an \(l\)-layer feedforward network for classification, where \(l\geq 2\) can be arbitrary. This algorithm can be given an intuitive geometric interpretation that makes the learning dynamics transparent. Empirical results are provided to complement our theory. We show that the kernelized networks, trained layer-wise, compare favorably with classical kernel machines as well as other connectionist models trained by BP. We also visualize the inner workings of the greedy kernelized models to validate our claim on the transparency of the layer-wise algorithm.
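The two ideas in the summary (replacing neurons with kernel machines, then training one layer at a time and freezing it) can be sketched in a few lines of NumPy. This toy illustration makes generic choices that are only assumptions for the demo (a Gaussian kernel, kernel ridge regression, and the one-hot labels as a stand-in target for the hidden layer); it is not the paper's optimal-target characterization, only the overall greedy, BP-free training pattern.

```python
import numpy as np

def rbf_kernel(X, C, gamma=1.0):
    # Gaussian kernel matrix: K[i, j] = exp(-gamma * ||X[i] - C[j]||^2)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

class KernelLayer:
    """One "kernelized" layer: every unit is a kernel machine
    f_j(x) = sum_i alpha[i, j] * k(x, c_i), centered on the training inputs."""

    def __init__(self, gamma=1.0, ridge=1e-3):
        self.gamma, self.ridge = gamma, ridge

    def fit(self, X, T):
        # Greedy step: fit this layer alone by kernel ridge regression
        # toward a target representation T, then freeze it.
        self.centers = X
        K = rbf_kernel(X, X, self.gamma)
        self.alpha = np.linalg.solve(K + self.ridge * np.eye(len(X)), T)
        return self

    def __call__(self, X):
        return rbf_kernel(X, self.centers, self.gamma) @ self.alpha

# Toy run on XOR: train the hidden layer first, freeze it, then train the
# output layer on the frozen hidden representation. No gradients flow
# between layers, i.e., no backpropagation is used.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 1, 1, 0])
Y = np.eye(2)[y]                  # one-hot class targets

hidden = KernelLayer().fit(X, Y)  # proxy hidden target: the labels themselves
H = hidden(X)                     # frozen hidden representation
output = KernelLayer().fit(H, Y)  # output layer trained on H alone
preds = output(H).argmax(axis=1)
```

Each layer is solved in closed form and never revisited, which is the transparency the layer-wise scheme is claimed to buy; the paper's contribution is precisely a principled (rather than proxy) choice of the hidden-layer target.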
68-XX Computer science
Full Text: DOI
[1] Bach, F. R., Lanckriet, G. R., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning (p. 6). New York: ACM.
[2] Balduzzi, D., Vanchinathan, H., & Buhmann, J. M. (2015). Kickback cuts backprop’s red-tape: Biologically plausible credit assignment in neural networks. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 485-491). Palo Alto, CA: AAAI.
[3] Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463-482. · Zbl 1084.68549
[4] Bengio, Y. (2014). How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv:1407.7906.
[5] Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828.
[6] Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems, 19 (pp. 153-160). Cambridge, MA: MIT Press.
[7] Carreira-Perpinan, M., & Wang, W. (2014). Distributed optimization of deeply nested systems. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 22 (pp. 10-19).
[8] Cho, Y., & Saul, L. K. (2009). Kernel methods for deep learning. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems (pp. 342-350). Cambridge, MA: MIT Press.
[9] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297. · Zbl 0831.68098
[10] Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. S. (2002). On kernel-target alignment. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 367-373). Cambridge, MA: MIT Press.
[11] Erdogmus, D., Fontenla-Romero, O., Principe, J. C., Alonso-Betanzos, A., & Castillo, E. (2005). Linear-least-squares initialization of multilayer perceptrons through backpropagation of the desired response. IEEE Transactions on Neural Networks, 16(2), 325-337.
[12] Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 524-532). San Mateo, CA: Morgan Kaufmann.
[13] Gardner, J. R., Upchurch, P., Kusner, M. J., Li, Y., Weinberger, K. Q., Bala, K., & Hopcroft, J. E. (2015). Deep manifold traversal: Changing labels with convolutional features. arXiv:1511.06421.
[14] Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A neural algorithm of artistic style. arXiv:1508.06576.
[15] Gehler, P., & Nowozin, S. (2008). Infinite kernel learning (Technical Report TR-178). Tübingen, Germany: Max Planck Institute for Biological Cybernetics.
[16] Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (pp. 315-323).
[17] Gönen, M., & Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12, 2211-2268. · Zbl 1280.68167
[18] Hermans, M., & Schrauwen, B. (2012). Recurrent kernel machines: Computing with infinite echo state networks. Neural Computation, 24(1), 104-133. · Zbl 1238.68125
[19] Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554. · Zbl 1106.68094
[20] Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507. · Zbl 1226.68083
[21] Huang, F. J., & LeCun, Y. (2006). Large-scale learning with SVM and convolutional nets for generic object categorization. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 284-291). Piscataway, NJ: IEEE.
[22] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
[23] Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., & Kavukcuoglu, K. (2016). Decoupled neural interfaces using synthetic gradients. arXiv:1608.05343.
[24] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
[25] Kloft, M., Brefeld, U., Sonnenburg, S., & Zien, A. (2011). LP-norm multiple kernel learning. Journal of Machine Learning Research, 12, 953-997. · Zbl 1280.68173
[26] Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Tech. Rep.). Citeseer.
[27] Lanckriet, G. R., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27-72. · Zbl 1222.68241
[28] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning (pp. 473-480). New York: ACM.
[29] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[30] LeCun, Y., Cortes, C., & Burges, C. (2010). MNIST handwritten digit database. AT&T Labs.
[31] Lee, D.-H., Zhang, S., Fischer, A., & Bengio, Y. (2015). Difference target propagation. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 498-515). Berlin: Springer.
[32] Mairal, J., Koniusz, P., Harchaoui, Z., & Schmid, C. (2014). Convolutional kernel networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 2627-2635). Red Hook, NY: Curran.
[33] McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4), 115-133. · Zbl 0063.03860
[34] Micchelli, C. A., Xu, Y., & Zhang, H. (2006). Universal kernels. Journal of Machine Learning Research, 7, 2651-2667. · Zbl 1222.68266
[35] Park, J., & Sandberg, I. W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3(2), 246-257.
[36] Paszke, A., Gross, S., Chintala, S., & Chanan, G. (2017). PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration. https://github.com/pytorch/pytorch
[37] Pisier, G. (1999). The volume of convex bodies and Banach space geometry. Cambridge: Cambridge University Press. · Zbl 0933.46013
[38] Raghu, M., Gilmer, J., Yosinski, J., & Sohl-Dickstein, J. (2017). SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 6076-6085). Red Hook, NY: Curran.
[39] Rahimi, A., & Recht, B. (2008). Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 1177-1184). Cambridge, MA: MIT Press.
[40] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-538. · Zbl 1369.68284
[41] Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In D. Helmbold & B. Williamson (Eds.), Computational learning theory (pp. 416-426). Berlin: Springer. · Zbl 0992.68088
[42] Schölkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
[43] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958. · Zbl 1318.68153
[44] Sun, S., Chen, W., Wang, L., Liu, X., & Liu, T.-Y. (2016). On the depth of deep neural networks: A theoretical view. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (pp. 2066-2072). Palo Alto, CA: AAAI.
[45] Suykens, J. A. (2017). Deep restricted kernel machines using conjugate feature duality. Neural Computation, 29(8), 2123-2163. · Zbl 1456.68178
[46] Suykens, J. A., & Vandewalle, J. (1999). Training multilayer perceptron classifiers based on a modified support vector method. IEEE Transactions on Neural Networks, 10(4), 907-911.
[47] Tang, Y. (2013). Deep learning using linear support vector machines. arXiv:1306.0239.
[48] Tieleman, T., & Hinton, G. (2012). Lecture 6.5: RMSProp. Coursera: Neural networks for machine learning (Technical report). Toronto: University of Toronto.
[49] Vapnik, V. (2000). The nature of statistical learning theory. Berlin: Springer. · Zbl 0934.62009
[50] Varma, M., & Babu, B. R. (2009). More generality in efficient multiple kernel learning. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1065-1072). New York: ACM.
[51] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 3371-3408. · Zbl 1242.68256
[52] Wilson, A. G., Hu, Z., Salakhutdinov, R., & Xing, E. P. (2016). Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (pp. 370-378).
[53] Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747.
[54] Xu, Z., Jin, R., King, I., & Lyu, M. (2009). An extended level method for efficient multiple kernel learning. In D. Koller, D. Schuurmans, Y. Bengio & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 1825-1832). Red Hook, NY: Curran.
[55] Zhang, S., Li, J., Xie, P., Zhang, Y., Shao, M., Zhou, H., & Yan, M. (2017). Stacked kernel network. arXiv:1711.09219.
[56] Zhou, Z.-H., & Feng, J. (2017). Deep forest: Towards an alternative to deep neural networks. arXiv:1702.08835.
[57] Zhuang, J., Tsang, I. W., & Hoi, S. C. (2011). Two-layer multiple kernel learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (pp. 909-917).
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.