zbMATH — the first resource for mathematics

How effectively train large-scale machine learning models? (English) Zbl 1447.90032
Fathi, Mahdi (ed.) et al., Optimization in large scale problems. Industry 4.0 and society 5.0 applications. Cham: Springer. Springer Optim. Appl. 152, 97-110 (2019).
Summary: The stochastic gradient method (SGM) is widely used as an optimization tool in many machine learning applications, including support vector machines (SVMs), logistic regression, graphical models, and deep learning. SGM computes an estimate of the gradient from a single randomly chosen sample in each iteration, so applying it to large-scale machine learning problems can be computationally efficient. In this work, we focus on deriving generalization bounds for a randomized algorithm, namely random Fourier features learned with a stochastic gradient descent algorithm. Our findings are based on the mutual relationship between the generalization error of an algorithm and its stability properties. Here the generalization error is measured as the absolute difference between the test error and the training error, and an algorithm is called stable if changing any single training data point changes the training error only slightly. In this work, we measured the stability of the SGM for learning an approximated Fourier primal support vector machine. In particular, under certain regularity assumptions, we showed that a randomized algorithm such as random Fourier features trained by the SGM with few iterations has a vanishing generalization error. The iterative optimization algorithm can therefore be stopped long before convergence, reducing the computational cost. We empirically verified the theoretical findings for different parameters using several data sets.
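The pipeline described in the summary — approximating an RBF kernel with random Fourier features and training the primal SVM by single-sample stochastic subgradient steps, stopped after few iterations — can be sketched as follows. This is an illustrative sketch only: the synthetic data, kernel width `gamma`, feature count `D`, regularization `lam`, and the Pegasos-style step size are assumptions for demonstration, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data with labels in {-1, +1} (illustrative).
n, d = 500, 5
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=n))
y[y == 0] = 1

# Random Fourier features approximating the RBF kernel
# k(x, x') = exp(-gamma * ||x - x'||^2):  z(x) = sqrt(2/D) * cos(W x + b),
# with rows of W drawn from N(0, 2*gamma*I) and b uniform on [0, 2*pi).
gamma, D = 0.5, 200
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def features(X):
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

Z = features(X)

# Stochastic subgradient descent on the regularized hinge loss (primal SVM),
# Pegasos-style: one randomly chosen sample per iteration, stopped early
# after a fixed small budget of iterations T.
lam, T = 0.01, 2000
w = np.zeros(D)
for t in range(1, T + 1):
    i = rng.integers(n)
    eta = 1.0 / (lam * t)          # standard Pegasos step size
    margin = y[i] * (Z[i] @ w)
    w *= (1 - eta * lam)           # subgradient of the L2 regularizer
    if margin < 1:                 # subgradient of the hinge loss
        w += eta * y[i] * Z[i]

train_acc = np.mean(np.sign(Z @ w) == y)
```

Stopping after a small, fixed number of iterations `T` is exactly the regime the summary's stability argument speaks to: fewer SGM steps mean a smaller possible gap between training and test error, at the price of a less converged training objective.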
For the entire collection see [Zbl 1439.90008].
90C25 Convex programming
90C90 Applications of mathematical programming
Full Text: DOI