Least squares support vector machines.

*(English)* Zbl 1017.93004
Singapore: World Scientific. xiv, 294 p. (2002).

The study of pattern recognition, classification and nonlinear modelling has been facilitated over the past few decades by advances in neural networks. And while the latter have sometimes been seen almost as a panacea to practical problems, the authors of the book under review note that “one has come in a stage now where it is important to understand the limits of intelligence, both artificial and human” [p. viii]. In thus recognising the importance of bounds the authors may perhaps be seen as supporting Agnosticism in the sense in which the word was originally introduced by T. H. Huxley, viz. “a man shall not say he knows or believes that which he has no scientific grounds for professing to know or believe” [III, p. 98].

The authors present a work that, broadly speaking, deals with (methods for) the solution of nonlinear modelling and classification problems by convex optimization, these methods being relatively free of local minima. I shall offer here a brief summary of the several chapters.

In the introductory chapter the authors review some of the relevant matters from neural networks. Multilayer perceptron neural networks are introduced, and their importance in the approximation of continuous nonlinear functions is stressed (one hidden layer is sufficient for universal approximation). Such approximation is better than that using polynomial expansions inasmuch as the dimension of the input space can be better handled [p. 4], and indeed the matter of dimension reduction is outlined in §1.5. Radial Basis Function (RBF) networks are also mentioned. Classification and Pattern Recognition are considered, with regression methods being introduced together with an appropriate Bayesian approach. The importance of both parametric and non-parametric methods is stressed.

In Chapter 2, ‘Support Vector Machines’, the authors present some standard formulations of SVMs. Both linear and nonlinear SVM classifiers are treated, in both the separable and the non-separable case. The SVM formulations being established within the context of convex optimization theory, the problem initially framed in the primal weight space is solved by first formulating the Lagrangian and then solving the problem in the dual Lagrange multiplier space. It is noted that the primal and dual problems correspond to parametric and non-parametric approaches respectively. The use of SVMs in linear and nonlinear function estimation is expedited by the use of the ‘kernel trick’. Here the input data are mapped into a high dimensional feature space by a nonlinear mapping \(\phi\), the application of Mercer’s Theorem then allowing one ‘to work in huge dimensional feature spaces without actually having to do explicit computations in this space’ [p. 37]. The Vapnik-Chervonenkis bound on the generalization error is given. The results of SVM regression from the commonly used cost functions are extended to any convex cost function.
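The ‘kernel trick’ just described can be illustrated in a few lines. The sketch below is my own illustration, not code from the book: each entry of the Gaussian (RBF) kernel matrix equals an inner product \(\langle\phi(x_i),\phi(z_j)\rangle\) in a feature space that is never constructed explicitly.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Gaussian kernel K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2)).

    By Mercer's theorem each entry is an inner product <phi(x_i), phi(z_j)>
    in an (here infinite-dimensional) feature space that is never built.
    """
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))
```

The resulting matrix is symmetric and positive semi-definite, as Mercer's Theorem requires, even though the feature map \(\phi\) itself is never computed.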

The third chapter, ‘Basic Methods of Least Squares Support Vector Machines’, considers results for classification and nonlinear function estimation. The statistician will be interested to see here a link with the Fisher discriminant analysis in high dimensional feature spaces. The close relationship between LS-SVM regression and regularization networks, Gaussian processes, reproducing-kernel Hilbert spaces and (again for the statistician) Kriging and kernel ridge regression is emphasized, and it is also noted that linear Karush-Kuhn-Tucker systems characterize LS-SVM models for classification and nonlinear regression.

Chapter 4, ‘Bayesian Inference for LS-SVM Models’, contains a complete framework for the Bayesian inference of LS-SVM classifiers and function estimators. Bayesian methods allow the automatic determination of hyperparameters (tuning parameters in the LS-SVM setting) and also the derivation of error bars on the output. The authors show that the Bayesian inference LS-SVM approach to nonlinear function estimation is very similar to the classification approach.

In the univariate case the authors define the Occam factor by \(p(\widehat{\theta}\mid \mathcal{H}_\sigma) = \sigma_{\theta\mid \mathcal{D}}/\sigma_\theta\), where \(\widehat{\theta}\) is the point at which the posterior density is maximised, \({\mathcal{H}_\sigma}\) is a model with RBF kernel width \(\sigma\), and \(\mathcal{D}\) is the given data training set. Further, \(\sigma_{\theta\mid\mathcal{D}}\) and \(\sigma_\theta\) denote the spread of the posterior and prior distributions respectively. The statistician who is perhaps more used to Bayesian methods may prefer to view the Occam factor as a special form of the better-known Bayes factor, the Bayes factor for (model) \(M_0\) against \(M_1\) being defined by \(B_{01} = p(\mathbf{y}|M_0)/ p(\mathbf{y}|M_1)\). The Occam factor may then be seen as a measure of parsimony in model building. (The place of the Bayes factor in significance tests certainly dates back at least to H. Jeffreys [Theory of probability, Oxford (1939; Zbl 0023.14501)].)

Although some of the methods discussed so far in the text result in (relatively) simple formulations, they carry potential drawbacks arising from a lack either of sparseness (which can be overcome by the use of pruning methods) or of robustness. Chapter 5 deals with the latter problem, the robustness of LS-SVM models for nonlinear function estimation being enhanced by the use of robust statistics (e.g. trimmed means).
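A trimmed mean of the kind used in such robustification is simple to state; the sketch below illustrates the general idea (it is not the authors' estimator), discarding a fraction of the smallest and largest observations before averaging:

```python
def trimmed_mean(xs, frac=0.1):
    """Mean after discarding the smallest and largest frac of the data."""
    xs = sorted(xs)
    k = int(len(xs) * frac)
    core = xs[k:len(xs) - k] if k > 0 else xs
    return sum(core) / len(core)
```

A single gross outlier can move the ordinary mean arbitrarily far, yet leaves a 20%-trimmed mean untouched, which is precisely the property exploited for robust LS-SVM function estimation.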

Large scale problems are considered in Chapter 6, extensive use being made of the Nyström method. The authors present a new technique of fixed size LS-SVM, in which explicit links between function estimation and density estimation are given. This technique also exploits the primal-dual formulations and shows how to choose suitable support vectors as opposed to the random points in the Nyström method. Attention is also paid to committee networks, the guiding tenet here being that ‘the whole is more than the sum of its parts’.

Chapter 7 is devoted to unsupervised learning. Major tools here are Principal Component Analysis (often referred to, for reasons that escape me, as ‘PCA analysis’) and Canonical Correlation Analysis.

The last chapter is entitled ‘LS-SVM for Recurrent Networks and Control’. Here the preceding work, concerned almost always with static situations (the formulations do not involve any recursive equations), is extended to dynamic problems. Although the problems are now non-convex, methods used before are still applicable. Attention is also given to optimal control problems.

Certain mathematical and statistical definitions and results are given in an appendix, and there is a comprehensive bibliography.

One of the authors’ main aims in writing this book is the presentation of ‘a general framework \(\ldots\) for a class of support vector machines towards supervised and unsupervised learning and feedforward as well as recurrent networks’ [p. vi]. Another is the offering of ‘an interdisciplinary forum where different fields can meet’ [p. vii], and indeed the reader will have to have at his fingertips more than a nodding acquaintance with neural networks, optimization, linear algebra, control theory and statistics among others.

One might be irritated, as I was, by the absence of full-stops in the (manifold) abbreviations – or ‘Acronyms’, as the list at the back of the book is headed. One might be tempted to blame the authors, but one must not ignore the possible influence of national preference or the more stringent ‘House Style’ commands. Typing errors are few, and easily corrected. There is a nice neologism on page 111, viz. ‘sparsify’: presumably one who makes a vector sparse is a ‘sparsifier’.

This is neither a textbook nor a reference work, and it is intended neither for the neophyte nor for the casual reader. It is rather a work that the researcher in this field would want to have at his elbow for the finding of suitable methods for use in specific situations.

Reviewer: Andrew Dale (Durban)

##### MSC:

- 93-02: Research exposition (monographs, survey articles) pertaining to systems and control theory
- 93E10: Estimation and detection in stochastic control theory
- 62G05: Nonparametric estimation
- 62M45: Neural nets and related approaches to inference from stochastic processes
- 93A15: Large-scale systems
- 68T05: Learning and adaptive systems in artificial intelligence
- 62H30: Classification and discrimination; cluster analysis (statistical aspects)
- 62H25: Factor analysis and principal components; correspondence analysis
- 62F15: Bayesian inference
- 90C25: Convex programming
- 62G35: Nonparametric robustness
- 62K20: Response surface designs