## Feature selection via sensitivity analysis of SVM probabilistic outputs.(English)Zbl 1470.68175

Summary: Feature selection is an important aspect of solving data-mining and machine-learning problems. This paper proposes a feature-selection method for the Support Vector Machine (SVM) learning. Like most feature-selection methods, the proposed method ranks all features in decreasing order of importance so that more relevant features can be identified. It uses a novel criterion based on the probabilistic outputs of SVM. This criterion, termed Feature-based Sensitivity of Posterior Probabilities (FSPP), evaluates the importance of a specific feature by computing the aggregate value, over the feature space, of the absolute difference of the probabilistic outputs of SVM with and without the feature. The exact form of this criterion is not easily computable and approximation is needed. Four approximations, FSPP1-FSPP4, are proposed for this purpose. The first two approximations evaluate the criterion by randomly permuting the values of the feature among samples of the training data. They differ in their choices of the mapping function from standard SVM output to its probabilistic output: FSPP1 uses a simple threshold function while FSPP2 uses a sigmoid function. The second two directly approximate the criterion but differ in the smoothness assumptions of criterion with respect to the features. The performance of these approximations, used in an overall feature-selection scheme, is then evaluated on various artificial problems and real-world problems, including datasets from the recent Neural Information Processing Systems (NIPS) feature selection competition. FSPP1-3 show good performance consistently with FSPP2 being the best overall by a slight margin. The performance of FSPP2 is competitive with some of the best performing feature-selection methods in the literature on the datasets that we have tested. Its associated computations are modest and hence it is suitable as a feature-selection method for SVM applications.

### MSC:

 68T05 Learning and adaptive systems in artificial intelligence

### Software:

SVMlight; UCI-ml; LIBSVM
Full Text:

### References:

 [1] Boser, B., Guyon, I., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on computational learning theory (pp. 144–152). San Mateo: Kaufmann. [2] Bradley, P. S., & Mangasarian, O. L. (1998). Feature selection via concave minimization and support vector machines. In Proceedings of the 15th international conference on machine learning (pp. 82–90). San Francisco: Kaufmann. [3] Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140. · Zbl 0858.68080 [4] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. · Zbl 1007.68152 [5] Chang, C. C., & Lin, C. J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/$$\sim$$cjlin/libsvm . [6] Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131–159. · Zbl 0998.68101 [7] Chu, W., Keerthi, S. S., & Ong, C. J. (2003). Bayesian trigonometric support vector classifier. Neural Computation, 15(9), 2227–2254. · Zbl 1085.68620 [8] Chu, W., Keerthi, S. S., & Ong, C. J. (2004). Bayesian support vector regression using a unified loss function. IEEE Transactions on Neural Networks, 15(1), 29–44. [9] Cortes, C., & Vapnik, V. N. (1995). Support vector networks. Machine Learning, 20(3), 273–297. · Zbl 0831.68098 [10] Cristianini, N., & Shawe-Taylor, J. (2000). Introduction to support vector machines. Cambridge: Cambridge University Press. · Zbl 0994.68074 [11] Duan, K. B., & Keerthi, S. S. (2005). Which is the best multiclass SVM method? An empirical study. Lecture Notes in Computer Science, 3541, 278–285. · Zbl 05570519 [12] Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182. · Zbl 1102.68556 [13] Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. N. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1–3), 389–422. · Zbl 0998.68111 [14] Guyon, I., et al. (2003). NIPS 2003 feature selection competition. http://www.nipsfsc.ecs.soton.ac.uk/ . [15] Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (2006a). Feature extraction, foundations and applications. Berlin: Springer. · Zbl 1114.68059 [16] Guyon, I., Gunn, S., Hur, A. B., & Dror, G. (2006b). Feature selection benchmark for classification problems. In I. Guyon, S. Gunn, M. Nikravesh, & L. A. Zadeh (Eds.), Feature extraction, foundations and applications. Berlin: Springer. [17] Günter, S., & Bunke, H. (2004). Feature selection algorithms for the generation of multiple classifier systems and their application to handwritten word recognition. Pattern Recognition Letters, 25(11), 1323–1336. [18] Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. The Annals of Statistics, 26(2), 451–471. · Zbl 0932.62071 [19] Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: support vector learning. Cambridge: MIT Press. [20] Keerthi, S. S. (2002). Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 13(5), 1225–1229. [21] Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324. · Zbl 0904.68143 [22] Lal, T. N., Chapelle, O., Weston, J., & Elisseeff, A. (2006). Embedded methods. In I. Guyon, S. Gunn, M. Nikravesh, & L. A. Zadeh (Eds.), Feature extraction, foundations and applications. Berlin: Springer. [23] Lee, J. H., & Lin, C. J. (2000). Automatic model selection for support vector machines (Technical Report). Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan. Online at http://www.csie.ntu.edu.tw/cjlin/papers/modelselect.ps.gz . [24] Lin, H. T., Lin, C. J., & Weng, R. C. (2003). A note on Platt’s probabilistic outputs for support vector machines (Technical Report). Department of Computer Science, National Taiwan University. http://www.csie.ntu.edu.tw/$$\sim$$cjlin/papers/plattprob.ps . [25] Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, & S. Douglas (Eds.), Neural networks for signal processing IX (pp. 41–48). New York: IEEE. [26] Neumann, J., Schnörr, C., & Steidl, G. (2005). Combined SVM-based feature selection and classification. Machine Learning, 61(1–3), 129–150. · Zbl 1137.90643 [27] Newman, D. J., Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. http://www.ics.uci.edu/$$\sim$$mlearn/MLRepository.html . [28] Page, E. S. (1967). A note on generating random permutations. Applied Statistics, 16(3), 273–274. [29] Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: support vector learning. Cambridge: MIT Press. [30] Platt, J. (2000). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. Smola, P. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers. Cambridge: MIT Press. [31] Rakotomamonjy, A. (2003). Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3, 1357–1370. · Zbl 1102.68583 [32] Rätsch, G. (2005). Benchmark repository. http://ida.first.fhg.de/projects/bench/benchmarks.htm . [33] Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42(3), 287–320. · Zbl 0969.68128 [34] Saon, G., & Padmanabhan, M. (2001). Minimum Bayes error feature selection for continuous speech recognition. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems (Vol. 13, pp. 800–806). [35] Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer. · Zbl 0833.62008 [36] Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley. · Zbl 0935.62007 [37] Vapnik, V. N., & Chapelle, O. (2000). Bounds on error expectation for support vector machines. Neural Computation, 12(9), 2013–2036. [38] Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. N. (2001). Feature selection for SVMs. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems (Vol. 13, pp. 668–674). [39] Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D.S. Touretzky, M.C. Mozer, & M.E. Hasselmo (Eds.), Advances in neural information processing systems (Vol. 8, pp. 598–604).
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.