×

A comparison of simulated annealing algorithms for variable selection in principal component analysis and discriminant analysis. (English) Zbl 1506.62029

Summary: Variable selection is a venerable problem in multivariate statistics. Simulated annealing is one of a variety of metaheuristics that can be gainfully employed for variable selection; however, its effectiveness is influenced by algorithm design features such as the construction of the initial subset, the maximum and minimum temperatures, the cooling scheme, and the process for generating trial subsets in the neighborhood of the incumbent subset. These design features were manipulated to produce 24 versions of a simulated annealing algorithm for the problem of selecting exactly \(p\) out of \(m\) candidate variables. The versions were then compared within the contexts of principal component analysis and discriminant analysis. The results suggest some complex and interesting interactions among the design features, yet some robust versions across the two studies were established.

MSC:

62-08 Computational methods for problems pertaining to statistics
62H25 Factor analysis and principal components; correspondence analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P15 Applications of statistics to psychology
90C59 Approximation methods and heuristics in mathematical programming
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Aarts, E.; Korst, J.; Michiels, W., Simulated annealing, (Burke, E. K.; Kendall, G., Search Methodologies, (2005), Springer New York), 187-210
[2] Balabin, R. M.; Smirnov, S. V., Variable selection in near-infrared spectroscopy: benchmarking of feature selection methods on biodiesel data, Anal. Chim. Acta, 692, 63-72, (2011)
[3] Beale, E. M.L.; Kendall, M. G.; Mann, D. W., The discarding of variables in multivariate analysis, Biometrika, 54, 357-366, (1967)
[4] Brusco, M. J., On the performance of simulated annealing for large-scale L2 unidimensional scaling, J. Classification, 23, 255-268, (2006) · Zbl 1336.62174
[5] Brusco, M. J.; Jacobs, L. W., A simulated annealing approach to the cyclic staff scheduling problem, Nav. Res. Logist., 40, 69-84, (1993) · Zbl 0769.90056
[6] Brusco, M. J.; Jacobs, L. W.; Bongiorno, R. J.; Lyons, D. V.; Tang, B., Improving personnel scheduling at airline stations, Oper. Res., 43, 741-751, (1995) · Zbl 0843.90069
[7] Brusco, M. J.; Jacobs, L. W.; Thompson, G. M., A morphing procedure to supplement a simulated annealing heuristic for cost-and coverage-correlated set-covering problems, Ann. Oper. Res., 86, 611-627, (1999) · Zbl 0922.90112
[8] Brusco, M. J.; Köhn, H.-F., Exemplar-based clustering via simulated annealing, Psychometrika, 74, 457-475, (2009) · Zbl 1272.62112
[9] Brusco, M. J.; Singh, R.; Steinley, D., Variable neighborhood search heuristics for selecting a subset of variables in principal component analysis, Psychometrika, 74, 705-726, (2009) · Zbl 1179.62085
[10] Brusco, M. J.; Steinley, D., Exact and approximate algorithms for variable selection in linear discriminant analysis, Comput. Statist. Data Anal., 55, 123-131, (2011) · Zbl 1247.62153
[11] Brusco, M. J.; Steinley, D.; Cradit, J. D., An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression, Technometrics, 51, 306-315, (2009)
[12] Cadima, J., Cerdeira, J.O., Duarte Silva, P., Minhoto, M., 2012, The subselect R package, Retrieved from: http://cran.r-project.org/web/packages/subselect/subselect.pdf.
[13] Cadima, J.; Cerdeira, J. O.; Minhoto, M., Computational aspects of algorithms for variable selection in the context of principal components, Comput. Statist. Data Anal., 47, 225-236, (2004) · Zbl 1429.62216
[14] Cadima, J.; Jolliffe, I. T., Loadings and correlations in the interpretation of principal components, J. Appl. Stat., 22, 203-214, (1995)
[15] Cadima, J.; Jolliffe, I. T., Variable selection and the interpretation of principal subspaces, J. Agric. Biol. Environ. Stat., 6, 62-79, (2001)
[16] Cerny, V., A thermodynamical approach to the traveling salesman problem, J. Optim. Theory Appl., 45, 41-51, (1985) · Zbl 0534.90091
[17] Ceulemans, E.; Van Mechelen, I., CLASSI: A classification model for the study of sequential processes and individual differences therein, Psychometrika, 73, 107-124, (2008) · Zbl 1143.62092
[18] Ceulemans, E.; Van Mechelen, I.; Leenen, I., The local minima problem in hierarchical classes analysis: an evaluation of a simulated annealing algorithm and various multistart procedures, Psychometrika, 72, 377-391, (2007) · Zbl 1286.62102
[19] Chiyoshi, F.; Galvão, D., A statistical analysis of simulated annealing applied to the \(p\)-Median problem, Ann. Oper. Res., 96, 61-74, (2000) · Zbl 0997.90042
[20] Dray, S., On the number of principal components: a test of dimensionality based on measurements of similarity between matrices, Comput. Statist. Data Anal., 52, 2228-2237, (2008) · Zbl 1452.62409
[21] Drezner, Z.; Marcoulides, G. A., Using simulated annealing for selection in multiple regression analysis, Mult. Linear Regression Viewp., 25, 1-4, (1999)
[22] Duarte Silva, A. P., Efficient variable screening for multivariate analysis, J. Multivariate Anal., 76, 35-62, (2001) · Zbl 0996.62063
[23] Duarte Silva, A. P., Discarding variables in principal component analysis: algorithms for all-subsets comparisons, Comput. Statist., 17, 251-271, (2002) · Zbl 1010.62052
[24] Efroymson, M. A., Multiple regression analysis, (Ralston, A.; Wilf, H. S., Mathematical Methods for Digital Computers, (1960), Wiley New York), 191-203
[25] Fisher, R. A., The use of multiple measurements in taxonomic problems, Ann. Eugenics, 7, 179-188, (1936)
[26] Fisher, R. A., The statistical utilization of multiple measurements, Ann. Eugenics, 8, 376-386, (1938) · Zbl 0019.35703
[27] Fouskakis, D.; Draper, D., Comparing stochastic optimization methods for variable selection in binary outcome prediction, with application to health policy, J. Amer. Statist. Assoc., 103, 1367-1381, (2008) · Zbl 1286.62065
[28] Furnival, G. M.; Wilson, R. W., Regression by leaps and bounds, Technometrics, 16, 499-512, (1974) · Zbl 0294.62079
[29] Ganeshanandam, S.; Krzanowski, W. J., On selecting variables and assessing their performance in linear discriminant analysis, Aust. J. Stat., 31, 433-447, (1989) · Zbl 0707.62120
[30] Gatu, C.; Kontoghiorghes, E. J., Branch-and-bound algorithms for computing the best-subset regression models, J. Comput. Graph. Statist., 15, 139-156, (2006)
[31] Gatu, C.; Yanev, P.; Kontoghiorghes, E. J., A graph approach to generate all possible subset regression submodels, Comput. Statist. Data Anal., 52, 799-815, (2007) · Zbl 1452.62061
[32] Glover, F.; Laguna, M., Tabu search, (Reeves, C., Modern Heuristic Techniques for Combinatorial Problems, (1993), Blackwell Oxford), 70-141
[33] Goldberg, D. E., Genetic algorithms in search, optimization, and machine learning, (1989), Addison-Wesley New York · Zbl 0721.68056
[34] Habbema, J. D.F.; Hermans, J., Selection of variables in discriminant analysis by—statistic and error rate, Technometrics, 19, 487-493, (1977) · Zbl 0369.62002
[35] Hofmann, M.; Gatu, C.; Kontoghiorghes, E. J., Efficient algorithms for computing the best subset regression models for large-scale problems, Comput. Statist. Data Anal., 52, 16-29, (2007) · Zbl 1452.62497
[36] Hogarty, K. Y.; Kromrey, J. D.; Ferron, J. M.; Hines, C. V., Selection of variables in exploratory factor analysis: an empirical comparison of a stepwise and traditional approach, Psychometrika, 69, 593-611, (2004) · Zbl 1306.62428
[37] Horton, I. F.; Russell, J. S.; Moore, A. W., Multivariate-covariance and canonical analysis: a method for selecting the most effective discriminators in a multivariate situation, Biometrics, 24, 845-858, (1968)
[38] Huberty, C. J., Issues in the use and interpretation of discriminant analysis, Psychol. Bull., 95, 156-171, (1984)
[39] Jolliffe, I. T., Principal component analysis, (2002), Springer New York · Zbl 1011.62064
[40] Jolliffe, I. T., Discarding variables in a principal component analysis, I: artificial data, Appl. Stat., 21, 160-173, (1972)
[41] Jolliffe, I. T., Discarding variables in a principal component analysis, II: real data, Appl. Stat., 22, 21-31, (1973)
[42] Kano, Y.; Harada, A., Stepwise variable selection in factor analysis, Psychometrika, 65, 7-22, (2000) · Zbl 1291.62219
[43] Kirkpatrick, S.; Gelatt, C. D.; Vecchi, M. P., Optimization by simulated annealing, Science, 220, 671-680, (1983) · Zbl 1225.90162
[44] Krzanowski, W. J., Selection of variables to preserve multivariate data structure using principal components, Appl. Stat., 36, 22-33, (1987)
[45] Krzanowski, W. J.; Hand, D. J., A simple method for screening variables before clustering of microarray data, Comput. Statist. Data Anal., 53, 2747-2753, (2009) · Zbl 1453.62127
[46] Lee, J.-S.; Park, C. H.; Ebrahimi, T., Theory and applications of hybrid simulated annealing, (Zelinka, I.; Snasel, V.; Abraham, A., Handbook of Optimization: From Classical to Modern Approach, (2012), Springer-Verlag Berlin), 395-422
[47] Using MATLAB (version 7), (2005), The MathWorks, Inc Natick, MA
[48] McCabe, G. P., Principal variables, Technometrics, 26, 137-144, (1984) · Zbl 0548.62037
[49] McCabe, G. P., Computations for variable selection in discriminant analysis, Technometrics, 17, 103-109, (1975) · Zbl 0295.62068
[50] McHenry, C. E., An efficient computational procedure for selecting a best subset, Appl. Stat., 27, 291-296, (1978) · Zbl 0436.62057
[51] McKay, R. J.; Campbell, N. A., Variable selection techniques in discriminant analysis: I. description, British J. Math. Statist. Psych., 35, 1-29, (1982) · Zbl 0491.62048
[52] McKay, R. J.; Campbell, N. A., Variable selection techniques in discriminant analysis: II. allocation, British J. Math. Statist. Psych., 35, 30-41, (1982) · Zbl 0491.62049
[53] McLachlan, G. J., A criterion for selecting variables for the linear discriminant function, Biometrics, 32, 529-534, (1976) · Zbl 0334.62023
[54] Metropolis, N.; Rosenbluth, A.; Rosenbluth, M.; Teller, A.; Teller, E., Equation of state calculations by fast computing machines, J. Chem. Phys., 21, 1087-1092, (1953)
[55] Mladenović, N.; Hansen, P., Variable neighborhood search, Comput. Oper. Res., 24, 1097-1100, (1997) · Zbl 0889.90119
[56] Mori, Y.; Iizuka, M.; Tarumi, T.; Tanaka, Y., Variable selection in principal component analysis, (Härdle, W.; Mori, Y.; Vieu, P., Statistical Methods for Biostatistics and Related Fields, (2007), Springer-Verlag Berlin), 265-284
[57] Murillo, A.; Vera, J.-F.; Heiser, W. J., A permutation-translation simulated annealing algorithm for \(L_1\) and \(L_2\) unidimensional scaling, J. Classification, 22, 119-138, (2005) · Zbl 1084.62055
[58] Murray, G. D., A cautionary note on selection of variables in discriminant analysis, Appl. Stat., 26, 246-250, (1977)
[59] Nourani, Y.; Andresen, B., A comparison of simulated annealing cooling strategies, J. Phys. A: Math Gen., 31, 8373-8385, (1998) · Zbl 0962.82068
[60] Örkcü, H. H., Subset selection in multiple linear regression models: a hybrid of genetic and simulated annealing algorithms, Appl. Math. Comput., 219, 11018-11028, (2013) · Zbl 1302.62159
[61] Pacheco, J.; Casado, S.; Nunez, L., A variable selection method based on tabu search for logistic regression models, European J. Oper. Res., 199, 506-511, (2009) · Zbl 1176.90268
[62] Pacheco, J.; Casado, S.; Nunez, L.; Gomez, O., Analysis of new variable selection methods for discriminant analysis, Comput. Statist. Data Anal., 51, 1463-1478, (2006) · Zbl 1157.62442
[63] Pacheco, J.; Casado, S.; Porras, S., Exact methods for variable selection in principal component analysis: guide functions and pre-selection, Comput. Statist. Data Anal., 57, 95-111, (2013) · Zbl 1365.62224
[64] Pendharkar, P.; Nanda, S., A misclassification cost-minimizing evolutionary-neural classification approach, Nav. Res. Logist., 53, 432-447, (2006) · Zbl 1127.62088
[65] Ramsay, J. O.; ten Berge, J. M.F.; Styan, G. P.H., Matrix correlation, Psychometrika, 49, 403-423, (1984) · Zbl 0581.62048
[66] Robert, P.; Escoufier, Y., A unifying tool for linear multivariate statistical methods, Appl. Stat., 25, 257-265, (1976)
[67] Roy, J., Step-down procedures in multivariate statistics, Ann. Math. Stat., 29, 177-187, (1958)
[68] Seaman, S. L.; Young, D. M., A non-parametric variable selection algorithm for allocatory discriminant analysis, Educ. Psychol. Meas., 50, 837-841, (1990)
[69] Snappin, S. N.; Knoke, J. D., Estimation of error rates in discriminant analysis with selection of variables, Biometrics, 45, 289-299, (1989) · Zbl 0715.62116
[70] Swierenga, H.; de Groot, P. J.; de Weijer, A. P.; Derksen, M. W.J.; Buydens, L. M.C., Improvement of PLS model transferability by robust wavelength, Chemometr. Intell. Lab. Syst., 41, 237-248, (1998)
[71] Tanaka, Y.; Mori, Y., Principal component analysis based on a subset of variables: variable selection and sensitivity analysis, Amer. J. Math. Management Sci., 17, 61-89, (1997) · Zbl 1007.62517
[72] Trendafilov, N. T.; Jolliffe, I. T., DALASS: variable selection in discriminant analysis via the lasso, Comput. Statist. Data Anal., 51, 3718-3736, (2007) · Zbl 1161.62379
[73] Urbakh, V. Y., Linear discriminant analysis: loss of discriminating power when a variable is omitted, Biometrics, 27, 531-534, (1971)
[74] Vande Gaer, E.; Ceulemans, E.; Van Mechelen, I.; Kuppens, P., The CLASSI-N method for the study of sequential processes, Psychometrika, 77, 85-105, (2012) · Zbl 1284.62756
[75] van Rosmalen, J.; Groenen, P. J.F.; Trejos, J.; Castillo, W., Optimization strategies for two-mode partitioning, J. Classification, 26, 155-181, (2009) · Zbl 1337.62145
[76] Vera, J.-F.; Heiser, W. J.; Murillo, A., Global optimization in any Minkowski metric: a permutation-translation simulated annealing algorithm for multidimensional scaling, J. Classification, 24, 277-301, (2007) · Zbl 1159.91469
[77] Weiner, J. M.; Dunn, O. J., Elimination of variates in linear discrimination problems, Biometrics, 22, 268-275, (1966)
[78] Wood, M.; Jolliffe, I. T.; Horgan, G. W., Variable selection for discriminant analysis of fish sounds using matrix correlations, J. Agric. Biol. Environ. Stat., 10, 1-16, (2005)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.