
A variable selection method based on Tabu search for logistic regression models. (English) Zbl 1176.90268

Summary: A Tabu search method is proposed and analysed for selecting the variables that are subsequently used in logistic regression models. The aim is to find, from among a set of \(m\) variables, a smaller subset that enables the efficient classification of cases. Reducing dimensionality has well-known advantages that are summarized in the literature. The specific problem consists of finding, for a small integer value of \(p\), a subset of size \(p\) of the original set of variables that yields the greatest percentage of hits in logistic regression. The proposed Tabu search method performs a deep search of the solution space, alternating between a basic phase (which uses simple moves) and a diversification phase (which explores regions not previously visited). Testing shows that it obtains significantly better results than the stepwise, backward or forward methods used by classic statistical packages. Some results of applying these methods are presented.
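The alternation the summary describes (simple moves in a basic phase, plus periodic diversification) can be illustrated with a toy sketch. This is not the authors' algorithm: the function names, the plain gradient-descent fit, the swap-move neighbourhood, the tabu tenure and the random-restart diversification are all assumptions made for the sake of a small runnable example.

```python
# Illustrative sketch only: Tabu search over size-p variable subsets,
# scored by the training hit rate of a logistic regression fitted by
# plain gradient descent. All design choices here are assumptions.
import math
import random

def fit_logistic(X, y, lr=0.5, epochs=200):
    """Gradient-descent fit of a logistic model; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = p - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def hits(X, y, subset):
    """Fraction of correctly classified cases using only the given variables."""
    Xs = [[row[j] for j in subset] for row in X]
    w, b = fit_logistic(Xs, y)
    correct = 0
    for xi, yi in zip(Xs, y):
        p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
        correct += (p >= 0.5) == (yi == 1)
    return correct / len(y)

def tabu_search(X, y, p, iters=30, tenure=3):
    """Return (best subset of size p, its hit rate)."""
    m = len(X[0])
    current = random.sample(range(m), p)
    best, best_val = current[:], hits(X, y, current)
    tabu = {}  # variable index -> iteration until which it may not re-enter
    for it in range(iters):
        # Basic phase: best non-tabu swap of one inside variable for one outside.
        outside = [j for j in range(m) if j not in current]
        candidates = []
        for j_out in current:
            for j_in in outside:
                if tabu.get(j_in, -1) > it:
                    continue
                trial = [j_in if j == j_out else j for j in current]
                candidates.append((hits(X, y, trial), trial, j_out))
        if not candidates:
            continue
        val, current, dropped = max(candidates, key=lambda t: t[0])
        tabu[dropped] = it + tenure  # dropped variable is tabu for a few iterations
        if val > best_val:
            best, best_val = current[:], val
        # Crude diversification: jump to a fresh random subset now and then.
        if it % 10 == 9:
            current = random.sample(range(m), p)
    return best, best_val
```

On synthetic data whose class label depends on only two of the variables, this sketch typically recovers a subset containing those variables; the paper's method uses a more elaborate neighbourhood and diversification scheme than the random restart shown here.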

MSC:

90B40 Search theory
