zbMATH — the first resource for mathematics

Using simulated annealing to optimize the feature selection problem in marketing applications. (English) Zbl 1116.90069
Summary: The feature selection (also, specification) problem is concerned with finding the most influential subset of predictors in predictive modeling from a much larger set of potential predictors that may contain hundreds of candidates. The problem belongs to the realm of combinatorial optimization, where the objective is to find the subset of variables that optimizes the value of some goodness-of-fit function. Owing to its dimensionality, the feature selection problem is NP-hard. Most of the available predictors are noisy or redundant and add very little, if anything, to the prediction power of the model. Using all the predictors in the model often results in strong over-fitting and very poor predictions. Constructing a prediction model by evaluating all possible subsets is impractical because of the computational effort involved, and examining the contribution of each predictor separately is inaccurate because it ignores the inter-correlations between predictors. As a result, no analytic solution to the feature selection problem is available, and one must resort to heuristics. In this paper we employ the simulated annealing (SA) approach, one of the leading stochastic search methods, to specify a large-scale linear regression model. The SA results are compared to those of the more common stepwise regression (SWR) approach to model specification. The models are applied to realistic data sets in database marketing. We also use simulated data sets to investigate which data characteristics make the SWR approach equivalent to the supposedly superior SA approach.
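The SA search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a toy scoring function (with hypothetical "true" predictors and arbitrary penalty weights) stands in for a real goodness-of-fit criterion such as AIC on a fitted regression, and all names and parameters are illustrative.

```python
# Sketch of simulated annealing for feature (subset) selection.
# A toy score replaces a regression criterion such as AIC: the
# "true" model uses predictors {0, 1, 2}, and the score penalizes
# both missing true predictors and included noise predictors.
import math
import random

random.seed(0)

N_PREDICTORS = 20
TRUE_SUBSET = {0, 1, 2}  # hypothetical influential predictors

def score(subset):
    """Stand-in for a goodness-of-fit criterion (higher is better)."""
    missing = len(TRUE_SUBSET - subset)  # lost explanatory power
    noise = len(subset - TRUE_SUBSET)    # over-fitting penalty
    return -2.0 * missing - 0.5 * noise

def neighbor(subset):
    """Flip the inclusion status of one randomly chosen predictor."""
    j = random.randrange(N_PREDICTORS)
    return subset ^ {j}  # symmetric difference toggles membership

def anneal(t0=2.0, cooling=0.995, steps=5000):
    current = set()
    best = current
    t = t0
    for _ in range(steps):
        cand = neighbor(current)
        delta = score(cand) - score(current)
        # Metropolis acceptance rule: always accept improvements,
        # accept worse moves with probability exp(delta / t).
        if delta >= 0 or random.random() < math.exp(delta / t):
            current = cand
        if score(current) > score(best):
            best = current
        t *= cooling  # geometric cooling schedule
    return best

best_subset = anneal()
print(sorted(best_subset))
```

The acceptance of occasional worsening moves at high temperature is what distinguishes SA from a greedy stepwise search: it lets the chain escape locally attractive but globally poor subsets before the cooling schedule freezes the search.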

90B60 Marketing, advertising
62J05 Linear regression; mixed models
Full Text: DOI
[1] Akaike, H., A new look at the statistical model identification, IEEE transactions on automatic control, 19, 716-723, (1974) · Zbl 0314.62039
[2] Almuallim, H.; Dietterich, T.G., Learning with many irrelevant features, (), 547-552
[3] Almuallim, H.; Dietterich, T.G., Learning Boolean concepts in the presence of many irrelevant features, Artificial intelligence, 69, 279-306, (1994) · Zbl 0942.68657
[4] Anily, S.; Federgruen, A., Simulated annealing methods with general acceptance probabilities, Journal of applied probability, 24, 657-667, (1987) · Zbl 0628.60046
[5] Benjamini, Y.; Hochberg, Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the royal statistical society, series B, 57, 289-300, (1995) · Zbl 0809.62014
[6] Blum, A.L.; Langley, P., Selection of relevant features and examples in machine learning, Artificial intelligence, 97, 245-271, (1997) · Zbl 0904.68142
[7] Blum, A.L.; Rivest, R.L., Training a 3-node neural network is NP-complete, Neural networks, 5, 117-127, (1992)
[8] Cardie, C., Using decision trees to improve case-based learning, (), 25-32
[9] Caruana, R., Freitag, D., 1994. Greedy attribute selection. In: Proceedings of the 11th International Conference on Machine Learning.
[10] Churchill, G.A., Marketing research, (1999), The Dryden Press
[11] Clyde, M.; George, E.I., Empirical Bayes estimation in wavelet nonparametric regression, (), 309-322 · Zbl 0936.62008
[12] Clyde, M., George, E.I., 2000. Flexible empirical Bayes estimation for wavelets. Journal of the Royal Statistics Society, Series B, 62, 681-698. · Zbl 0957.62006
[13] Cruz, J.R.; Dorea, C.C.Y., Simple conditions for the convergence of simulated annealing type algorithms, Journal of applied probability, 35, 885-892, (1998) · Zbl 0932.65068
[14] Dash, M.; Liu, H., Feature selection for classification, Intelligent data analysis: an international journal, 1, 3, (1997)
[15] DeGroot, M.H., Probability and statistics, (1991), Addison-Wesley Reading, MA · Zbl 0101.12902
[16] Devijver, P.A.; Kittler, J., Pattern recognition: A statistical approach, (1982), Prentice Hall Englewood Cliffs, NJ
[17] Eiben, A.E.; Aarts, E.H.L.; Van Hee, K.M., Global convergence of genetic algorithms, (), 4-12
[18] Foster, D.P.; George, E.I., The risk inflation criterion for multiple regression, Annals of statistics, 22, 1947-1975, (1994) · Zbl 0829.62066
[19] Foster, D.P.; George, E.I., Calibration and empirical Bayes variable selection, Biometrika, 87, 4, 731-747, (2000) · Zbl 1029.62008
[20] Geman, S.; Bienenstock, E.; Doursat, R., Neural networks and the bias/variance dilemma, Neural computation, 4, 1, 1-58, (1992)
[21] George, E.; McCulloch, R., Variable selection via Gibbs sampling, Journal of the American statistical association, 88, 881-889, (1993)
[22] Hancock, T.R., 1989. On the difficulty of finding small consistent decision trees. Unpublished Manuscript, Harvard University.
[23] Hannan, E.J.; Quinn, B.G., The determination of the order of an autoregression, Journal of the royal statistical society, series B, 2, 190-195, (1979) · Zbl 0408.62076
[24] Hodges, J.L., The significance probability of the Smirnov two-sample test, Arkiv for matematik, 3, 469-486, (1957) · Zbl 0084.14502
[25] Hurvich, C.M.; Tsai, C.L., Regression and time series model selection in small samples, Biometrika, 76, 297-307, (1989) · Zbl 0669.62085
[26] Hurvich, C.M.; Tsai, C.L., A crossvalidatory AIC for hard wavelet thresholding in spatially adaptive function estimation, Biometrika, 85, 3, (1998) · Zbl 0954.62040
[27] Jain, A.; Zongker, D., Feature selection: evaluation, application, and small sample performance, IEEE transactions on pattern analysis and machine intelligence, 19, 2, 153-158, (1997)
[28] John, G.H., Kohavi, R., Pfleger, K., 1994. Irrelevant features and the subset selection problem. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 121-129.
[29] Johnstone, I.M., Silverman, B.W., 1998. Empirical Bayes approaches to mixture problems and wavelet regression. Tech Report, University of Bristol.
[30] Kass, R.E.; Raftery, A.E., Bayes factors, Journal of the American statistical association, 90, 773-795, (1995)
[31] Kira, K.; Rendell, L.A., A practical approach to feature selection, ()
[32] Kirkpatrick, S.; Gelatt, C.; Vecchi, M., Optimization by simulated annealing, Science, 220, 671-680, (1983) · Zbl 1225.90162
[33] Kohavi, R.; John, G.H., Wrappers for feature subset selection, Artificial intelligence, 97, 1-2, 273-324, (1998) · Zbl 0904.68143
[34] Kohavi, R.; Sommerfield, D., Wrappers for feature subset selection, Artificial intelligence, 97, 1-2, 273-324, (1995) · Zbl 0904.68143
[35] Koller, D., Sahami, M., 1996. Toward optimal feature selection. In: Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, July 1996, pp. 284-292.
[36] Kononenko, I., 1994. Estimating attributes: Analysis and extensions of relief. In: Bergandano, F., Raedt, L.D. (Eds.), Proceedings of the European Conference on Machine Learning.
[37] Lambert, P.J., The distribution and redistribution of income, (1993), Manchester University Press
[38] Liu, H.; Motoda, H., Feature selection for knowledge discovery and data mining, (1998), Kluwer Academic Publishers · Zbl 0908.68127
[39] Melab, N., Cahon, S., Talbi, E.-G., Duponchel, L., 2002. Parallel GA-based wrapper feature selection for spectroscopic data mining. International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops, April 2002.
[40] Metropolis, N.; Rosenbluth, A.; Rosenbluth, M.; Teller, A.; Teller, E., Equation of state calculations by fast computing machines, Journal of chemical physics, 21, 1087-1092, (1953)
[41] Miller, A.J., Subset selection in regression, (2002), Chapman and Hall, 2nd edition · Zbl 1051.62060
[42] Mitchell, T.J.; Beauchamp, J.J., Bayesian variable selection in linear regression (with discussion), Journal of the American statistical association, 83, 1023-1036, (1988) · Zbl 0673.62051
[43] Motoda, H.; Liu, H., Feature selection, Handbook of data mining and knowledge discovery, (2002), Oxford University Press, pp. 208-213
[44] Rudolph, G., Convergence analysis of canonical genetic algorithms, IEEE transactions on neural networks, special issue on evolutionary computation, 5, 1, (1994)
[45] Schlimmer, J.C., 1993. Efficiently inducing determinations: A complete and systematic search algorithm that uses optimal pruning. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 284-290.
[46] Schwarz, G., Estimating the dimension of a model, Annals of statistics, 6, 461-464, (1978) · Zbl 0379.62005
[47] Shao, J., Linear model selection by cross-validation, Journal of the American statistical association, 88, 486-494, (1993) · Zbl 0773.62051
[48] Siedlecki, W.; Sklansky, J., On automatic feature selection, International journal of pattern recognition and artificial intelligence, 2, 197-220, (1988)
[49] Tibshirani, R.; Knight, K., The covariance inflation criterion for model selection, Journal of the royal statistical society, series B, 85, 701-710, (1999)
[50] van Laarhoven, P.J.J.; Aarts, E.H.L., Simulated annealing: theory and applications, (1987), Kluwer Boston · Zbl 0643.65028
[51] Wei, C.Z., On predictive least squares principles, Annals of statistics, 20, 1-42, (1992) · Zbl 0801.62083
[52] Ye, J., On measuring and correcting the effects of data mining and model selection, Journal of the American statistical association, 93, 120-131, (1998) · Zbl 0920.62056
[53] Zheng, X.; Loh, WY., A consistent variable selection criterion for linear models with high-dimensional covariates, Statistica sinica, 7, 311-325, (1997) · Zbl 0880.62068
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.