×

zbMATH — the first resource for mathematics

Mixed integer quadratic optimization formulations for eliminating multicollinearity based on variance inflation factor. (English) Zbl 1421.90093
Summary: Multicollinearity exists when some explanatory variables of a multiple linear regression model are highly correlated. High correlation among explanatory variables reduces the reliability of the analysis. To eliminate multicollinearity from a linear regression model, we consider how to select a subset of significant variables by means of the variance inflation factor (VIF), which is the most common indicator used in detecting multicollinearity. In particular, we adopt the mixed integer optimization (MIO) approach to subset selection. The MIO approach was proposed in the 1970s, and recently it has received renewed attention due to advances in algorithms and hardware. However, none of the existing studies have developed a computationally tractable MIO formulation for eliminating multicollinearity on the basis of VIF. In this paper, we propose mixed integer quadratic optimization (MIQO) formulations for selecting the best subset of explanatory variables subject to the upper bounds on the VIFs of selected variables. Our two MIQO formulations are based on the two equivalent definitions of VIF. Computational results illustrate the effectiveness of our MIQO formulations by comparison with conventional local search algorithms and MIO-based cutting plane algorithms.

MSC:
90C11 Mixed integer programming
90C20 Quadratic programming
62J05 Linear regression; mixed models
Software:
CPLEX; Gurobi; MAIC; R; UCI-ml
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Arthanari, T.S., Dodge, Y.: Mathematical Programming in Statistics. Wiley, New York (1981) · Zbl 0549.62002
[2] Beale, E.M.L.: Two transportation problems. In: Kreweras, G., Morlat, G. (eds.) Proceedings of the Third International Conference on Operational Research, pp. 780-788 (1963)
[3] Beale, E.M.L., Tomlin, J.A.: Special facilities in a general mathematical programming system for non-convex problems using ordered sets of variables. In: Lawrence, J. (ed.) Proceedings of the Fifth International Conference on Operational Research, pp. 447-454 (1970)
[4] Belsley, D.A., Kuh, E., Welsch, R.E.: Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, Hoboken (2005) · Zbl 0479.62056
[5] Benati, S.; García, S., A mixed integer linear model for clustering with variable selection, Comput. Oper. Res., 43, 280-285, (2014) · Zbl 1349.62258
[6] Bertsimas, D.; Dunn, J., Optimal classification trees, Mach. Learn., 136, 1039-1082, (2017) · Zbl 06841416
[7] Bertsimas, D.; King, A., OR forum: an algorithmic approach to linear regression, Oper. Res., 64, 2-16, (2016) · Zbl 1338.90272
[8] Bertsimas, D.; King, A., Logistic regression: from art to science, Stat. Sci., 32, 367-384, (2017) · Zbl 1442.62166
[9] Bertsimas, D.; King, A.; Mazumder, R., Best subset selection via a modern optimization lens, Ann. Stat., 44, 813-852, (2016) · Zbl 1335.62115
[10] Blum, AL; Langley, P., Selection of relevant features and examples in machine learning, Artif. Intell., 97, 245-271, (1997) · Zbl 0904.68142
[11] Chatterjee, S., Hadi, A.S.: Regression Analysis by Example. Wiley, Hoboken (2012) · Zbl 1263.62099
[12] Dormann, CF; Elith, J.; Bacher, S.; Buchmann, C.; Carl, G.; Carré, G.; García Marquéz, JR; Gruber, B.; Lafoourcade, B.; Leitão, PJ; Münkemüller, T.; McClean, C.; Osborne, PE; Reineking, B.; Schröder, B.; Skidmore, AK; Zurell, D.; Lautenbach, S., Collinearity: a review of methods to deal with it and a simulation study evaluating their performance, Ecography, 36, 27-46, (2013)
[13] Gurobi Optimization, Inc.: Gurobi Optimizer Reference Manual. http://www.gurobi.com (2016). Accessed 6 Oct 2017
[14] Guyon, I.; Elisseeff, A., An introduction to variable and feature selection, J. Mach. Learn. Res., 3, 1157-1182, (2003) · Zbl 1102.68556
[15] Hastie, T., Tibshirani, R., Tibshirani, R.J.: Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692 (2017)
[16] Hocking, RR, The analysis and selection of variables in linear regression, Biometrics, 32, 1-49, (1976) · Zbl 0328.62042
[17] Hoerl, AE; Kennard, RW, Ridge regression: biased estimation for nonorthogonal problems, Technometrics, 12, 55-67, (1970) · Zbl 0202.17205
[18] Huberty, CJ, Issues in the use and interpretation of discriminant analysis, Psychol. Bull., 95, 156-171, (1984)
[19] IBM: IBM ILOG CPLEX Optimization Studio. https://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/ (2015). Accessed 6 Oct 2017
[20] James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning. Springer, New York (2013) · Zbl 1281.62147
[21] Jolliffe, IT, A note on the use of principal components in regression, Appl. Stat., 31, 300-303, (1982)
[22] Kimura, K.; Waki, H., Minimization of Akaike’s information criterion in linear regression analysis via mixed integer nonlinear program, Optim. Methods Softw., 33, 633-649, (2018) · Zbl 1398.90111
[23] Kohavi, R.; John, GH, Wrappers for feature subset selection, Artif. Intell., 97, 273-324, (1997) · Zbl 0904.68143
[24] Konno, H.; Yamamoto, R., Choosing the best set of variables in regression analysis using integer programming, J. Glob. Optim., 44, 273-282, (2009) · Zbl 1178.62069
[25] Lichman, M.: UCI Machine Learning Repository. School of Information and Computer Science, University of California, Irvine. http://archive.ics.uci.edu/ml (2013)
[26] Liu, H., Motoda, H.: Computational Methods of Feature Selection. CRC Press, Boca Raton (2007) · Zbl 1130.62118
[27] Maldonado, S.; Pérez, J.; Weber, R.; Labbé, M., Feature selection for support vector machines via mixed integer linear programming, Inf. Sci., 279, 163-175, (2014) · Zbl 1354.68226
[28] Massy, WF, Principal components regression in exploratory statistical research, J. Am. Stat. Assoc., 60, 234-256, (1965)
[29] Mazumder, R.; Radchenko, P., The discrete Dantzig selector: estimating sparse linear models via mixed integer linear optimization, IEEE Trans. Inf. Theory, 63, 3053-3075, (2017) · Zbl 1368.94035
[30] Miller, A.: Subset Selection in Regression. CRC Press, Boca Raton (2002) · Zbl 1051.62060
[31] Miyashiro, R.; Takano, Y., Subset selection by Mallows’ \(C_p\): a mixed integer programming approach, Expert. Syst. Appl., 42, 325-331, (2015)
[32] Miyashiro, R.; Takano, Y., Mixed integer second-order cone programming formulations for variable selection in linear regression, Eur. J. Oper. Res., 247, 721-731, (2015) · Zbl 1346.90616
[33] R Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org (2014). Accessed 6 Oct 2017
[34] Sato, T.; Takano, Y.; Miyashiro, R.; Yoshise, A., Feature subset selection for logistic regression via mixed integer optimization, Comput. Optim. Appl., 64, 865-880, (2016) · Zbl 1352.90068
[35] Sato, T.; Takano, Y.; Miyashiro, R., Piecewise-linear approximation for feature subset selection in a sequential logit model, J. Oper. Res. Soc. Jpn., 60, 1-14, (2017) · Zbl 1371.90068
[36] Tamura, R.; Kobayashi, K.; Takano, Y.; Miyashiro, R.; Nakata, K.; Matsui, T., Best subset selection for eliminating multicollinearity, J. Oper. Res. Soc. Jpn., 60, 321-336, (2017) · Zbl 1382.90068
[37] Ustun, B.; Rudin, C., Supersparse linear integer models for optimized medical scoring systems, Mach. Learn., 102, 349-391, (2016) · Zbl 1406.62144
[38] Wilson, ZT; Sahinidis, NV, The ALAMO approach to machine learning, Comput. Chem. Eng., 106, 785-795, (2017)
[39] Wold, S., Ruhe, A., Wold, H., Dunn III, W.J.: The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J. Sci. Stat. Comput. 5, 735-743 (1984) · Zbl 0545.62044
[40] Wold, S.; Sjöström, M.; Eriksson, L., PLS-regression: a basic tool of chemometrics, Chemom. Intell. Lab. Syst., 58, 109-130, (2001)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.