The EAS approach to variable selection for multivariate response data in high-dimensional settings. (English) Zbl 07731274

Summary: In this paper, we develop an epsilon admissible subsets (EAS) model selection approach for performing group variable selection in the high-dimensional multivariate regression setting. The EAS strategy is designed to estimate a posterior-like, generalized fiducial distribution over a parsimonious class of models when the predictors are correlated and/or no sparsity assumption holds. The effectiveness of the approach for this purpose is demonstrated empirically in simulation studies, where it is compared to other state-of-the-art model/variable selection procedures. Furthermore, assuming a matrix-Normal linear model, we show that the EAS strategy achieves strong model selection consistency in the high-dimensional setting, provided that a sparse, true data-generating set of predictors exists. In contrast to Bayesian approaches to model selection, our generalized fiducial approach avoids having to specify arbitrary prior distributions for the model parameters while simultaneously penalizing model complexity; instead, it allows for inference directly on the model complexity. The method is illustrated on yeast data, where it is used to identify significant cell-cycle-regulating transcription factors.

MSC:

62H12 Estimation in multivariate analysis
