
Variable selection for general index models via sliced inverse regression. (English) Zbl 1305.62234

Summary: Variable selection, also known as feature selection in machine learning, plays an important role in modeling high-dimensional data and is key to data-driven scientific discoveries. We consider here the problem of detecting influential variables under the general index model, in which the response depends on the predictors through an unknown function of one or more linear combinations of them. Instead of building a predictive model of the response given combinations of predictors, we model the conditional distribution of predictors given the response. This inverse modeling perspective motivates us to propose a stepwise procedure based on likelihood-ratio tests, which is effective and computationally efficient in identifying important variables without specifying a parametric relationship between predictors and the response. For example, the proposed procedure is able to detect variables with pairwise, three-way or even higher-order interactions among \(p\) predictors with a computational time of \(O(p)\) instead of \(O(p^{k})\) (with \(k\) being the highest order of interactions). Its excellent empirical performance in comparison with existing methods is demonstrated through simulation studies as well as real data examples. Consistency of the variable selection procedure when both the number of predictors and the sample size go to infinity is established.
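
The inverse-modeling idea in the summary can be illustrated with a small sketch. It is not the authors' procedure: it assumes a single marginal screening pass (no stepwise conditioning on already selected variables), a normal model for each predictor within each slice of the response, and hypothetical helper names (`slice_indices`, `marginal_lr_stat`, `screen`). The response is sorted and cut into slices; for each predictor, a likelihood-ratio statistic compares "one mean and variance for all slices" against "a separate mean and variance per slice". One pass over the \(p\) predictors is enough to flag both members of a pure pairwise interaction, because the interaction changes the within-slice variance even when the within-slice means stay equal.

```python
import numpy as np

def slice_indices(y, n_slices):
    """Split observation indices into slices of nearly equal size, ordered by y."""
    order = np.argsort(y)
    return np.array_split(order, n_slices)

def marginal_lr_stat(x, slices):
    """Likelihood-ratio statistic for one predictor under a normal model.

    Null: x has the same mean and variance in every slice.
    Alternative: each slice has its own mean and variance.
    With maximum-likelihood variances, 2 * (loglik_alt - loglik_null)
    equals n * log(pooled_var) - sum_h n_h * log(var_h).
    """
    stat = x.size * np.log(x.var())          # np.var uses ddof=0, i.e. the MLE
    for idx in slices:
        stat -= idx.size * np.log(x[idx].var())
    return stat

def screen(X, y, n_slices=5):
    """Compute the marginal statistic for every column of X (one pass over p)."""
    slices = slice_indices(y, n_slices)
    return np.array([marginal_lr_stat(X[:, j], slices) for j in range(X.shape[1])])

# Simulated example (an assumption for illustration, not data from the paper):
# the response depends on the first two predictors only through their product.
rng = np.random.default_rng(0)
n, p = 2000, 50
X = rng.standard_normal((n, p))
y = X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(n)

stats = screen(X, y)
top2 = np.argsort(stats)[-2:]   # indices of the two largest statistics
```

In this simulated setting the two interacting predictors have no marginal mean shift across slices, so a test on slice means alone would miss them; allowing slice-specific variances is what makes them visible. The number of slices and the sample size are arbitrary choices for the sketch.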

MSC:

62J02 General nonlinear regression
62H25 Factor analysis and principal components; correspondence analysis
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

RSIR; hierNet
