
Robust feature screening procedures for single and mixed types of data. (English) Zbl 07194333
Summary: Feature screening procedures aim to reduce the dimensionality of data whose dimension grows exponentially with the sample size. Existing procedures each focus on a single type of predictor, either all continuous or all discrete; they cannot handle mixed types of variables, outliers, or nonlinear trends. In this paper we first propose new feature screening procedures for the different continuous/discrete combinations of response and predictor variables, based respectively on the marginal Spearman correlation, the marginal ANOVA test, the marginal Kruskal-Wallis test, the Kolmogorov-Smirnov test, the Mann-Whitney test, and smoothing spline modeling. Extensive simulation studies compare the new and existing procedures, with the aim of identifying the best robust screening procedure for each single type of data. We then combine these best procedures into a robust feature screening procedure for mixed types of data, and demonstrate its robustness against outliers and model misspecification through simulation studies and a real example.
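The marginal screening recipe described in the summary can be illustrated in a few lines: score every predictor with a rank-based marginal test matched to its type, then keep the predictors with the strongest marginal signal. The Python sketch below is a minimal illustration of that idea, not the authors' implementation; the discreteness heuristic (at most 5 distinct values), the use of marginal p-values as a common scale across test types, and the cutoff d are all assumptions made for the example.

```python
# Minimal sketch of robust marginal screening for mixed-type predictors.
# Assumptions (not from the paper): predictors with <= 5 distinct values
# are treated as discrete, and marginal p-values serve as the common
# scale for ranking heterogeneous test statistics.
import numpy as np
from scipy import stats

def marginal_pvalue(x, y):
    """P-value of a robust marginal association test for one predictor."""
    levels = np.unique(x)
    if levels.size <= 5:                          # crude "discrete" heuristic
        groups = [y[x == lev] for lev in levels]
        return stats.kruskal(*groups).pvalue      # marginal Kruskal-Wallis test
    return stats.spearmanr(x, y).pvalue           # marginal Spearman correlation

def screen(X, y, d):
    """Indices of the d predictors with the smallest marginal p-values."""
    pvals = np.array([marginal_pvalue(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(pvals)[:d]

# Toy check: 200 observations, 1000 predictors, 3 active (one discrete),
# with heavy-tailed t(2) noise to mimic an outlier-prone setting.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))
X[:, 2] = rng.integers(0, 3, size=200)            # a 3-level discrete predictor
y = X[:, 0] + 2 * X[:, 1] + 2 * (X[:, 2] == 1) + rng.standard_t(2, size=200)
print(screen(X, y, d=10))                         # typically contains 0, 1, 2
```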
MSC:
62F07 Statistical ranking and selection procedures
62G10 Nonparametric hypothesis testing
Software:
gamair; gss
References:
[1] Akaike H. Information theory and an extension of the maximum likelihood principle. In: Second international symposium on information theory, Vol. 1. Budapest: Akademiai Kiado; 1973. p. 267-281. · Zbl 0283.62006
[2] Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461-464. doi: 10.1214/aos/1176344136 · Zbl 0379.62005
[3] Mallows C. Some comments on Cp. Technometrics. 1973;15:661-675. · Zbl 0269.62061
[4] Foster D, George E. The risk inflation criterion for multiple regression. Ann Stat. 1994;22:1947-1975. doi: 10.1214/aos/1176325766 · Zbl 0829.62066
[5] Stone M. Cross-validatory choice and assessment of statistical predictions. J R Stat Soc Ser B. 1974;36:111-147. · Zbl 0308.62063
[6] Barron A, Birge L, Massart P. Risk bounds for model selection via penalization. Probab Theory Relat Fields. 1999;113:301-413. doi: 10.1007/s004400050210 · Zbl 0946.62036
[7] Frank I, Friedman J. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109-135. doi: 10.1080/00401706.1993.10485033 · Zbl 0775.62288
[8] Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58:267-288. · Zbl 0850.62538
[9] Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B. 2005;67:301-320. doi: 10.1111/j.1467-9868.2005.00503.x · Zbl 1069.62054
[10] Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101:1418-1429. doi: 10.1198/016214506000000735 · Zbl 1171.62326
[11] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96:1348-1360. doi: 10.1198/016214501753382273 · Zbl 1073.62547
[12] Zhang C. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38:894-942. doi: 10.1214/09-AOS729 · Zbl 1183.62120
[13] Chen Y, Du P, Wang Y. Variable selection in linear models. WIREs Comput Stat. 2014;6:1-9. doi: 10.1002/wics.1284
[14] Fan J, Lv J. Sure independence screening for ultra-high dimensional feature space. J R Stat Soc Ser B. 2008;70:849-911. doi: 10.1111/j.1467-9868.2008.00674.x · Zbl 1411.62187
[15] Wang H. Forward regression for ultra-high dimensional variable screening. J Am Stat Assoc. 2009;104:1512-1524. doi: 10.1198/jasa.2008.tm08516 · Zbl 1205.62103
[16] Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika. 2008;95:759-771. doi: 10.1093/biomet/asn034 · Zbl 1437.62415
[17] Hall P, Miller H. Using generalized correlation to effect variable selection in very high dimensional problems. J Comput Graph Stat. 2009;18:533-550. doi: 10.1198/jcgs.2009.08041
[18] Li G, Peng H, Zhang J, et al. Robust rank correlation based screening. Ann Stat. 2012;40:1846-1877. doi: 10.1214/12-AOS1024 · Zbl 1257.62067
[19] Fan J, Samworth R, Wu Y. Ultra-high dimensional feature selection: beyond the linear model. J Mach Learn Res. 2009;10:2013-2038. · Zbl 1235.62089
[20] Fan J, Song R. Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat. 2010;38:3567-3604. doi: 10.1214/10-AOS798 · Zbl 1206.68157
[21] Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultra-high dimensional additive models. J Am Stat Assoc. 2011;106:544-557. doi: 10.1198/jasa.2011.tm09779 · Zbl 1232.62064
[22] Zhu L, Li L, Li R, et al. Model-free feature screening for ultrahigh-dimensional data. J Am Stat Assoc. 2011;106:1464-1475. doi: 10.1198/jasa.2011.tm10563 · Zbl 1233.62195
[23] Huang D, Li R, Wang H. Feature screening for ultrahigh-dimensional categorical data with applications. J Bus Econ Stat. 2014;32:237-244. doi: 10.1080/07350015.2013.863158
[24] Fan J, Fan Y. High dimensional classification using features annealed independence rules. Ann Stat. 2008;36:2605-2637. doi: 10.1214/07-AOS504 · Zbl 1360.62327
[25] Mai Q, Zou H. The Kolmogorov filter for variable screening in high-dimensional binary classification. Biometrika. 2013;100:229-234. doi: 10.1093/biomet/ass062 · Zbl 1452.62456
[26] Cui H, Li R, Zhong W. Model-free feature screening for ultra-high dimensional discriminant analysis. J Am Stat Assoc. 2015;110:630-641. doi: 10.1080/01621459.2014.920256 · Zbl 1373.62305
[27] Goeman J, van de Geer S, de Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004;20:93-99. doi: 10.1093/bioinformatics/btg382
[28] Kim S, Volsky D. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics. 2005;6:144.
[29] Mansmann U, Meister R. Testing differential gene expression in functional groups. Goeman's global test versus an ANCOVA approach. Methods Inf Med. 2005;44:449-453. doi: 10.1055/s-0038-1633982
[30] Wang K, Li M, Bucan M. Pathway-based approaches for analysis of genomewide association studies. Am J Hum Genet. 2007;81:1278-1283. doi: 10.1086/522374
[31] Holden M, Deng S, Wojnowski L, et al. GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies. Bioinformatics. 2008;24:2784-2785. doi: 10.1093/bioinformatics/btn516
[32] Zhong H, Yang X, Kaplan L, et al. Integrating pathway analysis and genetics of gene expression for genome-wide association studies. Am J Hum Genet. 2010;86:581-591. doi: 10.1016/j.ajhg.2010.02.020
[33] Xiong Q, et al. Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets. Genome Res. 2012;22:386-397. doi: 10.1101/gr.124370.111
[34] Ma X, Zhang J. Robust model-free feature screening via quantile correlation. J Multivar Anal. 2016;143:472-480. doi: 10.1016/j.jmva.2015.10.010 · Zbl 1328.62249
[35] Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. J Am Stat Assoc. 2012;107:1129-1139. doi: 10.1080/01621459.2012.695654 · Zbl 1443.62184
[36] Kruskal W, Wallis W. Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952;47:583-621. doi: 10.1080/01621459.1952.10483441 · Zbl 0048.11703
[37] Gu C. Smoothing spline ANOVA models. 2nd ed. New York: Springer-Verlag; 2013. · Zbl 1269.62040
[38] Wood SN. Generalized additive models: an introduction with R. 2nd ed. Boca Raton: Chapman and Hall; 2017. · Zbl 1368.62004
[39] Kim Y, Gu C. Smoothing spline Gaussian regression: more scalable computation via efficient approximation. J R Stat Soc Ser B. 2004;66:337-356. doi: 10.1046/j.1369-7412.2003.05316.x · Zbl 1062.62067
[40] Gupta V, Srinivasan S, Kudli S. Prediction and classification of cardiac arrhythmia. Stanford (CA): Department of Statistics, Stanford University; 2014.
[41] Mitra M, Samanta R. Cardiac arrhythmia classification using neural networks with selected features. Proc Technol. 2013;10:76-84. doi: 10.1016/j.protcy.2013.12.339