
A new variable selection approach using random forests. (English) Zbl 1365.62417

Summary: Random forests are frequently applied because they achieve high prediction accuracy and can identify informative variables. Several variable selection approaches have been proposed to combine and intensify these qualities. An extensive review of the corresponding literature led to the development of a new approach that is based on the theoretical framework of permutation tests and satisfies important statistical properties. A comparison with eight other popular variable selection methods in three simulation studies and four real data applications indicated that the new approach can also be used to control the test-wise and family-wise error rate, provides higher power to distinguish relevant from irrelevant variables, and leads to models that are among the very best performing ones. In addition, it is equally applicable to regression and classification problems.
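The general idea of permutation-test-based variable selection with random forests can be sketched as follows. This is a minimal illustration, not the authors' exact algorithm: the toy data, the scikit-learn forest, and the Bonferroni adjustment are assumptions made for the sake of the example.

```python
# Sketch (illustrative only, NOT the paper's algorithm): compare each
# variable's observed importance to a null distribution obtained by
# refitting the forest on data with a permuted response, then adjust
# the resulting p-values to control the family-wise error rate.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data: variables 0 and 1 are informative, 2-4 are pure noise.
n, p = 150, 5
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)

def importances(X, y):
    """Fit a forest and return its variable importance scores."""
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    return rf.fit(X, y).feature_importances_

obs = importances(X, y)

# Null distribution: importances recomputed after permuting the
# response, which breaks any association between y and the predictors.
n_perm = 199
null = np.array([importances(X, rng.permutation(y)) for _ in range(n_perm)])

# One-sided permutation p-value per variable (with the usual +1 correction).
pvals = (1 + (null >= obs).sum(axis=0)) / (n_perm + 1)

# A Bonferroni adjustment controls the family-wise error rate across the
# p variables; a Benjamini-Yekutieli adjustment would control the FDR instead.
selected = np.flatnonzero(pvals * p <= 0.05)
print("p-values:", np.round(pvals, 3))
print("selected variables:", selected)
```

With the fixed seeds above, the two informative variables obtain small p-values and survive the adjustment, while the noise variables do not; the same recipe carries over to regression by swapping in a regression forest.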

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
05C80 Random graphs (graph-theoretic aspects)

Software:

party; UCI-ml; GeneSrF; C4.5; R

References:

[1] Altmann, A.; Tolosi, L.; Sander, O.; Lengauer, T., Permutation importance: a corrected feature importance measure, Bioinformatics, 26, 10, 1340-1347, (2010), URL: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/10/1340
[2] Archer, K.; Kimes, R., Empirical characterization of random forest variable importance measures, Computational Statistics & Data Analysis, 52, 4, 2249-2260, (2008) · Zbl 1452.62027
[3] Austin, P. C.; Tu, J. V., Bootstrap methods for developing predictive models, The American Statistician, 58, 2, 131-137, (2004), URL: http://www.jstor.org/stable/27643521 · Zbl 1182.62093
[4] Benjamini, Y.; Yekutieli, D., The control of the false discovery rate in multiple testing under dependency, The Annals of Statistics, 29, 4, 1165-1188, (2001) · Zbl 1041.62061
[5] Boulesteix, A.-L.; Strobl, C.; Augustin, T.; Daumer, M., Evaluating microarray-based classifiers: an overview, Cancer Informatics, 6, 77-97, (2008)
[6] Breiman, L., Bagging predictors, Machine Learning, 24, 2, 123-140, (1996) · Zbl 0858.68080
[7] Breiman, L., Random forests, Machine Learning, 45, 1, 5-32, (2001) · Zbl 1007.68152
[8] Breiman, L., Cutler, A., 2008. Random forests. http://www.stat.berkeley.edu/users/breiman/RandomForests/cc_home.htm (accessed: 03.02.11).
[9] Breiman, L.; Friedman, J.; Stone, C. J.; Olshen, R. A., Classification and regression trees, (1984), Chapman & Hall/CRC, URL: http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20&path=ASIN/0412048418
[10] Chehata, N.; Guo, L.; Mallet, C., Airborne lidar feature selection for urban classification using random forests, Scanning, XXXVIII, c, 207-212, (2009), URL: http://www.mendeley.com/research/airborne-lidar-feature-selection-urban-classification-using-random-forests/
[11] Cutler, D. R.; Edwards, T. C.; Beard, K. H.; Cutler, A.; Hess, K. T.; Gibson, J.; Lawler, J. J., Random forests for classification in ecology, Ecology, 88, 11, 2783-2792, (2007), URL: http://www.esajournals.org/doi/abs/10.1890/07-0539.1
[12] Díaz-Uriarte, R.; Alvarez de Andrés, S., Gene selection and classification of microarray data using random forest, BMC Bioinformatics, 7, 1, 3, (2006), URL: http://www.biomedcentral.com/1471-2105/7/3
[13] Dobra, A.; Gehrke, J., Bias correction in classification tree construction, (Brodley, C. E.; Danyluk, A. P., Proceedings of the Eighteenth International Conference on Machine Learning, ICML 2001, (2001), Morgan Kaufmann Williams College, Williamstown, MA, USA), 90-97
[14] Efron, B., Estimating the error rate of a prediction rule: improvement on cross-validation, Journal of the American Statistical Association, 78, 382, 316-331, (1983) · Zbl 0543.62079
[15] Efron, B.; Tibshirani, R. J., (An Introduction to the Bootstrap, Chapman & Hall/CRC Monographs on Statistics & Applied Probability, (1994), Chapman and Hall/CRC), URL: http://www.worldcat.org/isbn/0412042312
[16] Efron, B.; Tibshirani, R. J., Improvements on cross-validation: the.632+ bootstrap method, Journal of the American Statistical Association, 92, 438, 548-560, (1997) · Zbl 0887.62044
[17] Frank, A., Asuncion, A., 2010. UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
[18] Genuer, R., Michel, V., Eger, E., Thirion, B., 2010a. Random forests based feature selection for decoding fMRI data. In: Proceedings Compstat 2010, Paris, France. August, Number 267, pp. 1-8.
[19] Genuer, R., Morlais, I., Toussile, W., 2011. Gametocytes infectiousness to mosquitoes: variable selection using random forests, and zero inflated models. Research Report RR-7497, INRIA, 01. URL: http://hal.inria.fr/inria-00550980/en/.
[20] Genuer, R.; Poggi, J.-M.; Tuleau-Malot, C., Variable selection using random forests, Pattern Recognition Letters, 31, 14, 2225-2236, (2010), URL: http://www.sciencedirect.com/science/article/B6V15-4YNC1M2-2/2/933ac5ac7bf3d118fbaa2313fe369439
[21] Goldstein, B.; Hubbard, A.; Cutler, A.; Barcellos, L., An application of random forests to a genome-wide association dataset: methodological considerations & new findings, BMC Genetics, 11, 1, 49, (2010), URL: http://www.biomedcentral.com/1471-2156/11/49
[22] Good, P., Permutation tests: A practical guide to resampling methods for testing hypotheses, (2000), Springer, URL: http://www.worldcat.org/isbn/038798898X · Zbl 0942.62049
[23] Good, P., Introduction to statistics through resampling methods and R/S-plus, (2005), Wiley-Interscience New York · Zbl 1094.62002
[24] Guyon, I.; Elisseeff, A., An introduction to variable and feature selection, Journal of Machine Learning Research, 3, 1157-1182, (2003), URL: http://portal.acm.org/citation.cfm?id=944919.944968 · Zbl 1102.68556
[25] Harrison, D. J.; Rubinfeld, D. L., Hedonic housing prices and the demand for clean air, Journal of Environmental Economics and Management, 5, 1, 81-102, (1978), URL: http://ideas.repec.org/a/eee/jeeman/v5y1978i1p81-102.html · Zbl 0375.90023
[26] Hastie, T.; Tibshirani, R.; Eisen, M.; Alizadeh, A.; Levy, R.; Staudt, L.; Chan, W.; Botstein, D.; Brown, P., ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns, Genome Biology, 1, 2, (2000), research0003.1-research0003.21
[27] Hastie, T.; Tibshirani, R. J.; Friedman, J. H., The elements of statistical learning, (2009), Springer
[28] Hothorn, T., Hornik, K., Strobl, C., Zeileis, A., 2008. party: a laboratory for recursive part(y)itioning. R package version 0.9-9993. URL: http://CRAN.R-project.org/package=party.
[29] Hothorn, T.; Hornik, K.; Zeileis, A., Unbiased recursive partitioning, Journal of Computational and Graphical Statistics, 15, 3, 651-674, (2006), URL: http://pubs.amstat.org/doi/abs/10.1198/106186006X133933
[30] Jiang, H.; Deng, Y.; Chen, H.-S.; Tao, L.; Sha, Q.; Chen, J.; Tsai, C.-J.; Zhang, S., Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes, BMC Bioinformatics, 5, 1, 81, (2004), URL: http://www.biomedcentral.com/1471-2105/5/81
[31] Kim, H.; Loh, W., Classification trees with unbiased multiway splits, Journal of the American Statistical Association, 96, 589-604, (2001)
[32] Kim, Y.; Wojciechowski, R.; Sung, H.; Mathias, R.; Wang, L.; Klein, A.; Lenroot, R.; Malley, J.; Bailey-Wilson, J., Evaluation of random forests performance for genome-wide association studies in the presence of interaction effects, BMC Proceedings, 3, Suppl. 7, S64, (2009), URL: http://www.biomedcentral.com/1753-6561/3/S7/S64
[33] Lausen, B.; Sauerbrei, W.; Schumacher, M., Classification and regression trees (CART) used for the exploration of prognostic factors measured on different scales, (Dirschedl, P.; Ostermann, R., Computational Statistics, (1994), Physica-Verlag Heidelberg), 483-496
[32] Kim, Y.; Wojciechowski, R.; Sung, H.; Mathias, R.; Wang, L.; Klein, A.; Lenroot, R.; Malley, J.; Bailey-Wilson, J., Evaluation of random forests performance for genome-wide association studies in the presence of interaction effects, BMC Proceedings, 3, Suppl. 7, S64, (2009), URL: http://www.biomedcentral.com/1753-6561/3/S7/S64
[33] Lausen, B.; Sauerbrei, W.; Schumacher, M., Classification and regression trees (cart) used for the exploration of prognostic factors measured on different scales, (Dirschedl, P.; Ostermann, R., Computational Statistics, (1994), Physica-Verlag Heidelberg), 483-496
[34] Little, M.; McSharry, P.; Roberts, S.; Costello, D.; Moroz, I., Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection, BioMedical Engineering OnLine, 6, 1, 23, (2007), URL: http://www.biomedical-engineering-online.com/content/6/1/23
[35] Lunetta, K.; Hayward, B. L.; Segal, J.; Van Eerdewegh, P., Screening large-scale association study data: exploiting interactions using random forests, BMC Genetics, 5, 1, (2004)
[36] Nicodemus, K.; Malley, J.; Strobl, C.; Ziegler, A., The behaviour of random forest permutation-based variable importance measures under predictor correlation, BMC Bioinformatics, 11, 1, 110, (2010)
[37] Qiu, X.; Xiao, Y.; Gordon, A.; Yakovlev, A., Assessing stability of gene selection in microarray data analysis, BMC Bioinformatics, 7, 1, (2006)
[38] Quinlan, J. R., (C4.5: Programs for Machine Learning, Morgan Kaufmann Series in Machine Learning, (1993), Morgan Kaufmann), URL: http://www.worldcat.org/isbn/1558602380
[39] R Development Core Team, 2011. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/. ISBN: 3-900051-07-0.
[40] Rodenburg, W.; Heidema, A. G.; Boer, J. M.A.; Bovee-Oudenhoven, I. M.J.; Feskens, E. J.M.; Mariman, E. C.M.; Keijer, J., A framework to identify physiological responses in microarray-based gene expression studies: selection and interpretation of biologically relevant genes, Physiological Genomics, 33, 1, 78-90, (2008), URL: http://physiolgenomics.physiology.org/content/33/1/78.abstract
[41] Sandri, M.; Zuccolotto, P., Variable selection using random forests, (Zani, S.; Cerioli, A.; Riani, M.; Vichi, M., Data Analysis, Classification and the Forward Search, Studies in Classification, Data Analysis, and Knowledge Organization, (2006), Springer Berlin, Heidelberg), 263-270
[42] Sauerbrei, W., The use of resampling methods to simplify regression models in medical statistics, Journal of the Royal Statistical Society. Series C. Applied Statistics, 48, 3, 313-329, (1999) · Zbl 0939.62114
[43] Sauerbrei, W.; Royston, P.; Binder, H., Selection of important variables and determination of functional form for continuous predictors in multivariable model building, Statistics in Medicine, 26, 30, 5512-5528, (2007)
[44] Schwarz, D.; Szymczak, S.; Ziegler, A.; König, I., Picking single-nucleotide polymorphisms in forests, BMC Proceedings, 1, Suppl. 1, S59, (2007), URL: http://www.biomedcentral.com/1753-6561/1/S1/S59
[45] Shao, J., Linear model selection by cross-validation, Journal of the American Statistical Association, 88, 422, 486-494, (1993) · Zbl 0773.62051
[46] Strobl, C.; Boulesteix, A.-L.; Augustin, T., Unbiased split selection for classification trees based on the Gini index, Computational Statistics & Data Analysis, 52, 1, 483-501, (2007) · Zbl 1452.62469
[47] Strobl, C.; Boulesteix, A.-L.; Kneib, T.; Augustin, T.; Zeileis, A., Conditional variable importance for random forests, BMC Bioinformatics, 9, 1, 307+, (2008)
[48] Strobl, C.; Boulesteix, A.-L.; Zeileis, A.; Hothorn, T., Bias in random forest variable importance measures: illustrations, sources and a solution, BMC Bioinformatics, 8, 1, 25, (2007), URL: http://www.biomedcentral.com/1471-2105/8/25
[49] Strobl, C.; Malley, J.; Tutz, G., An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests, Psychological Methods, 14, 4, 323-348, (2009)
[50] Strobl, C., Zeileis, A., 2008. Danger: high power!—exploring the statistical properties of a test for random forest variable importance. URL: http://epub.ub.uni-muenchen.de/2111/.
[51] Svetnik, V.; Liaw, A.; Tong, C.; Wang, T., Application of Breiman’s random forest to modeling structure-activity relationships of pharmaceutical molecules, (Roli, F.; Kittler, J.; Windeatt, T., Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 3077, (2004), Springer Berlin, Heidelberg), 334-343
[52] Tang, R.; Sinnwell, J.; Li, J.; Rider, D.; de Andrade, M.; Biernacka, J., Identification of genes and haplotypes that predict rheumatoid arthritis using random forests, BMC Proceedings, 3, Suppl. 7, S68, (2009), URL: http://www.biomedcentral.com/1753-6561/3/S7/S68
[53] Touw, W. G.; Bayjanov, J. R.; Overmars, L.; Backus, L.; Boekhorst, J.; Wels, M.; van Hijum, S. A.F. T., Data mining in the life sciences with random forest: a walk in the park or lost in the jungle?, Briefings in Bioinformatics, (2012), URL: http://bib.oxfordjournals.org/content/early/2012/07/10/bib.bbs034.abstract
[54] van Wieringen, W. N.; Kun, D.; Hampel, R.; Boulesteix, A.-L., Survival prediction using gene expression data: a review and comparison, Computational Statistics & Data Analysis, 53, 5, 1590-1603, (2009), Statistical genetics & statistical genomics: where biology, epistemology, statistics, and computation collide. URL: http://www.sciencedirect.com/science/article/pii/S0167947308002946 · Zbl 1453.62225
[55] Venables, W. N.; Ripley, B. D., Modern applied statistics with S, (2003), Springer New York, USA, URL: http://www.worldcat.org/isbn/0387954570 · Zbl 1006.62003
[56] Wang, M.; Chen, X.; Zhang, H., Maximal conditional chi-square importance in random forests, Bioinformatics, 26, 6, 831-837, (2010), URL: http://bioinformatics.oxfordjournals.org/content/26/6/831.abstract
[57] White, A.; Liu, W., Bias in information based measures in decision tree induction, Machine Learning, 15, 3, 321-329, (1994) · Zbl 0942.68718
[58] Winham, S.; Colby, C.; Freimuth, R.; Wang, X.; de Andrade, M.; Huebner, M.; Biernacka, J., SNP interaction detection with random forests in high-dimensional genetic data, BMC Bioinformatics, 13, 1, 164, (2012), URL: http://www.biomedcentral.com/1471-2105/13/164
[59] Yang, W.; Gu, C. C., Selection of important variables by statistical learning in genome-wide association analysis, BMC Proceedings, 3, Suppl. 7, S70, (2009), URL: http://www.biomedcentral.com/1753-6561/3/S7/S70
[60] Zhang, P., Model selection via multifold cross validation, Annals of Statistics, 21, 1, 299-313, (1993), URL: http://www.jstor.org/stable/3035592 · Zbl 0770.62053
[61] Zhou, Q.; Hong, W.; Luo, L.; Yang, F., Gene selection using random forest and proximity differences criterion on DNA microarray data, Journal of Convergence Information Technology, 5, 6, 161-170, (2010)