A multicriteria approach to find predictive and sparse models with stable feature selection for high-dimensional data. (English) Zbl 1397.92016
Summary: Finding a good predictive model for a high-dimensional data set can be challenging. For genetic data, it is not only important to find a model with high predictive accuracy, but also that the model uses only a few features and that the selection of these features is stable. This is because, in bioinformatics, the models are used not only for prediction but also for drawing biological conclusions, which makes the interpretability and reliability of the model crucial. We suggest using three target criteria when fitting a predictive model to a high-dimensional data set: the classification accuracy, the stability of the feature selection, and the number of chosen features. As it is unclear which measure is best suited for evaluating stability, we first compare a variety of stability measures. We conclude that the Pearson correlation has the best theoretical and empirical properties. We also find that, for assessing stability, it is most important that a measure includes a correction for chance or for large numbers of chosen features. Then, we analyse Pareto fronts and conclude that it is possible to find models with a stable selection of few features without losing much predictive accuracy.
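To make the two key ingredients of the summary concrete, here is a minimal R sketch (R being the language of the software cited in the references); it is not the authors' implementation, and the function names and toy data are hypothetical. It computes stability as the mean pairwise Pearson correlation between the binary indicator vectors of repeatedly selected feature sets, and then filters a set of candidate models down to the Pareto front over the three target criteria.

## A minimal sketch, not the authors' implementation; names and toy data
## are hypothetical.

# Stability as the mean pairwise Pearson correlation between the binary
# indicator vectors of the feature sets selected on resampled data sets.
stability_pearson <- function(selections, p) {
  # selections: list of integer vectors of selected feature indices
  # p: total number of candidate features
  Z <- sapply(selections, function(s) as.numeric(seq_len(p) %in% s))
  C <- cor(Z)               # correlations between all pairs of runs
  mean(C[upper.tri(C)])     # average over distinct pairs of runs
}

set.seed(1)
p <- 1000
# Ten subsampling runs that agree on a core of 20 features plus some noise:
sels <- replicate(10, c(1:20, sample(21:p, 5)), simplify = FALSE)
stability_pearson(sels, p)  # close to 1 for highly overlapping selections

# Pareto filter over the three target criteria: accuracy and stability are
# maximised, the number of chosen features is minimised.
models <- data.frame(accuracy  = c(0.90, 0.88, 0.91, 0.85),
                     stability = c(0.70, 0.85, 0.60, 0.90),
                     n_feats   = c(40, 15, 60, 10))
dominates <- function(a, b) {
  # a dominates b: at least as good in every criterion, better in one
  ge <- c(a$accuracy >= b$accuracy, a$stability >= b$stability, a$n_feats <= b$n_feats)
  gt <- c(a$accuracy >  b$accuracy, a$stability >  b$stability, a$n_feats <  b$n_feats)
  all(ge) && any(gt)
}
front <- sapply(seq_len(nrow(models)), function(i)
  !any(sapply(seq_len(nrow(models))[-i], function(j) dominates(models[j, ], models[i, ]))))
models[front, ]             # the non-dominated (Pareto-optimal) models

Because the Pearson correlation mean-centres the indicator vectors, agreement that would be expected by chance alone, e.g. from very large selected sets, does not inflate the score; this is one way to read the correction-for-chance requirement mentioned in the summary.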

MSC:
92B15 General biostatistics
62P10 Applications of statistics to biology and medical sciences; meta analysis
92D10 Genetics and epigenetics
References:
[1] Lang, M.; Kotthaus, H.; Marwedel, P.; Weihs, C.; Rahnenführer, J.; Bischl, B., Automatic model selection for high-dimensional survival analysis, Journal of Statistical Computation and Simulation, 85, 1, 62-76, (2015)
[2] Kalousis, A.; Prados, J.; Hilario, M., Stability of feature selection algorithms: a study on high-dimensional spaces, Knowledge and Information Systems, 12, 1, 95-116, (2007)
[3] He, Z.; Yu, W., Stable feature selection for biomarker discovery, Computational Biology and Chemistry, 34, 4, 215-225, (2010) · Zbl 1403.92068
[4] Lausser, L.; Müssel, C.; Maucher, M.; Kestler, H. A., Measuring and visualizing the stability of biomarker selection techniques, Computational Statistics, 28, 1, 51-65, (2013) · Zbl 1305.65052
[5] Nogueira, S.; Brown, G., Measuring the stability of feature selection, Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, 9852, 442-457, (2016), Cham: Springer International Publishing, Cham
[6] Alelyani, S.; Zhao, Z.; Liu, H., A dilemma in assessing stability of feature selection algorithms, Proceedings of the 2011 IEEE International Conference on High Performance Computing and Communications
[7] Wang, H.; Khoshgoftaar, T. M.; Wald, R.; Napolitano, A., A novel dataset-similarity-aware approach for evaluating stability of software metric selection techniques, Proceedings of the 2012 IEEE 13th International Conference on Information Reuse and Integration, IRI 2012
[8] Meinshausen, N.; Bühlmann, P., Stability selection, Journal of the Royal Statistical Society. Series B. Statistical Methodology, 72, 4, 417-473, (2010) · Zbl 1411.62142
[9] Boulesteix, A.-L.; Slawski, M., Stability and aggregation of ranked gene lists, Briefings in Bioinformatics, 10, 5, 556-568, (2009)
[10] Lee, S.; Rahnenführer, J.; Lang, M., Robust selection of cancer survival signatures from high-throughput genomic data using two-fold subsampling, PLoS ONE, 9, 10, (2014)
[11] Awada, W.; Khoshgoftaar, T. M.; Dittman, D.; Wald, R.; Napolitano, A., A review of the stability of feature selection techniques for bioinformatics data, Proceedings of the 2012 IEEE International Conference on Information Reuse and Integration
[12] Abeel, T.; Helleputte, T.; Van de Peer, Y.; Dupont, P.; Saeys, Y., Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics, 26, 3, 392-398, (2009)
[13] Davis, C. A.; Gerick, F.; Hintermair, V.; Friedel, C. C.; Fundel, K.; Küffner, R.; Zimmer, R., Reliable gene signatures for microarray classification: assessment of stability and performance, Bioinformatics, 22, 19, 2356-2363, (2006)
[14] Dessì, N.; Pascariello, E.; Pes, B., A comparative analysis of biomarker selection techniques, BioMed Research International, 2013, (2013)
[15] Dittman, D.; Khoshgoftaar, T. M.; Wald, R.; Wang, H., Stability analysis of feature ranking techniques on biological datasets, Proceedings of the 2011 IEEE International Conference on Bioinformatics and Biomedicine
[16] Haury, A.; Gestraud, P.; Vert, J., The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures, PLoS ONE, 6, 12, (2011)
[17] Lee, H. W.; Lawton, C.; Na, Y. J.; Yoon, S., Robustness of chemometrics-based feature selection methods in early cancer detection and biomarker discovery, Statistical Applications in Genetics and Molecular Biology, 12, 2, 207-223, (2013)
[18] Saeys, Y.; Abeel, T.; Van De Peer, Y., Robust feature selection using ensemble feature selection techniques, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5212, 2, 313-325, (2008)
[19] Schirra, L.-R.; Lausser, L.; Kestler, H. A., Analysis of Large and Complex Data, (2016), Springer
[20] Jaccard, P., Étude comparative de la distribution florale dans une portion des Alpes et du Jura, Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547-579, (1901)
[21] Dice, L. R., Measures of the amount of ecologic association between species, Ecology, 26, 3, 297-302, (1945)
[22] Ochiai, A., Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions, Bulletin of the Japanese Society of Scientific Fisheries, 22, 9, 526-530, (1957)
[23] Zucknick, M.; Richardson, S.; Stronach, E., Comparing the characteristics of gene expression profiles derived by univariate and multivariate classification methods, Statistical Applications in Genetics and Molecular Biology, 7, 1, 1-34, (2008) · Zbl 1276.92033
[24] Lustgarten, J. L.; Gopalakrishnan, V.; Visweswaran, S., Measuring stability of feature selection in biomedical datasets, AMIA Annual Symposium Proceedings, 2009, 406-410, (2009)
[25] Novovicová, J.; Somol, P.; Pudil, P., A new measure of feature selection algorithms’ stability, Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, ICDMW 2009
[26] Somol, P.; Novovicová, J., Evaluating stability and comparing output of feature selectors that optimize feature subset cardinality, IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 11, 1921-1939, (2010)
[27] Kuncheva, L. I., A stability index for feature selection, Proceedings of the 25th IASTED International Conference on Artificial Intelligence and Applications (AIA ’07)
[28] Sammut, C.; Webb, G. I., Encyclopedia of Machine Learning, (2011), New York, NY, USA: Springer, New York, NY, USA · Zbl 1211.68001
[29] Peng, H.; Long, F.; Ding, C., Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 8, 1226-1238, (2005)
[30] Hofner, B.; Mayr, A.; Robinzonov, N.; Schmid, M., Model-based boosting in R: a hands-on tutorial using the R Package mboost, Computational Statistics, 29, 1-2, 3-35, (2014) · Zbl 1306.65069
[31] Bühlmann, P.; Yu, B., Boosting with the L2 loss, Journal of the American Statistical Association, 98, 462, 324-339, (2003) · Zbl 1041.62029
[32] Yuan, G.-X.; Ho, C.-H.; Lin, C.-J., An improved GLMNET for L1-regularized logistic regression, Journal of Machine Learning Research (JMLR), 13, 1, 1999-2030, (2012) · Zbl 1432.68404
[33] Izenman, A. J., Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning, (2013), New York, NY, USA: Springer, New York, NY, USA
[34] Miettinen, K., Nonlinear Multiobjective Optimization, (2004), Norwell, Mass, USA: Kluwer Academic Publishers, Norwell, Mass, USA
[35] Stiglic, G.; Kokol, P., Stability of ranked gene lists in large microarray analysis studies, Journal of Biomedicine and Biotechnology, 2010, (2010)
[36] Vanschoren, J.; van Rijn, J. N.; Bischl, B.; Torgo, L., OpenML: networked science in machine learning, ACM SIGKDD Explorations Newsletter, 15, 2, 49-60, (2013)
[37] The Cancer Genome Atlas Research Network, Comprehensive molecular characterization of gastric adenocarcinoma, Nature, 513, 202-209, (2014)
[38] R Core Team, R: A Language and Environment for Statistical Computing, (2016), Vienna: R Foundation for Statistical Computing, Vienna
[39] Bischl, B.; Lang, M.; Kotthoff, L., mlr: machine learning in R, Journal of Machine Learning Research (JMLR), 17, 170, 1-5, (2016) · Zbl 1392.68007
[40] Bischl, B.; Lang, M.; Mersmann, O.; Rahnenführer, J.; Weihs, C., Batchjobs and batchexperiments: Abstraction mechanisms for using R in batch environments, Journal of Statistical Software, 64, 11, 1-25, (2015)
[41] Lang, M., fmrmr: Fast mRMR, R package version 0.1, (2015)
[42] Karatzoglou, A.; Hornik, K.; Smola, A.; Zeileis, A., kernlab—an S4 package for kernel methods in R, Journal of Statistical Software, 11, 9, 1-20, (2004)
[43] Helleputte, T.; Gramme, P., LiblineaR: Linear Predictive Models Based on the LIBLINEAR C/C++ Library, R package version 1.94-2, (2015)
[44] Hothorn, T.; Bühlmann, P.; Kneib, T.; Schmid, M.; Hofner, B., mboost: Model-Based Boosting, R package version 2.6-0, (2015)
[45] Wright, M. N.; Ziegler, A., Ranger: a fast implementation of random forests for high dimensional data in C++ and R, Journal of Statistical Software, 77, 1, (2017)
[46] Sing, T.; Sander, O.; Beerenwinkel, N.; Lengauer, T., ROCR: visualizing classifier performance in R, Bioinformatics, 21, 20, 3940-3941, (2005)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.