A weight function method for selection of proteins to predict an outcome using protein expression data. (English) Zbl 1461.62195

Summary: Many feature selection methods are available in the literature for removing unwanted features before modelling. Existing techniques suffer from poor reproducibility because the training and validation datasets are selected at random. In this study, we propose a new resampling approach to feature selection that helps resolve this drawback. The method assigns a weight to every feature in the dataset, and candidate features are selected by applying a cut-off to the feature weights. In the illustrated example, the method selects ten features from a set of 254. The selected features are used to develop a predictive model with a predictive accuracy of 92.3%, measured as the area under the ROC curve. The results show that the method successfully selects the relevant features and yields an excellent predictive model compared with the commonly used L1, L2, and elastic net regularisation.
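
The idea of weighting features over repeated resamples and then applying a cut-off can be illustrated with a minimal Python sketch. The summary does not say how the weights are computed, so everything below is an assumption made for illustration: the weight is taken to be the fraction of random training splits in which a feature ranks among the top `top_k`, the per-split score is the absolute difference in class means divided by the overall standard deviation, and the names `feature_weights`, `select_features`, and `make_toy_data`, the toy data, and the cut-off of 0.8 are all hypothetical. It is not the paper's weight function.

```python
import random
import statistics


def feature_weights(X, y, n_resamples=200, train_frac=0.7, top_k=3, seed=0):
    """Weight each feature by how often it ranks in the top_k over random training splits.

    X is a list of rows (lists of floats); y is a list of 0/1 labels.
    The scoring rule is an illustrative assumption, not the paper's rule.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    completed = 0
    for _ in range(n_resamples):
        # Draw a random training subset without replacement.
        train = rng.sample(range(n), int(train_frac * n))
        pos = [i for i in train if y[i] == 1]
        neg = [i for i in train if y[i] == 0]
        if len(pos) < 2 or len(neg) < 2:
            continue  # skip splits that lack enough samples of one class
        scores = []
        for j in range(p):
            a = [X[i][j] for i in pos]
            b = [X[i][j] for i in neg]
            spread = statistics.pstdev(a + b) or 1.0  # avoid division by zero
            scores.append(abs(statistics.mean(a) - statistics.mean(b)) / spread)
        # Credit the top_k highest-scoring features in this split.
        ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
        completed += 1
    return [c / max(completed, 1) for c in counts]


def select_features(weights, cutoff=0.8):
    """Return the indices of features whose weight reaches the cut-off."""
    return [j for j, w in enumerate(weights) if w >= cutoff]


def make_toy_data(n=60, p=20, informative=(0, 1, 2), shift=2.0, seed=1):
    """Hypothetical data: Gaussian noise, with a mean shift in a few features for class 1."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for j in informative:
            row[j] += shift * label
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy_data()
weights = feature_weights(X, y)
selected = select_features(weights, cutoff=0.8)
print(selected)
```

Because each weight aggregates many random splits rather than depending on a single one, the selected set is far less sensitive to any one partition of the data, which is the reproducibility concern the summary raises. Fixing `seed` makes the sketch itself deterministic.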


62P10 Applications of statistics to biology and medical sciences; meta analysis
62M20 Inference from stochastic processes and prediction
62D05 Sampling theory, sample surveys
92C40 Biochemistry, molecular biology


rda; glmnet; pROC

