Feature selection and tumor classification for microarray data using relaxed Lasso and generalized multi-class support vector machine. (English) Zbl 1406.92192
Summary: Gene expression data now provide a reference for tumor diagnosis at the molecular level. It is challenging to select the feature genes relevant to classification from high-dimensional, small-sample gene expression data and to separate different tumor subtypes, or normal samples from patient samples. In this paper, we present a new method for tumor classification that combines relaxed Lasso (least absolute shrinkage and selection operator) with a generalized multi-class support vector machine (rL-GenSVM). The tumor datasets are first z-score normalized. Relaxed Lasso is then used to select feature gene sets on the training set, and finally a generalized multi-class support vector machine (GenSVM) serves as the classifier. We select four two-class datasets and four multi-class datasets for the experiments, and four classifiers are used to predict and compare classification accuracy on the test set. For comparison with other proposed methods, we also evaluate classification accuracy by 10-fold cross-validation on all samples of each dataset and obtain satisfactory results. The experimental results show that the proposed method selects fewer feature genes and achieves higher classification accuracy. rL-GenSVM uses regularization parameters to avoid overfitting and can be widely applied to the classification of high-dimensional, small-sample tumor data. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/rL-GenSVM/.
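The first two steps named in the summary (z-score normalization, then relaxed Lasso gene selection) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation, which is at the repository linked above. It assumes the standard two-stage form of relaxed Lasso: an ordinary Lasso with penalty lam picks the active set, and a second Lasso restricted to that set is refit with the weaker penalty phi * lam, where 0 < phi <= 1. All function names, the coordinate-descent solver, the regression on centered -1/+1 class labels, and the synthetic data are assumptions made for the example; the GenSVM classification step is omitted.

```python
import math
import random


def zscore_fit(rows):
    """Per-column (mean, std) of the rows passed in."""
    n = len(rows)
    stats = []
    for col in zip(*rows):
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        stats.append((mean, std or 1.0))  # constant column: avoid dividing by 0
    return stats


def zscore_apply(rows, stats):
    """Standardize each column with previously fitted (mean, std) pairs."""
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning zero when |z| <= t."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n = len(X)
    cols = [list(c) for c in zip(*X)]
    beta = [0.0] * len(cols)
    resid = list(y)  # residual y - X b, kept up to date as b changes
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            norm = sum(v * v for v in col) / n
            if norm == 0.0:
                continue
            rho = sum(c * r for c, r in zip(col, resid)) / n + norm * beta[j]
            new = soft_threshold(rho, lam) / norm
            if new != beta[j]:
                delta = new - beta[j]
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta


def relaxed_lasso(X, y, lam, phi):
    """Stage 1 selects features at lam; stage 2 refits them at phi * lam."""
    active = [j for j, b in enumerate(lasso_cd(X, y, lam)) if b != 0.0]
    if not active:
        return active, []
    X_sub = [[row[j] for j in active] for row in X]
    return active, lasso_cd(X_sub, y, phi * lam)


if __name__ == "__main__":
    # Synthetic stand-in for an expression matrix: 80 samples, 20 "genes".
    rng = random.Random(0)
    raw = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(80)]
    Z = zscore_apply(raw, zscore_fit(raw))
    # Hypothetical two-class labels driven by genes 0 and 1, coded -1/+1.
    labels = [1.0 if r[0] - r[1] + rng.gauss(0, 0.1) > 0 else -1.0 for r in Z]
    mean_y = sum(labels) / len(labels)
    y = [v - mean_y for v in labels]  # center: the model has no intercept
    genes, coefs = relaxed_lasso(Z, y, lam=0.2, phi=0.1)
    print("selected gene indices:", genes)
```

With phi = 1 the second stage reproduces the ordinary Lasso fit on the selected genes, while smaller phi shrinks the retained coefficients less. In a train/test setting, one would typically call zscore_fit on the training rows only and reuse those statistics for the test rows; the summary does not specify this detail.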

92C40 Biochemistry, molecular biology
68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62J05 Linear regression; mixed models
62J07 Ridge regression; shrinkage estimators (Lasso)