Feature selection and tumor classification for microarray data using relaxed Lasso and generalized multi-class support vector machine. (English) Zbl 1406.92192
Summary: The study of gene expression data now provides a reference for tumor diagnosis at the molecular level. Selecting the feature genes relevant to classification from high-dimensional, small-sample gene expression data, and then successfully separating tumor subtypes or normal samples from patient samples, is a challenging task. In this paper, we present a new method for tumor classification that combines the relaxed Lasso (least absolute shrinkage and selection operator) with a generalized multi-class support vector machine (rL-GenSVM). The tumor datasets are first z-score normalized. The relaxed Lasso is then used to select feature gene sets on the training set, and finally a generalized multi-class support vector machine (GenSVM) serves as the classifier. We select four two-class datasets and four multi-class datasets for experiments, and use four classifiers to predict and compare classification accuracy on the test set. For comparison with other proposed methods, we also run 10-fold cross-validation on all samples of each dataset and obtain satisfactory classification accuracy. The experimental results show that the proposed method selects fewer feature genes and achieves higher classification accuracy. rL-GenSVM uses regularization parameters to avoid overfitting and can be widely applied to the classification of high-dimensional, small-sample tumor data. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/rL-GenSVM/.
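
The first two steps of the pipeline described in the summary (z-score normalization, then relaxed Lasso feature selection) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation, which is available at the repository above. The function names, the synthetic data, and the penalty values `lam` and `phi` are assumptions chosen for illustration, the relaxed Lasso is written in its usual two-stage form (a Lasso with penalty `lam` picks the support, then a Lasso with the smaller penalty `phi * lam` is refit on that support), and the GenSVM classification step is omitted.

```python
import random
from statistics import mean, pstdev

def zscore_fit(rows):
    """Per-column mean and standard deviation of the given rows."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return mus, sds

def zscore_apply(rows, mus, sds):
    """Standardize rows with previously computed statistics."""
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes gene j
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def relaxed_lasso(X, y, lam, phi):
    """Stage 1 selects genes with penalty lam; stage 2 refits them with phi * lam."""
    beta_stage1 = lasso_cd(X, y, lam)
    support = [j for j, b in enumerate(beta_stage1) if b != 0.0]
    if not support:
        return support, []
    X_sel = [[row[j] for j in support] for row in X]
    return support, lasso_cd(X_sel, y, phi * lam)

# Synthetic stand-in for expression data: 40 samples, 20 "genes",
# where only gene 0 carries the class signal (labels coded as +1 / -1).
random.seed(0)
n, p = 40, 20
y = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]  # balanced, so mean(y) = 0
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
for i in range(n):
    X[i][0] += 3.0 * y[i]

mus, sds = zscore_fit(X)
X_z = zscore_apply(X, mus, sds)
support, beta_relaxed = relaxed_lasso(X_z, y, lam=0.2, phi=0.1)
print("selected gene indices:", support)
```

In a real experiment the normalization statistics and the selected gene set would be computed on the training split only and then applied to the test split, and the retained columns would be passed to the classifier.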

92C40 Biochemistry, molecular biology
68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62J05 Linear regression; mixed models
62J07 Ridge regression; shrinkage estimators (Lasso)
