Finding predictive gene groups from microarray data. (English) Zbl 1047.62103

Summary: Microarray experiments generate large datasets with expression values for thousands of genes, but for no more than a few dozen samples. A challenging task with these data is to reveal groups of genes that act together and whose collective expression is strongly associated with an outcome variable of interest. To find these groups, we suggest the use of supervised algorithms: these are procedures that use external information about the response variable for grouping the genes.
We present Pelora, an algorithm based on penalized logistic regression analysis that combines gene selection, gene grouping and sample classification in a supervised, simultaneous way. With an empirical study on six different microarray datasets, we show that Pelora identifies gene groups whose expression centroids have very good predictive potential and yield results that can keep up with state-of-the-art classification methods based on single genes. Thus, our gene groups can be beneficial in medical diagnostics and prognostics, but they may also provide more biological insights into gene function and regulation.
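The idea of scoring a gene group by the penalized logistic fit of its expression centroid can be illustrated with a small sketch. This is not the Pelora implementation: the greedy forward search, the single-group restriction, the ridge penalty on the slope only, the function names (fit_penalized_logistic, grow_gene_group), the penalty value lam and the toy data are all assumptions made for illustration.

```python
import math

def _sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def _loss(b0, b1, z, y, lam):
    # Ridge-penalized negative log-likelihood of y ~ logistic(b0 + b1 * z).
    total = 0.5 * lam * b1 * b1
    for zi, yi in zip(z, y):
        t = b0 + b1 * zi
        total += max(t, 0.0) + math.log1p(math.exp(-abs(t))) - yi * t
    return total

def fit_penalized_logistic(z, y, lam=1.0, iters=25):
    # Damped Newton iterations for a one-predictor logistic model.
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, yi in zip(z, y):
            p = _sigmoid(b0 + b1 * zi)
            w = p * (1.0 - p)
            g0 += p - yi
            g1 += (p - yi) * zi
            h00 += w
            h01 += w * zi
            h11 += w * zi * zi
        g1 += lam * b1
        h11 += lam
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        # Halve the step until the penalized loss does not increase.
        step, cur = 1.0, _loss(b0, b1, z, y, lam)
        while step > 1e-8 and _loss(b0 - step * d0, b1 - step * d1, z, y, lam) > cur:
            step *= 0.5
        b0 -= step * d0
        b1 -= step * d1
    return b0, b1, _loss(b0, b1, z, y, lam)

def grow_gene_group(X, y, lam=1.0, max_size=5):
    # X: one row per sample, one column per gene. Greedily add the gene whose
    # inclusion in the group centroid lowers the penalized loss the most.
    group, sums, best = [], [0.0] * len(X), float("inf")
    while len(group) < max_size:
        candidate = None
        for g in range(len(X[0])):
            if g in group:
                continue
            z = [(s + row[g]) / (len(group) + 1) for s, row in zip(sums, X)]
            loss = fit_penalized_logistic(z, y, lam)[2]
            if loss < best - 1e-9:
                best, candidate = loss, g
        if candidate is None:
            break
        group.append(candidate)
        sums = [s + row[candidate] for s, row in zip(sums, X)]
    return group, best

# Toy data (invented): genes 0 and 2 each separate the classes imperfectly,
# gene 1 is uninformative; rows are samples, columns are genes.
X = [[-1.2, -0.5, -0.9], [-0.8, 0.5, 0.2], [-1.0, -0.5, -1.1], [0.3, 0.5, -1.0],
     [1.0, 0.5, -0.3], [0.9, -0.5, 1.0], [-0.2, -0.5, 1.2], [1.1, 0.5, 0.8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
group, loss = grow_gene_group(X, y, lam=0.1, max_size=2)
print(sorted(group), round(loss, 3))
```

In this toy example the centroid of the two informative genes separates the classes even though neither gene does so alone, which is the kind of collective signal the summary refers to.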


62P10 Applications of statistics to biology and medical sciences; meta analysis
62J12 Generalized linear models (logistic models)
65C60 Computational problems in statistics (MSC2010)
62H30 Classification and discrimination; cluster analysis (statistical aspects)
92D10 Genetics and epigenetics

