Sample size determination for training cancer classifiers from microarray and RNA-seq data. (English) Zbl 1454.62390

Summary: The objective of many high-dimensional microarray and RNA-seq studies is to develop a classifier of cancer patients based on characteristics of their disease. The germinal center B-cell (GCB) classifier study in lymphoma and the National Cancer Institute’s Director’s Challenge lung (DC-lung) study are two examples. In recent years, such classifiers have often been developed using regularized regression, such as the lasso. Two critical questions are whether a better classifier could be developed from a larger training set and, if so, how large the training set should be. This paper examines these two questions using an existing sample size method and a novel sample size method developed here specifically for lasso logistic regression. Both methods are based on pilot data. We reexamine the lymphoma and lung cancer data sets to evaluate the sample sizes, and use resampling to assess the estimation methods. We also study application to an RNA-seq data set. We find that it is feasible to estimate sample size for regularized logistic regression if an adequate pilot data set exists. The GCB and the DC-lung data sets appear adequate, under specific assumptions. Existing human RNA-seq data sets are by and large inadequate and cannot be used as pilot data. Pilot RNA-seq data can instead be simulated, and the methods in this paper can then be used for sample size estimation. A MATLAB program is made available.
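
The general idea of using pilot data and resampling to ask whether a larger training set would help can be illustrated with a small learning-curve sketch. This is not the paper's sample size method and not its MATLAB program: the simulated pilot data, the fixed penalty `lam`, the proximal-gradient solver, and all function names below are assumptions made purely for illustration.

```python
import math
import random


def soft_threshold(x, t):
    """Soft-thresholding operator, the proximal map of the L1 (lasso) penalty."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_lasso_logistic(X, y, lam=0.05, step=0.1, iters=100):
    """L1-penalized logistic regression via proximal gradient descent.

    lam is fixed here for simplicity (an assumption); in practice it
    would usually be tuned, for example by cross-validation.
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        g0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            r = sigmoid(b0 + sum(b * v for b, v in zip(beta, xi))) - yi
            g0 += r
            for j in range(p):
                grad[j] += r * xi[j]
        b0 -= step * g0 / n  # the intercept is not penalized
        beta = [soft_threshold(b - step * g / n, step * lam)
                for b, g in zip(beta, grad)]
    return b0, beta


def predict(model, x):
    """Classify as 1 when the fitted linear predictor is positive."""
    b0, beta = model
    return 1 if b0 + sum(b * v for b, v in zip(beta, x)) > 0 else 0


def simulate_pilot(n, p, n_informative, effect, rng):
    """Hypothetical pilot data: Gaussian features, a few shifted by class."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = effect if label == 1 else -effect
        X.append([rng.gauss(shift if j < n_informative else 0.0, 1.0)
                  for j in range(p)])
        y.append(label)
    return X, y


def learning_curve(X, y, sizes, n_rep, rng):
    """Mean held-out error rate at each training size, over random splits."""
    n = len(y)
    curve = {}
    for m in sizes:
        errs = []
        for _ in range(n_rep):
            idx = list(range(n))
            rng.shuffle(idx)
            train, test = idx[:m], idx[m:]
            model = fit_lasso_logistic([X[i] for i in train],
                                       [y[i] for i in train])
            wrong = sum(predict(model, X[i]) != y[i] for i in test)
            errs.append(wrong / len(test))
        curve[m] = sum(errs) / len(errs)
    return curve


rng = random.Random(0)
X, y = simulate_pilot(n=80, p=20, n_informative=3, effect=0.5, rng=rng)
curve = learning_curve(X, y, sizes=[10, 20, 40], n_rep=5, rng=rng)
for m in sorted(curve):
    print(m, round(curve[m], 3))
```

If the estimated error is still falling at the largest subsample size, the pilot data suggest that a larger training set could yield a better classifier. Turning such a curve into a recommended sample size requires further modeling, which this sketch does not attempt.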


62P10 Applications of statistics to biology and medical sciences; meta analysis
62J12 Generalized linear models (logistic models)


[1] Ambroise, C. and McLachlan, G. J. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA 99 6562-6566. · Zbl 1034.92013 · doi:10.1073/pnas.102102699
[2] Bi, X., Rexer, B., Arteaga, C. L., Guo, M. and Mahadevan-Jansen, A. (2014). Evaluating HER2 amplification status and acquired drug resistance in breast cancer cells using Raman spectroscopy. J. Biomed. Opt. 19 25001.
[3] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg. · Zbl 1273.62015 · doi:10.1007/978-3-642-20192-9
[4] Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed. Monographs on Statistics and Applied Probability 105. Chapman & Hall/CRC, Boca Raton, FL. · Zbl 1119.62063 · doi:10.1201/9781420010138
[5] Cook, J. R. and Stefanski, L. A. (1994). Simulation-extrapolation estimation in parametric measurement error models. J. Amer. Statist. Assoc. 89 1314-1328. · Zbl 0810.62028 · doi:10.2307/2290994
[6] Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge Series in Statistical and Probabilistic Mathematics 1. Cambridge Univ. Press, Cambridge. · Zbl 0886.62001
[7] Dettling, M. and Bühlmann, P. (2003). Boosting for tumor classification with gene expression. Bioinformatics 19 1061-1069.
[8] Dobbin, K. K. and Simon, R. M. (2007). Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics 8 101-117. · Zbl 1170.62374 · doi:10.1093/biostatistics/kxj036
[9] Dobbin, K. K. and Song, X. (2013). Sample size requirements for training high-dimensional risk predictors. Biostatistics 14 639-652.
[10] Dyrskjøt, L. (2003). Classification of bladder cancer by microarray expression profiling: Towards a general clinical use of microarrays in cancer diagnostics. Expert Rev. Mol. Diagn. 3 635-647.
[11] Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. J. Amer. Statist. Assoc. 70 892-898. · Zbl 0319.62039 · doi:10.2307/2285453
[12] Efron, B. and Tibshirani, R. (1997). Improvements on cross-validation: The .632\(+\) bootstrap method. J. Amer. Statist. Assoc. 92 548-560. · Zbl 0887.62044 · doi:10.2307/2965703
[13] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348-1360. · Zbl 1073.62547 · doi:10.1198/016214501753382273
[14] Frazee, A. C., Langmead, B. and Leek, J. T. (2011). ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics 12 449.
[15] Friedman, J., Hastie, T. and Tibshirani, R. (2008). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 1-22.
[16] Geisser, S. (1993). Predictive Inference: An Introduction. Chapman & Hall, New York. · Zbl 0824.62001
[17] Graveley, B. R., Brooks, A. N., Carlson, J. W., Duff, M. O., Landolin, J. M., Yang, L., Artieri, C. G., van Baren, M. J., Boley, N., Booth, B. W., Brown, J. B., Cherbas, L., Davis, C. A., Dobin, A., Li, R., Lin, W., Malone, J. H., Mattiuzzo, N. R., Miller, D., Sturgill, D., Tuch, B. B., Zaleski, C., Zhang, D., Blanchette, M., Dudoit, S., Eads, B., Green, R. E., Hammonds, A., Jiang, L., Kapranov, P., Langton, L., Perrimon, N., Sandler, J. E., Wan, K. H., Willingham, A., Zhang, Y., Zou, Y., Andrews, J., Bickel, P. J., Brenner, S. E., Brent, M. R., Cherbas, P., Gingeras, T. R., Hoskins, R. A., Kaufman, T. C., Oliver, B. and Celniker, S. E. (2011). The developmental transcriptome of Drosophila melanogaster. Nature 471 473-479.
[18] Hanash, S. M., Baik, C. L. and Kallioniemi, O. (2011). Emerging molecular biomarkers-blood-based strategies to detect and monitor cancer. Nat. Rev. Clin. Oncol. 8 142-150.
[19] Hanfelt, J. J. and Liang, K.-Y. (1995). Approximate likelihood ratios for general estimating functions. Biometrika 82 461-477. · Zbl 0831.62025 · doi:10.1093/biomet/82.3.461
[20] Hanfelt, J. J. and Liang, K.-Y. (1997). Approximate likelihoods for generalized linear errors-in-variables models. J. Roy. Statist. Soc. Ser. B 59 627-637. · Zbl 1090.62547 · doi:10.1111/1467-9868.00087
[21] Huang, Y. and Wang, C. Y. (2000). Cox regression with accurate covariates unascertainable: A nonparametric-correction approach. J. Amer. Statist. Assoc. 95 1209-1219. · Zbl 1008.62040 · doi:10.2307/2669761
[22] Huang, Y. and Wang, C. Y. (2001). Consistent functional methods for logistic regression with errors in covariates. J. Amer. Statist. Assoc. 96 1469-1482. · Zbl 1051.62066 · doi:10.1198/016214501753382372
[23] McShane, L. M. and Hayes, D. F. (2012). Publication of tumor marker research results: The necessity for complete and transparent reporting. J. Clin. Oncol. 30 4223-4232.
[24] Meier, L., van de Geer, S. and Bühlmann, P. (2008). The group Lasso for logistic regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 53-71. · Zbl 1400.62276 · doi:10.1111/j.1467-9868.2007.00627.x
[25] Moehler, T. M., Seckinger, A., Hose, D., Andrulis, M., Moreaux, J., Hielscher, T., Willhauck-Fleckenstein, M., Merling, A., Bertsch, U., Jauch, A., Goldschmidt, H., Klein, B. and Schwartz-Albiez, R. (2013). The glycome of normal and malignant plasma cells. PLoS ONE 8 e83719.
[26] Mukherjee, S., Tamayo, P., Rogers, S., Rifkin, R., Engle, A., Campbell, C., Golub, T. R. and Mesirov, J. P. (2003). Estimating dataset size requirements for classifying DNA microarray data. J. Comput. Biol. 10 119-142.
[27] Novick, S. J. and Stefanski, L. A. (2002). Corrected score estimation via complex variable simulation extrapolation. J. Amer. Statist. Assoc. 97 472-481. · Zbl 1046.65008 · doi:10.1198/016214502760047005
[28] Pfeffer, U., Romeo, F., Noonan, D. M. and Albini, A. (2009). Prediction of breast cancer metastasis by genomic profiling: Where do we stand? Clin. Exp. Metastasis 26 547-558.
[29] Rosenwald, A., Wright, G., Chan, W. C., Connors, J. M., Campo, E., Fisher, R. I., Gascoyne, R. D., Muller-Hermelink, H. K., Smeland, E. B., Giltnane, J. M., Hurt, E. M., Zhao, H., Averett, L., Yang, L., Wilson, W. H., Jaffe, E. S., Simon, R., Klausner, R. D., Powell, J., Duffey, P. L., Longo, D. L., Greiner, T. C., Weisenburger, D. D., Sanger, W. G., Dave, B. J., Lynch, J. C., Vose, J., Armitage, J. O., Montserrat, E., López-Guillermo, A., Grogan, T. M., Miller, T. P., LeBlanc, M., Ott, G., Kvaloy, S., Delabie, J., Holte, H., Krajci, P., Stokke, T. and Staudt, L. M. (Lymphoma/Leukemia Molecular Profiling Project) (2002). The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med. 346 1937-1947.
[30] Safo, S., Song, X. and Dobbin, K. K. (2015). Supplement to “Sample size determination for training cancer classifiers from microarray and RNA-seq data.” · Zbl 1454.62390 · doi:10.1214/15-AOAS825
[31] Shedden, K., Taylor, J. M., Enkemann, S. A., Tsao, M. S., Yeatman, T. J., Gerald, W. L., Eschrich, S., Jurisica, I., Giordano, T. J., Misek, D. E., Chang, A. C., Zhu, C. Q., Strumpf, D., Hanash, S., Shepherd, F. A., Ding, K., Seymour, L., Naoki, K., Penell, N., Weir, B., Verhaak, R., Ladd-Acosta, C., Golub, T., Gruidl, M., Sharma, A., Szoke, J., Zakowski, M., Rusch, V., Kris, M., Viale, A., Motoi, N., Travis, W., Conley, B., Seshan, V. E., Meyerson, M., Kuick, R., Dobbin, K. K., Lively, T., Jacobson, J. W. and Beer, D. G. (2008). Gene expression-based survival prediction in lung adenocarcinoma: A multisite, blinded validation study. Nat. Med. 14 822-827.
[32] Simon, R. (2010). Clinical trials for predictive medicine: New challenges and paradigms. Clin. Trials 7 516-524.
[33] Simon, R. M., Radmacher, M. D., Dobbin, K. K. and McShane, L. M. (2003). Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl. Cancer Inst. 95 14-18.
[34] Stefanski, L. A. and Carroll, R. J. (1987). Conditional scores and optimal scores for generalized linear measurement-error models. Biometrika 74 703-716. · Zbl 0632.62052
[35] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 58 267-288. · Zbl 0850.62538
[36] Varma, S. and Simon, R. M. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 7 91.
[37] Zhang, J. X., Song, W., Chen, Z. H., Wei, J. H., Liao, Y. J., Lei, J., Hu, M., Chen, G. Z., Liao, B., Lu, J., Zhao, H. W., Chen, W., He, Y. L., Wang, H. Y., Xie, D. and Luo, J. H. (2013). Prognostic and predictive value of a microRNA signature in stage II colon cancer: A microRNA expression analysis. Lancet Oncol. 14 1295-1306.
[38] Zhu, J. and Hastie, T. (2004). Classification of gene microarrays by penalized logistic regression. Biostatistics 5 427-443. · Zbl 1154.62406 · doi:10.1093/biostatistics/kxg046
[39] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418-1429. · Zbl 1171.62326 · doi:10.1198/016214506000000735
[40] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301-320. · Zbl 1069.62054 · doi:10.1111/j.1467-9868.2005.00503.x
[41] Zwiener, I., Frisch, B. and Binder, H. (2014). Transforming RNA-seq data to improve the performance of prognostic gene signatures. PLoS ONE 9 e85150.