Classification and clustering of sequencing data using a Poisson model. (English) Zbl 1234.62150

Summary: In recent years, advances in high throughput sequencing technology have led to a need for specialized methods for the analysis of digital gene expression data. While gene expression data measured on a microarray take on continuous values and can be modeled using the normal distribution, RNA sequencing data involve nonnegative counts and are more appropriately modeled using a discrete count distribution, such as the Poisson or the negative binomial. Consequently, analytic tools that assume a Gaussian distribution (such as classification methods based on linear discriminant analysis and clustering methods that use Euclidean distance) may not perform as well for sequencing data as methods that are based upon a more appropriate distribution.
We propose new approaches for performing classification and clustering of observations on the basis of sequencing data. Using a Poisson loglinear model, we develop an analog of diagonal linear discriminant analysis that is appropriate for sequencing data. We also propose an approach for clustering sequencing data using a new dissimilarity measure that is based upon the Poisson model. We demonstrate the performances of these approaches in a simulation study, on three publicly available RNA sequencing data sets, and on a publicly available chromatin immunoprecipitation sequencing data set.
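The classification idea in the summary can be illustrated with a minimal sketch. The summary does not state the formulas, so everything below is an assumption made for illustration rather than the authors' exact procedure. The code assumes independent Poisson counts with mean s_i * g_j * d_kj, where s_i is a per-observation size factor (here total-count normalization), g_j is the total count for feature j, and d_kj is a class-specific multiplier smoothed by a hypothetical parameter `beta`. The function names `fit_poisson_classifier` and `predict_poisson_classifier` are likewise hypothetical.

```python
import numpy as np


def fit_poisson_classifier(X, y, beta=1.0):
    """Fit a diagonal (feature-independent) Poisson classifier.

    X: (n, p) array of nonnegative counts; y: (n,) array of class labels.
    beta: smoothing constant (an assumption of this sketch), must be > 0.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    total = X.sum()
    s = X.sum(axis=1) / total      # size factor per observation
    g = X.sum(axis=0)              # total count per feature
    d = np.empty((classes.size, X.shape[1]))
    for idx, k in enumerate(classes):
        in_k = y == k
        observed = X[in_k].sum(axis=0)   # counts seen in class k
        expected = s[in_k].sum() * g     # counts expected with no class effect
        d[idx] = (observed + beta) / (expected + beta)
    log_prior = np.log(np.array([(y == k).mean() for k in classes]))
    return {"classes": classes, "g": g, "d": d,
            "log_prior": log_prior, "total": total}


def predict_poisson_classifier(model, X_new):
    """Assign each row of X_new to the class with the largest Poisson score."""
    X_new = np.asarray(X_new, dtype=float)
    s_new = X_new.sum(axis=1) / model["total"]
    d, g = model["d"], model["g"]
    # Poisson log-likelihood, dropping terms that do not depend on the class:
    #   sum_j x_j * log(d_kj) - s * sum_j g_j * d_kj + log(prior_k)
    scores = X_new @ np.log(d).T - np.outer(s_new, d @ g) + model["log_prior"]
    return model["classes"][np.argmax(scores, axis=1)]


if __name__ == "__main__":
    # Toy counts: two features, two classes with opposite expression patterns.
    X = np.array([[10, 1], [12, 2], [1, 10], [2, 12]])
    y = np.array([0, 0, 1, 1])
    model = fit_poisson_classifier(X, y)
    print(predict_poisson_classifier(model, np.array([[11, 1], [1, 11]])))
```

Because the score is linear in the counts, the rule plays the same role as diagonal linear discriminant analysis, with the Gaussian likelihood replaced by a Poisson one. In this sketch, the smoothing constant keeps the logarithm finite for features that have zero counts in a class.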


62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
92C40 Biochemistry, molecular biology
65C60 Computational problems in statistics (MSC2010)

