A statistical framework for the analysis of microarray probe-level data. (English) Zbl 1126.62111

Summary: In microarray technology, a number of critical steps are required to convert the raw measurements into the data relied upon by biologists and clinicians. These data manipulations, referred to as preprocessing, influence the quality of the ultimate measurements and studies that rely upon them. A standard operating procedure for microarray researchers is to use preprocessed data as the starting point for statistical analyses that produce reported results. This has prevented many researchers from carefully considering their choice of preprocessing methodology. Furthermore, the fact that the preprocessing step affects the stochastic properties of the final statistical summaries is often ignored.
We propose a statistical framework that permits the integration of preprocessing into the standard statistical analysis flow of microarray data. This general framework is relevant in many microarray platforms and motivates targeted analysis methods for specific applications. We demonstrate its usefulness by applying the idea in three different applications of the technology.


62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology


sma; BGX; vsn; Affycomp III; gcrma


[1] Amaratunga, D. and Cabrera, J. (2001). Analysis of data from viral DNA microchips. J. Amer. Statist. Assoc. 96 1161-1170. · Zbl 1073.62572
[2] Chu, T.-M., Weir, B. and Wolfinger, R. (2002). A systematic statistical linear modeling approach to oligonucleotide array experiments. Math. Biosci. 176 35-51. · Zbl 0997.62087
[3] Chudin, E., Walker, R., Kosaka, A., Wu, S. X., Rabert, D., Chang, T. K. and Kreder, D. E. (2001). Assessment of the relationship between signal intensities and transcript concentration for Affymetrix GeneChip arrays. Genome Biol. 3 RESEARCH0005.
[4] Cope, L., Irizarry, R., Jaffee, H., Wu, Z. and Speed, T. (2004). A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 20 323-331.
[5] Cui, X., Kerr, M. K. and Churchill, G. A. (2003). Transformations for cDNA microarray data. Statistical Applications in Genetics and Molecular Biology 2 Article 4. · Zbl 1038.92015
[6] Dudoit, S., Yang, Y. H., Callow, M. J. and Speed, T. P. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statist. Sinica 12 111-139. · Zbl 1004.62088
[7] Durbin, B. P., Hardin, J. S., Hawkins, D. M. and Rocke, D. M. (2002). A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 18 (Suppl. 1) S105-S110.
[8] Geller, S. C., Gregg, J. P., Hagerman, P. and Rocke, D. M. (2003). Transformation and normalization of oligonucleotide microarray data. Bioinformatics 19 1817-1823.
[9] Getz, G., Levine, E. and Domany, E. (2000). Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA 97 12079-12084.
[10] Giaever, G., Chu, A. M., Ni, L., Connelly, C., Riles, L., Veronneau, S., Dow, S., Lucau-Danila, A., Anderson, K., Andre, B., Arkin, A. P., Astromoff, A., El-Bakkoury, M., Bangham, R., Benito, R., Brachat, S., Campanaro, S., Curtiss, M., Davis, K., Deutschbauer, A., Entian, K. D., Flaherty, P., Foury, F., Garfinkel, D. J., Gerstein, M., Gotte, D., Guldener, U., Hegemann, J. H., Hempel, S., Herman, Z., Jaramillo, D. F., Kelly, D. E., Kelly, S. L., Kotter, P., LaBonte, D., Lamb, D. C., Lan, N., Liang, H., Liao, H., Liu, L., Luo, C., Lussier, M., Mao, R., Menard, P., Ooi, S., Revuelta, J., Roberts, C., Rose, M., Ross-Macdonald, P., Scherens, B., Schimmack, G., Shafer, B., Shoemaker, D. D., Sookhai-Mahadeo, S., Storms, R. K., Strathern, J. N., Valle, G., Voet, M., Volckaert, G., Wang, C. Y., Ward, T. R., Wilhelmy, J., Winzeler, E. A., Yang, Y., Yen, G., Youngman, E., Yu, K., Bussey, H., Boeke, J. D., Snyder, M., Philippsen, P., Davis, R. W. and Johnston, M. (2002). Functional profiling of the Saccharomyces cerevisiae genome. Nature 418 387-391.
[11] Gottardo, R., Pannucci, J. A., Kuske, C. R. and Brettin, T. (2003). Statistical analysis of microarray data: A Bayesian approach. Biostatistics 4 597-620. · Zbl 1197.62147
[12] Hein, A.-M., Richardson, S., Causton, H. C., Ambler, G. K. and Green, P. J. (2005). BGX: A fully Bayesian gene expression index for Affymetrix GeneChip data. Biostatistics 6 349-373. · Zbl 1070.62103
[13] Hekstra, D., Taussig, A. R., Magnasco, M. and Naef, F. (2003). Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Res. 31 1962-1968.
[14] Hubbell, E., Liu, W.-M. and Mei, R. (2002). Robust estimators for expression analysis. Bioinformatics 18 1585-1592.
[15] Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A. and Vingron, M. (2002). Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 1 1-9. · Zbl 1142.62100
[16] Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf, U. and Speed, T. (2003a). Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics 4 249-264. · Zbl 1141.62348
[17] Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B. and Speed, T. P. (2003b). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31.
[18] Irizarry, R. A., Wu, Z. and Jaffee, H. (2006). Comparison of Affymetrix GeneChip expression measures. Bioinformatics 22 789-794. · Zbl 1142.62100
[19] Kendziorski, C. M., Newton, M. A., Lan, H. and Gould, M. (2003). On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine 22 3899-3914.
[20] Kerr, M., Afshari, C., Bennett, L., Bushel, P., Martinez, J., Walker, N. and Churchill, G. (2002). Statistical analysis of a gene expression microarray experiment with replication. Statist. Sinica 12 203-217. · Zbl 1004.62083
[21] Kerr, M. K., Martin, M. and Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. J. Comput. Biol. 7 819-837.
[22] Lee, M.-L. T., Kuo, F. C., Whitmore, G. A. and Sklar, J. (2000). Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proc. Natl. Acad. Sci. USA 97 9834-9839. · Zbl 0955.92016
[23] Li, C. and Wong, W. (2001). Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad. Sci. USA 98 31-36. · Zbl 0990.62091
[24] Liu, W., Mei, R., Di, X., Ryder, T. B., Hubbell, E., Dee, S., Webster, T. A., Harrington, C. A., Ho, M., Baid, J. and Smeekens, S. P. (2002). Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics 18 1593-1599. · Zbl 1154.68497
[25] Liu, X., Milo, M., Lawrence, N. D. and Rattray, M. (2006). Probe-level measurement error improves accuracy in detecting differential gene expression. Bioinformatics 22 2107-2113. · Zbl 1154.68497
[26] Lonnstedt, I. and Speed, T. (2002). Replicated microarray data. Statist. Sinica 12 31-46. · Zbl 1004.62086
[27] Meyer, C., Gottardo, R., Carroll, J., Brown, M. and Liu, X. (2006). Model-based analysis of tiling-arrays for ChIP-chip. Proc. Natl. Acad. Sci. USA 103 12457-12462.
[28] Naef, F. and Magnasco, M. O. (2003). Solving the riddle of the bright mismatches: Labeling and effective binding in oligonucleotide arrays. Phys. Rev. E 68 011906.
[29] Newton, M., Kendziorski, C., Richmond, C., Blattner, F. and Tsui, K. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Comput. Biol. 8 37-52.
[30] Pan, W., Lin, J. and Le, C. (2003). A mixture model approach to detecting differentially expressed genes with microarray data. Functional Integrative Genomics 3 117-124.
[31] Peyser, B. D., Irizarry, R. A., Tiffany, C., Chen, O., Yuan, D. S., Boeke, J. D. and Spencer, F. A. (2005). Improved statistical analysis of budding yeast tag microarrays revealed by defined spike-in pools. Nucleic Acids Res. 33 40.
[32] Rattray, M., Liu, X., Sanguinetti, G., Milo, M. and Lawrence, N. D. (2006). Propagating uncertainty in microarray data analysis. Briefings in Bioinformatics 7 37-47.
[33] Rocke, D. M. and Durbin, B. (2001). A model for measurement error for gene expression arrays. J. Comput. Biol. 8 557-569.
[34] Schena, M., Shalon, D., Heller, R., Chai, A., Brown, P. and Davis, R. (1996). Parallel human genome analysis: Microarray-based expression monitoring of 1000 genes. Proc. Natl. Acad. Sci. USA 93 10614-10619.
[35] Singh-Gasson, S., Green, R. D., Yue, Y., Nelson, C., Blattner, F., Sussman, M. R. and Cerrina, F. (1999). Maskless fabrication of light-directed oligonucleotide microarrays using a digital micromirror array. Nature Biotechnology 17 974-978.
[36] Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3 Article 3. · Zbl 1038.62110
[37] Tusher, V., Tibshirani, R. and Chu, C. (2001). Significance analysis of microarrays applied to ionizing radiation response. Proc. Natl. Acad. Sci. USA 98 5116-5121. · Zbl 1012.92014
[38] Wang, W., Carvalho, B., Miller, N., Pevsner, J., Chakravarti, A. and Irizarry, R. A. (2006a). Estimating genome-wide copy number using allele specific mixture models. Working Papers 122, Dept. of Biostatistics, Johns Hopkins University. Available at http://www.bepress.com/jhubiostat/paper122.
[39] Wang, X., He, H., Li, L., Chen, R., Deng, X. W. and Li, S. (2006b). Nmpp: A user-customized nimblegen microarray data processing pipeline. Bioinformatics 22 2955-2957.
[40] Wolfinger, R., Gibson, G., Wolfinger, E., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C. and Paules, R. (2001). Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol. 8 625-637.
[41] Wu, Z. and Irizarry, R. (2004). Stochastic models inspired by hybridization theory for short oligonucleotide arrays. In Proceedings of RECOMB 2004. J. Comput. Biol. 12 882-893.
[42] Wu, Z., Irizarry, R., Gentleman, R., Martinez-Murillo, F. and Spencer, F. (2004). A model-based background adjustment for oligonucleotide expression arrays. J. Amer. Statist. Assoc. 99 909-917. · Zbl 1055.62129
[43] Wu, Z. and Irizarry, R. A. (2005). A statistical framework for the analysis of microarray probe-level data. Working papers, Dept. Biostatistics, Johns Hopkins Univ. Available at http://www.bepress.com/jhubiostat/paper73.
[44] Yang, I. V., Chen, E., Hasseman, J. P., Liang, W., Frank, B. C., Wang, S., Sharov, V., Saeed, A. I., White, J., Li, J., Lee, N. H., Yeatman, T. J. and Quackenbush, J. (2002). Within the fold: Assessing differential expression measures and reproducibility in microarray assays. Genome Biology 3 research0062.1-0062.12.
[45] Yuan, D., Pan, X., Ooi, S., Peyser, B., Spencer, F., Irizarry, R. and Boeke, J. (2005). Improved microarray methods for profiling the yeast knockout strain collection.