Improved biclustering of microarray data demonstrated through systematic performance tests. (English) Zbl 1429.62267

Summary: A new algorithm is presented for fitting the plaid model, a biclustering method developed for clustering gene expression data. The approach is based on speedy individual differences clustering and uses binary least squares to update the cluster membership parameters, making use of the binary constraints on these parameters and simplifying the other parameter updates. The performance of the proposed algorithm and of the original plaid model algorithm is tested on simulated data sets designed to imitate (normalised) gene expression data, covering a range of biclustering configurations. Empirical distributions for the components of these data sets, including non-systematic error, are derived from a real set of microarray data. A set of two-way quality measures is proposed, based on one-way measures commonly used in information retrieval, to evaluate the quality of a retrieved bicluster with respect to a target bicluster in terms of both genes and samples. By defining a one-to-one correspondence between target biclusters and retrieved biclusters, the performance of each algorithm can be assessed. The results show that, using appropriately selected starting criteria, the proposed algorithm outperforms the original plaid model algorithm across a range of data sets. Furthermore, through the rigorous assessment of the plaid model, a benchmark for future evaluation of biclustering methods is established.
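The summary does not give the formulas for the two-way quality measures, so the following Python sketch is only one plausible reading, not the authors' definition. It assumes that the one-way measures are precision and recall, and that a bicluster is scored on its (gene, sample) cells, so that the cells shared by a retrieved and a target bicluster are the shared genes crossed with the shared samples. The function name `two_way_scores` and its arguments are hypothetical.

```python
def two_way_scores(target_genes, target_samples, found_genes, found_samples):
    """Cell-level precision, recall and F1 of a retrieved bicluster.

    Assumption (not from the paper): a bicluster is the rectangle
    genes x samples, and quality is measured over its cells.
    """
    tg, ts = set(target_genes), set(target_samples)
    fg, fs = set(found_genes), set(found_samples)

    # Cells present in both rectangles: shared genes x shared samples.
    shared = len(tg & fg) * len(ts & fs)
    found_size = len(fg) * len(fs)
    target_size = len(tg) * len(ts)

    # Precision: fraction of retrieved cells that belong to the target.
    precision = shared / found_size if found_size else 0.0
    # Recall: fraction of target cells that were retrieved.
    recall = shared / target_size if target_size else 0.0
    # F1: harmonic mean of the two, 0 when nothing overlaps.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: 4 target genes on 2 samples,
# retrieved bicluster shares 2 genes and both samples.
p, r, f = two_way_scores(
    target_genes=[1, 2, 3, 4], target_samples=["a", "b"],
    found_genes=[3, 4, 5, 6], found_samples=["a", "b", "c", "d"],
)
```

Under this reading the two-way precision factorises into gene precision times sample precision, which is why a retrieved bicluster must match the target on both dimensions to score well.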

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P10 Applications of statistics to biology and medical sciences; meta analysis
92D10 Genetics and epigenetics

Software:

R; limma
