Clustering and classification based on the \(L_1\) data depth. (English) Zbl 1047.62064

Summary: Clustering and classification are important tasks for the analysis of microarray gene expression data. Classification of tissue samples can be a valuable diagnostic tool for diseases such as cancer. Clustering samples or experiments may lead to the discovery of subclasses of diseases. Clustering genes can help identify groups of genes that respond similarly to a set of experimental conditions. We also need validation tools for clustering and classification. Here, we focus on the identification of outliers – units that may have been misallocated, or mislabeled, or are not representative of the classes or clusters.
We present two new methods, DDclust and DDclass, for clustering and classification, respectively. These nonparametric methods are based on the intuitively simple concept of data depth. We apply the methods to several gene expression and simulated data sets. We also discuss a convenient visualization and validation tool, the relative data depth plot.
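To make the notion of data depth concrete, here is a minimal sketch of the \(L_1\) data depth in the form usually attributed to Vardi and Zhang, assuming equal weights on all observations: \(D(y) = 1 - \max(0, \|\bar e(y)\| - f(y))\), where \(\bar e(y)\) is the average of the unit vectors pointing from \(y\) to the observations distinct from \(y\), and \(f(y)\) is the fraction of observations equal to \(y\). The function names `l1_depth` and `relative_depth` are hypothetical, and the relative depth shown (depth within a unit's own cluster minus depth with respect to one competing cluster) is an assumed reading of the relative data depth plot, not a reproduction of DDclust or DDclass.

```python
import math


def l1_depth(y, points):
    """L1 data depth of y with respect to equally weighted points.

    Assumed form: D(y) = 1 - max(0, ||e_bar(y)|| - f(y)).
    """
    n = len(points)
    dim = len(y)
    unit_sum = [0.0] * dim  # running sum of unit vectors from y to each x_i
    ties = 0                # number of observations that coincide with y
    for x in points:
        diff = [xk - yk for xk, yk in zip(x, y)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            ties += 1       # coincident points contribute to f(y), not e_bar(y)
            continue
        for k in range(dim):
            unit_sum[k] += diff[k] / norm
    e_bar_norm = math.sqrt(sum(s * s for s in unit_sum)) / n
    f = ties / n
    return 1.0 - max(0.0, e_bar_norm - f)


def relative_depth(y, own_cluster, other_cluster):
    """Assumed relative depth: depth within own cluster minus depth
    with respect to a competing cluster (illustrative definition)."""
    return l1_depth(y, own_cluster) - l1_depth(y, other_cluster)


# Four corners of the unit square and a second, far-away square.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(px + 10.0, py + 10.0) for px, py in square]

center_depth = l1_depth((0.5, 0.5), square)     # deepest point of the square
far_depth = l1_depth((100.0, 100.0), square)    # far outside the data cloud
rel = relative_depth((0.5, 0.5), square, shifted)
```

Under this definition a point at the center of the cloud has depth close to 1, a point far outside has depth close to 0, and a unit with low or negative relative depth would be a candidate outlier or misallocated unit.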

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

clusfind
