Asymptotic inference for high-dimensional data. (English) Zbl 1184.62094

Summary: We study inference for high-dimensional data characterized by small sample sizes relative to the dimension of the data. In particular, we provide an infinite-dimensional framework to study statistical models that involve situations in which (i) the number of parameters increases with the sample size (that is, allowed to be random) and (ii) there is the possibility of missing data. Under a variety of tail conditions on the components of the data, we provide precise conditions for the joint consistency of the estimators of the mean. In the process, we clarify and improve some of the recent consistency results that appeared in the literature.
An important aspect of the work presented is the development of asymptotic normality results for these models. As a consequence, we construct different test statistics for one-sample and two-sample problems concerning the mean vector and obtain their asymptotic distributions as a corollary of the infinite-dimensional results. Finally, we use these theoretical results to develop an asymptotically justifiable methodology for data analyses. Simulation results presented here describe situations where the methodology can be successfully applied. They also evaluate its robustness under a variety of conditions, some of which are substantially different from the technical conditions. Comparisons to other methods used in the literature are provided. Analyses of real-life data are also included.


62H15 Hypothesis testing in multivariate analysis
62E20 Asymptotic distribution theory in statistics
62-07 Data analysis (statistics) (MSC2010)
65C60 Computational problems in statistics (MSC2010)
60F05 Central limit and other weak theorems
62H12 Estimation in multivariate analysis
62P10 Applications of statistics to biology and medical sciences; meta analysis
62G20 Asymptotic properties of nonparametric inference


[1] Araujo, A. and Giné, E. (1980). The Central Limit Theorem for Real and Banach Valued Random Variables. Wiley, New York. · Zbl 0457.60001
[2] Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View. Wiley, New York. · Zbl 0546.62015
[3] Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 97 77-87. · Zbl 1073.62576
[4] Feller, W. (1966). An Introduction to Probability Theory and Its Applications. Wiley, New York. · Zbl 0138.10207
[5] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58 13-30. · Zbl 0127.10602
[6] Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 531-537.
[7] Kosorok, M. and Ma, S. (2007). Marginal asymptotics for the large p, small n paradigm: With application to microarray data. Ann. Statist. 35 1456-1486. · Zbl 1123.62005
[8] Kuelbs, J. and Vidyashankar, A. N. (2008). Asymptotic inference for high-dimensional data. Preprint. Available at http://mason.gmu.edu/ avidyash. · Zbl 1184.62094
[9] Kuelbs, J. and Vidyashankar, A. N. (2008). Simulation report using structured covariances. Preprint. Available at http://mason.gmu.edu/ avidyash.
[10] Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. J. Multivariate Anal. 88 365-411. · Zbl 1032.62050
[11] Lu, Y., Liu, P.-Y., Xiao, P. and Deng, H.-W. (2005). Hotelling's T² multivariate profiling for detecting differential expression in microarrays. Bioinformatics 21 3105-3113.
[12] Okamoto, M. (1958). Some inequalities relating to the partial sum of binomial probabilities. Ann. Inst. Statist. Math. 10 29-35. · Zbl 0084.14001
[13] Parthasarathy, K. R. (1967). Probability Measures on Metric Spaces. Academic Press, New York. · Zbl 0153.19101
[14] Paulauskas, V. (1984). On the central limit theorem in c_0. Probab. Math. Statist. 3 127-141. · Zbl 0555.60009
[15] Portnoy, S. (1984). Asymptotic behavior of M-estimators of p regression parameters when p^2/n is large. I. Consistency. Ann. Statist. 12 1298-1309. · Zbl 0584.62050
[16] Reverter, A., Wang, Y. H., Byrne, K. A., Tan, S. H., Harper, G. S. and Lehnert, S. A. (2004). Joint analysis of multiple cDNA microarray studies via multivariate mixed models applied to genetic improvement of beef cattle. Journal of Animal Science 82 3430-3439.
[17] Schäfer, J. and Strimmer, K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat. Appl. Genet. Mol. Biol. 4 1-30. · Zbl 1077.92042
[18] van der Laan, M. J. and Bryan, J. (2001). Gene expression analysis with parametric bootstrap. Biostatistics 2 445-461. · Zbl 1097.62571
[19] Yan, X., Deng, M., Fung, W. K. and Qian, M. (2005). Detecting differentially expressed genes by relative entropy. J. Theoret. Biol. 3 395-402.