zbMATH — the first resource for mathematics

Covariance matrix estimation for left-censored data. (English) Zbl 06921321
Summary: Multivariate methods often rely on a sample covariance matrix. The conventional estimators of a covariance matrix require complete data vectors on all subjects, an assumption that frequently cannot be met. For example, in many fields of the life sciences that use modern measuring technology, such as mass spectrometry, left-censored values caused by denoising the data are a common phenomenon. Left-censored values are low-level concentrations that are considered too imprecise to be reported as a single number but are known to lie somewhere between zero and the laboratory's lower limit of detection. Maximum likelihood-based covariance matrix estimators are considered that accommodate left-censored values without substituting them with a constant or ignoring them completely. The presented estimators use all the available information efficiently and thus, based on simulation studies, produce the least biased estimates when compared with commonly used competing estimators. As the genuine maximum likelihood estimate can be computed quickly only in low dimensions, it is suggested to estimate the covariance matrix element-wise and then adjust the resulting matrix to achieve positive semi-definiteness. It is shown that the new approach decreases computation times substantially while still producing accurate estimates. Finally, as an example, a left-censored data set of toxic chemicals is explored.
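The idea of keeping censored values in the likelihood, rather than substituting a constant, can be illustrated in one dimension. The following is a minimal sketch and not the authors' estimator: it assumes a univariate normal model with a single known detection limit, simulated data, and a simple compass search as the optimizer. The names `censored_loglik` and `fit_censored_normal` are hypothetical. Each observed value contributes the normal density, and each censored value contributes the probability of falling below the limit.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for its cdf


def censored_loglik(mu, sigma, observed, n_censored, lod):
    """Log-likelihood of N(mu, sigma^2) when n_censored values are only known to be < lod."""
    if sigma <= 0:
        return -math.inf
    ll = 0.0
    for x in observed:
        z = (x - mu) / sigma
        # fully observed value: log of the normal density
        ll += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)
    if n_censored:
        # censored value: log of P(X < lod)
        p = STD_NORMAL.cdf((lod - mu) / sigma)
        if p <= 0.0:
            return -math.inf
        ll += n_censored * math.log(p)
    return ll


def fit_censored_normal(observed, n_censored, lod):
    """Maximize the censored log-likelihood over (mu, log sigma) by compass search."""
    # start from the moments of the observed values only (a biased starting point)
    mu = sum(observed) / len(observed)
    var = sum((x - mu) ** 2 for x in observed) / len(observed)
    log_s = 0.5 * math.log(var)
    best = censored_loglik(mu, math.exp(log_s), observed, n_censored, lod)
    step = 0.5
    while step > 1e-6:
        improved = False
        for d_mu, d_ls in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand = censored_loglik(mu + d_mu, math.exp(log_s + d_ls),
                                   observed, n_censored, lod)
            if cand > best:
                mu, log_s, best = mu + d_mu, log_s + d_ls, cand
                improved = True
        if not improved:
            step /= 2  # no neighbour is better: refine the search grid
    return mu, math.exp(log_s)


# Simulated example (assumed values): true mean 1, true sd 1, detection limit 0.5
random.seed(1)
TRUE_MU, TRUE_SIGMA, LOD = 1.0, 1.0, 0.5
sample = [random.gauss(TRUE_MU, TRUE_SIGMA) for _ in range(1000)]
observed = [x for x in sample if x >= LOD]
n_censored = len(sample) - len(observed)

mu_hat, sigma_hat = fit_censored_normal(observed, n_censored, LOD)
naive_mean = sum(observed) / len(observed)  # ignores censored values entirely
print("censored ML estimate:", mu_hat, sigma_hat)
print("mean of observed values only:", naive_mean)
```

In this sketch the mean of the observed values alone overstates the true mean, because every discarded value lies below the limit. The censored likelihood corrects for this by accounting for how many values fell below it. An element-wise covariance estimate would apply the analogous bivariate likelihood to each pair of variables. That step, and the subsequent adjustment to positive semi-definiteness, are not shown here.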
MSC:
62 Statistics