zbMATH — the first resource for mathematics

Determining the number of components in PLS regression on incomplete data set. (English) Zbl 1447.62016
Summary: Partial least squares regression (PLS regression) is a multivariate method in which the model parameters are estimated using either the SIMPLS or the NIPALS algorithm. PLS regression has been used extensively in applied research because of its effectiveness in analyzing relationships between an outcome and one or several components. Notably, the NIPALS algorithm can provide parameter estimates on incomplete data. Selecting the number of components used to build a representative model is a central issue in PLS regression. Several approaches have been proposed in the literature, including the \(Q^2\) criterion and the AIC and BIC criteria. However, how to deal with missing data when using PLS regression remains a matter of debate. Here we study the behavior of the NIPALS algorithm when it is used to fit a PLS regression for various proportions of missing data and different types of missingness. We compare criteria for selecting the number of components of a PLS regression on an incomplete data set and on data sets imputed by three methods: multiple imputation by chained equations, \(k\)-nearest neighbour imputation, and singular value decomposition imputation. The criteria were tested with proportions of missing data ranging from 5% to 50% under different missingness assumptions. \(Q^2\) leave-one-out component selection gave more reliable results than the AIC- and BIC-based methods.
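To make the procedure in the summary concrete, the following is a minimal sketch of a single-response PLS regression fitted by NIPALS with missing entries skipped in every inner product, followed by a leave-one-out \(Q^2\) for each component. It is not the authors' implementation. The function names (`fit_nipals_pls1`, `predict_by_size`, `q2_loo`), the synthetic data, the deterministic missingness pattern, and the retention threshold 0.0975 (the conventional value attributed to Tenenhaus) are all assumptions made for illustration; the summary specifies none of them.

```python
import numpy as np

def fit_nipals_pls1(X, y, n_comp):
    """PLS1 by NIPALS; NaN entries of X are left out of every sum."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(X)                    # True where X is observed
    m = obs.astype(float)
    x_mean = np.nanmean(X, axis=0)        # column means over observed values
    y_mean = y.mean()
    E = np.where(obs, X - x_mean, 0.0)    # centred X, zeros at missing cells
    f = y - y_mean
    W, P, C = [], [], []
    for _ in range(n_comp):
        w = (E.T @ f) / (m.T @ (f ** 2))  # slope of each column on f
        w /= np.linalg.norm(w)
        t = (E @ w) / (m @ (w ** 2))      # slope of each row on w
        p = (E.T @ t) / (m.T @ (t ** 2))  # slope of each column on t
        c = (f @ t) / (t @ t)             # slope of f on t
        E = np.where(obs, E - np.outer(t, p), 0.0)  # deflate observed cells
        f = f - c * t
        W.append(w)
        P.append(p)
        C.append(c)
    return x_mean, y_mean, np.array(W), np.array(P), np.array(C)

def predict_by_size(model, X_new):
    """Predictions using 1..n_comp components, shape (n_comp, n_rows)."""
    x_mean, y_mean, W, P, C = model
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    obs = ~np.isnan(X_new)
    m = obs.astype(float)
    E = np.where(obs, X_new - x_mean, 0.0)
    y_hat = np.full(X_new.shape[0], y_mean)
    out = []
    for w, p, c in zip(W, P, C):
        t = (E @ w) / (m @ (w ** 2))
        E = np.where(obs, E - np.outer(t, p), 0.0)
        y_hat = y_hat + c * t
        out.append(y_hat)
    return np.array(out)

def q2_loo(X, y, max_comp):
    """Q2_h = 1 - PRESS_h / RSS_(h-1), with PRESS from leave-one-out."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    fitted = predict_by_size(fit_nipals_pls1(X, y, max_comp), X)
    rss = np.concatenate(([np.sum((y - y.mean()) ** 2)],
                          np.sum((y - fitted) ** 2, axis=1)))
    press = np.zeros(max_comp)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit_nipals_pls1(X[keep], y[keep], max_comp)
        press += (y[i] - predict_by_size(model, X[i])[:, 0]) ** 2
    return 1.0 - press / rss[:-1]

# Synthetic data with one latent variable (invented for this demo).
rng = np.random.default_rng(0)
n, p = 30, 5
z = rng.normal(size=n)
loadings = np.array([1.0, -1.0, 0.5, 2.0, 1.5])
X = np.outer(z, loadings) + 0.05 * rng.normal(size=(n, p))
y = 2.0 * z + 0.05 * rng.normal(size=n)
for i in range(0, n, 3):                  # one missing cell in every third row
    X[i, (i // 3) % p] = np.nan

q2 = q2_loo(X, y, max_comp=3)

# Keep leading components while Q2_h stays at or above the assumed threshold.
n_selected = 0
for value in q2:
    if value < 0.0975:
        break
    n_selected += 1
print(q2, n_selected)
```

The sketch does not guard against a row or column that is entirely missing, which would produce a division by zero. The imputation-based alternatives named in the summary would instead fill in `X` before an ordinary PLS fit.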
MSC:
62D10 Missing data
62J02 General nonlinear regression