
PLS for Big Data: a unified parallel algorithm for regularised group PLS. (English) Zbl 1431.62249

This article surveys partial least squares (PLS) methods for two blocks of data and builds a general framework covering both symmetric and asymmetric methods. Group structure among the variables is also explored. Variable selection techniques based on a penalized singular value decomposition are employed in a new unified algorithm that can perform several partial least squares methods as well as their regularized versions. Further extensions to deal with massive data sets are presented. The optimization criteria and their algorithmic computation are detailed, and different approaches to decreasing the computational time are explored. The performance of the algorithm and its scalability to large sample sizes are demonstrated on simulated data sets. The first simulation considers an asymmetric model on group-structured data, while the second presents an extension to discriminant analysis.
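The penalized-SVD idea underlying the variable selection step can be sketched as follows: sparse weight vectors are obtained from a rank-1 approximation of the cross-product matrix \(M = X^\top Y\), alternating between the two singular vectors with soft-thresholding applied at each update. This is a minimal illustrative sketch, not the authors' implementation; the function names, the choice of \(\ell_1\) penalties, and the alternating scheme are assumptions for exposition.

```python
import numpy as np

def soft_threshold(v, lam):
    """Elementwise soft-thresholding operator (l1 proximal map)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_rank1(M, lam_u=0.1, lam_v=0.1, n_iter=100, tol=1e-8):
    """Sparse rank-1 approximation of M by alternating penalized updates.

    With lam_u = lam_v = 0 this reduces to the leading singular-vector
    pair of M (up to sign); positive penalties shrink small loadings
    to exactly zero, performing variable selection.
    """
    # Warm start from the ordinary leading singular vectors.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0, :]
    for _ in range(n_iter):
        # Update u holding v fixed, then threshold and renormalize.
        u_new = soft_threshold(M @ v, lam_u)
        nu = np.linalg.norm(u_new)
        u_new = u_new / nu if nu > 0 else u_new
        # Symmetric update for v.
        v_new = soft_threshold(M.T @ u_new, lam_v)
        nv = np.linalg.norm(v_new)
        v_new = v_new / nv if nv > 0 else v_new
        if np.linalg.norm(u_new - u) < tol and np.linalg.norm(v_new - v) < tol:
            u, v = u_new, v_new
            break
        u, v = u_new, v_new
    return u, v
```

In a PLS context one would apply this to \(M = X^\top Y\), deflate, and iterate for further components; group-penalized variants would replace the elementwise soft-thresholding with a groupwise shrinkage operator.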

MSC:

62H20 Measures of association (correlation, canonical correlation, etc.)
62R07 Statistical aspects of big data and data science
62J07 Ridge regression; shrinkage estimators (Lasso)
