
PLS for Big Data: a unified parallel algorithm for regularised group PLS. (English) Zbl 1431.62249

This article surveys partial least squares (PLS) methods for two blocks of data. A general framework that covers both symmetric and asymmetric methods is built, and group structure is also explored. Variable selection techniques based on penalized singular value decomposition are employed in a new unified algorithm that can perform different partial least squares methods and their regularized versions. Further extensions to deal with massive data sets are presented. The optimization criteria and algorithmic computation are detailed, and different approaches to decrease the computational time are explored. The performance of the algorithm and its scalability to large sample sizes are demonstrated on simulated data sets. The first simulation considers an asymmetric model on group-structured data, while the second presents an extension to discriminant analysis.
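To make the two computational ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes NumPy, a lasso-type (soft-thresholding) penalty, and unit-norm weight vectors; the function names `chunked_crossprod`, `soft_threshold` and `sparse_rank_one` and the parameters `lam_u`, `lam_v` and `chunk_size` are hypothetical. The first function accumulates the cross-product matrix over row chunks, which is one way to handle large sample sizes; the second computes a penalized rank-one approximation of that matrix by alternating soft-thresholding, which yields sparse PLS weight vectors.

```python
import numpy as np


def chunked_crossprod(X, Y, chunk_size=1000):
    """Accumulate M = X^T Y over row chunks of the two data blocks.

    Only one chunk of rows is used per step, so the same loop could read
    chunks from disk or be split across workers (illustrative assumption).
    """
    M = np.zeros((X.shape[1], Y.shape[1]))
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        M += X[start:stop].T @ Y[start:stop]
    return M


def soft_threshold(a, lam):
    """Elementwise soft-thresholding: shrink towards zero by lam."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def sparse_rank_one(M, lam_u=0.0, lam_v=0.0, n_iter=100, tol=1e-8):
    """Penalized rank-one approximation of M by alternating soft-thresholding.

    Returns unit-norm weight vectors (u, v). With lam_u = lam_v = 0 this
    reduces to the leading singular vectors of M.
    """
    # Start from the leading singular vectors of the unpenalized problem.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]
    for _ in range(n_iter):
        u_new = soft_threshold(M @ v, lam_u)
        if not u_new.any():
            raise ValueError("lam_u is too large: all X-weights are zero")
        u_new = u_new / np.linalg.norm(u_new)
        v_new = soft_threshold(M.T @ u_new, lam_v)
        if not v_new.any():
            raise ValueError("lam_v is too large: all Y-weights are zero")
        v_new = v_new / np.linalg.norm(v_new)
        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v
```

A typical call would be `u, v = sparse_rank_one(chunked_crossprod(X, Y), lam_u=1.0)` on column-centred blocks `X` and `Y`; entries of `u` that are exactly zero correspond to variables dropped from the first component. Group penalties and deflation for further components are deliberately left out of this sketch.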

MSC:

62H20 Measures of association (correlation, canonical correlation, etc.)
62R07 Statistical aspects of big data and data science
62J07 Ridge regression; shrinkage estimators (Lasso)
