Functional principal subspace sampling for large scale functional data analysis. (English) Zbl 1493.62642

Summary: Functional data analysis (FDA) methods have computational and theoretical appeal for some high-dimensional data, but they lack scalability to modern large-sample datasets. To tackle this challenge, we develop randomized algorithms for two important FDA methods: functional principal component analysis (FPCA) and functional linear regression (FLR) with scalar response. The two methods are connected in that both rely on accurate estimation of the functional principal subspace. The proposed algorithms draw subsamples from the large dataset at hand and apply FPCA or FLR over the subsamples to reduce the computational cost. To effectively preserve subspace information in the subsamples, we propose a functional principal subspace sampling probability, which removes the eigenvalue scale effect inside the functional principal subspace and properly weights the residual. Based on operator perturbation analysis, we show that the proposed probability gives precise control over the first-order error of the subspace projection operator and can be interpreted as importance sampling for functional subspace estimation. Moreover, concentration bounds for the proposed algorithms are established, reflecting the low intrinsic dimension of functional data in an infinite-dimensional space. The effectiveness of the proposed algorithms is demonstrated on synthetic and real datasets.
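The subsampling scheme described above can be sketched in finite dimensions as follows. This is an illustrative reconstruction from the summary only, not the paper's exact formula: the probability whitens the scores inside the top-K principal subspace (removing the eigenvalue scale effect) and adds a weighted residual term; the function names and the `residual_weight` parameter are assumptions introduced here.

```python
import numpy as np

def subspace_sampling_probs(X, K, residual_weight=1.0):
    """Illustrative sampling probabilities inspired by the summary.

    X : (n, d) centered data matrix (a discretized stand-in for
        functional observations); K : target subspace dimension.
    The exact weighting in the paper may differ -- this is a sketch.
    """
    n, d = X.shape
    C = X.T @ X / n                          # empirical covariance
    evals, evecs = np.linalg.eigh(C)         # ascending eigenvalues
    order = np.argsort(evals)[::-1]
    lam = evals[order[:K]]                   # top-K eigenvalues
    V = evecs[:, order[:K]]                  # top-K eigenvectors
    scores = X @ V                           # (n, K) principal scores
    inside = np.sum(scores**2 / lam, axis=1) # eigenvalue-whitened part
    resid = np.sum((X - scores @ V.T)**2, axis=1)  # residual norm^2
    w = inside + residual_weight * resid
    return w / w.sum()

def subsampled_covariance(X, probs, m, rng):
    """Draw m indices with replacement and form the inverse-probability
    reweighted covariance estimate, which is unbiased for X.T @ X / n."""
    n = X.shape[0]
    idx = rng.choice(n, size=m, p=probs)
    w = 1.0 / (n * m * probs[idx])           # importance weights
    return (X[idx].T * w) @ X[idx]
```

Under these weights each sampled term `x_i x_i^T / (n m p_i)` has expectation `C / m`, so the reweighted sum is an unbiased estimate of the full covariance; the subsequent FPCA or FLR step would then operate on the subsample alone.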


62R10 Functional data analysis
62H25 Factor analysis and principal components; correspondence analysis
62D99 Statistical sampling theory and related topics


Full Text: DOI arXiv Link


[1] BERMEJO, J. M., RAMOS, A. A. and PRIETO, C. A. (2013). A PCA approach to stellar effective temperatures. Astronomy & Astrophysics 553 A95.
[2] CARDOT, H., DEGRAS, D. and JOSSERAND, E. (2013). Confidence bands for Horvitz-Thompson estimators using sampled noisy functional data. Bernoulli 19 2067-2097. · Zbl 1457.62394
[3] CARDOT, H., GOGA, C. and LARDIN, P. (2013). Uniform convergence and asymptotic confidence bands for model-assisted estimators of the mean of sampled functional data. Electronic Journal of Statistics 7 562-596. · Zbl 1336.62043
[4] CARDOT, H. and JOSSERAND, E. (2011). Horvitz-Thompson estimators for functional data: Asymptotic confidence bands and optimal allocation for stratified sampling. Biometrika 98 107-118. · Zbl 1214.62009
[5] CARDOT, H., MAS, A. and SARDA, P. (2007). CLT in functional linear regression models. Probability Theory and Related Fields 138 325-361. · Zbl 1113.60025
[6] CARDOT, H., CHAOUCH, M., GOGA, C. and LABRUÈRE, C. (2010). Properties of design-based functional principal components analysis. Journal of Statistical Planning and Inference 140 75-91. · Zbl 1178.62067
[7] CONNOLLY, A. J., SZALAY, A., BERSHADY, M., KINNEY, A. and CALZETTI, D. (1994). Spectral classification of galaxies: an orthogonal approach. arXiv preprint astro-ph/9411044.
[8] DEGRAS, D. (2014). Rotation sampling for functional data. Statistica Sinica 1075-1095. · Zbl 06431821
[9] DELAIGLE, A., HALL, P. et al. (2010). Defining probability density for a distribution of random functions. The Annals of Statistics 38 1171-1193. · Zbl 1183.62061
[10] DICKER, L. H., FOSTER, D. P., HSU, D. et al. (2017). Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators. Electronic Journal of Statistics 11 1022-1047. · Zbl 1362.62087
[11] DRINEAS, P., KANNAN, R. and MAHONEY, M. W. (2006a). Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing 36 132-157. · Zbl 1111.68147
[12] DRINEAS, P., KANNAN, R. and MAHONEY, M. W. (2006b). Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing 36 158-183. · Zbl 1111.68148
[13] DRINEAS, P., KANNAN, R. and MAHONEY, M. W. (2006c). Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing 36 184-206. · Zbl 1111.68149
[14] DRINEAS, P., MAHONEY, M. W. and MUTHUKRISHNAN, S. (2006). Subspace Sampling and Relative-Error Matrix Approximation: Column-Row-Based Methods. In Algorithms - ESA 2006, 14th Annual European Symposium, Zurich, Switzerland, September 11-13, 2006, Proceedings. · Zbl 1131.68589
[15] DRINEAS, P. and MAHONEY, M. W. (2018). Lectures on randomized numerical linear algebra. The Mathematics of Data 25 1. · Zbl 1448.68004
[16] DRINEAS, P., MAHONEY, M. W., MUTHUKRISHNAN, S. and SARLÓS, T. (2011). Faster least squares approximation. Numerische Mathematik 117 219-249. · Zbl 1218.65037
[17] DRINEAS, P., MAGDON-ISMAIL, M., MAHONEY, M. W. and WOODRUFF, D. P. (2012). Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research 13 3475-3506. · Zbl 1437.65030
[18] EISENSTEIN, D. J., WEINBERG, D. H., AGOL, E., AIHARA, H., PRIETO, C. A., ANDERSON, S. F., ARNS, J. A., AUBOURG, É., BAILEY, S., BALBINOT, E. et al. (2011). SDSS-III: Massive spectroscopic surveys of the distant universe, the Milky Way, and extra-solar planetary systems. The Astronomical Journal 142 72.
[19] HADJIPANTELIS, P. Z. and MÜLLER, H.-G. (2018). Functional data analysis for big data: A case study on California temperature trends. In Handbook of Big Data Analytics 457-483. Springer.
[20] HALKO, N., MARTINSSON, P. and TROPP, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review 53 217-288. · Zbl 1269.65043
[21] HALL, P., HOROWITZ, J. L. et al. (2007). Methodology and convergence rates for functional linear regression. The Annals of Statistics 35 70-91. · Zbl 1114.62048
[22] HE, S. and YAN, X. (2020). Randomized estimation of functional covariance operator via subsampling. Stat 9 e311.
[23] HORVÁTH, L. and KOKOSZKA, P. (2012). Inference for functional data with applications 200. Springer Science & Business Media. · Zbl 1279.62017
[24] HORVÁTH, L., KOKOSZKA, P. and REEDER, R. (2013). Estimation of the Mean of Functional Time Series and a Two-Sample Problem. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 103-122. · Zbl 07555440
[25] HSING, T. and EUBANK, R. (2015). Theoretical foundations of functional data analysis, with an introduction to linear operators. John Wiley & Sons. · Zbl 1338.62009
[26] JAMES, G. M., HASTIE, T. J. and SUGAR, C. A. (2000). Principal component models for sparse functional data. Biometrika 87 587-602. · Zbl 0962.62056
[27] KATO, K. (2012). Estimation in functional linear quantile regression. Annals of Statistics 40 3108-3136. · Zbl 1296.62104
[28] KOKOSZKA, P. and REIMHERR, M. (2017). Introduction to functional data analysis. CRC Press. · Zbl 1411.62004
[29] KOLTCHINSKII, V., LOUNICI, K. et al. (2016). Asymptotics and concentration bounds for bilinear forms of spectral projectors of sample covariance. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques 52 1976-2013. Institut Henri Poincaré. · Zbl 1353.62053
[30] LARDIN-PUECH, P., CARDOT, H. and GOGA, C. (2014). Analysing large datasets of functional data: a survey sampling point of view. Journal de la Société Française de Statistique 155 70-94. · Zbl 1316.62019
[31] LI, X., WU, Q. J., LUO, A., ZHAO, Y., LU, Y., ZUO, F., YANG, T. and WANG, Y. (2014). SDSS/SEGUE spectral feature analysis for stellar atmospheric parameter estimation. The Astrophysical Journal 790 105.
[32] LIU, C., CUI, W.-Y., ZHANG, B., WAN, J.-C., DENG, L.-C., HOU, Y.-H., WANG, Y.-F., YANG, M. and ZHANG, Y. (2015). Spectral classification of stars based on LAMOST spectra. Research in Astronomy and Astrophysics 15 1137.
[33] LIU, P., DI, L., DU, Q. and WANG, L. (2018). Remote Sensing Big Data: Theory, Methods and Applications. Remote Sensing 10 711.
[34] MA, P., MAHONEY, M. W. and YU, B. (2015). A statistical perspective on algorithmic leveraging. The Journal of Machine Learning Research 16 861-911. · Zbl 1337.62164
[35] MCGURK, R. C., KIMBALL, A. E. and IVEZIC, Z. (2010). Principal Component Analysis of SDSS Stellar Spectra. Astron. J. 139 1261-1268.
[36] MINSKER, S. (2017). On some extensions of Bernstein’s inequality for self-adjoint operators. Statistics & Probability Letters 127 111-119. · Zbl 1377.60018
[37] MOR-YOSEF, L. and AVRON, H. (2019). Sketching for principal component regression. SIAM Journal on Matrix Analysis and Applications 40 454-485. · Zbl 1416.65105
[38] PENG, J. and PAUL, D. (2009). A geometric approach to maximum likelihood estimation of the functional principal components from sparse longitudinal data. Journal of Computational and Graphical Statistics 18 995-1015.
[39] PILANCI, M. and WAINWRIGHT, M. J. (2015). Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory 61 5096-5115. · Zbl 1359.90097
[40] PILANCI, M. and WAINWRIGHT, M. J. (2017). Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization 27 205-245. · Zbl 1456.90125
[41] RAMSAY, J. O. (2004). Functional data analysis. Encyclopedia of Statistical Sciences 4.
[42] RASKUTTI, G. and MAHONEY, M. W. (2016). A statistical perspective on randomized sketching for ordinary least-squares. Journal of Machine Learning Research 17 7508-7538. · Zbl 1436.62331
[43] TIAN, T. S. (2010). Functional Data Analysis in Brain Imaging Studies. Frontiers in Psychology 1 35-35.
[44] TROPP, J. A. et al. (2015). An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning 8 1-230. · Zbl 1391.15071
[45] TURK-BROWNE, N. B. (2013). Functional interactions as big data in the human brain. Science 342 580-584.
[46] WANG, H. (2019). More Efficient Estimation for Logistic Regression with Optimal Subsamples. Journal of Machine Learning Research 20 1-59. · Zbl 1441.62194
[47] WANG, S., GITTENS, A. and MAHONEY, M. W. (2017). Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. The Journal of Machine Learning Research 18 8039-8088. · Zbl 1473.62253
[48] WANG, H., ZHU, R. and MA, P. (2018). Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association 113 829-844. · Zbl 1398.62196
[49] WOODRUFF, D. P. (2014). Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357. · Zbl 1316.65046
[50] YANG, Y., PILANCI, M., WAINWRIGHT, M. J. et al. (2017). Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics 45 991-1023. · Zbl 1371.62039
[51] YAO, F., MÜLLER, H.-G. and WANG, J.-L. (2005). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100 577-590. · Zbl 1117.62451
[52] YAO, F., MÜLLER, H.-G., WANG, J.-L. et al. (2005). Functional linear regression analysis for longitudinal data. The Annals of Statistics 33 2873-2903. · Zbl 1084.62096
[53] YUAN, M. and CAI, T. T. (2010). A reproducing kernel Hilbert space approach to functional linear regression. The Annals of Statistics 38 3412-3444. · Zbl 1204.62074
[54] ZHANG, T., NING, Y. and RUPPERT, D. (2021). Optimal sampling for generalized linear models under measurement constraints. Journal of Computational and Graphical Statistics 30 106-114. · Zbl 07499885
[55] ZHAO, G., ZHAO, Y.-H., CHU, Y.-Q., JING, Y.-P. and DENG, L.-C. (2012). LAMOST spectral survey—An overview. Research in Astronomy and Astrophysics 12 723.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.