
\(e\)PCA: high dimensional exponential family PCA. (English) Zbl 1411.62376

Summary: Many applications involve large datasets with entries from exponential family distributions. Our main motivating application is photon-limited imaging, where we observe images with Poisson distributed pixels. We focus on X-ray Free Electron Lasers (XFEL), a quickly developing technology whose goal is to reconstruct molecular structure. In XFEL, estimating the principal components of the noiseless distribution is needed for denoising and for structure determination. However, the standard method, Principal Component Analysis (PCA), can be inefficient in non-Gaussian noise.
Motivated by this application, we develop \(e\)PCA (exponential family PCA), a new methodology for PCA on exponential families. \(e\)PCA is a fast method that can be used very generally for dimension reduction and denoising of large data matrices with exponential family entries.
We conduct a substantive XFEL data analysis using \(e\)PCA. We show that \(e\)PCA estimates the PCs of the distribution of images more accurately than PCA and alternatives. Importantly, it also leads to better denoising. We also provide theoretical justification for our estimator, including the convergence rate and the Marchenko-Pastur law in high dimensions. An open-source implementation is available.
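One way to see why plain PCA can struggle with Poisson pixels: by the law of total covariance, if \(Y \mid X\) has independent Poisson entries with mean \(X\), then \(\operatorname{Cov}(Y) = \operatorname{Cov}(X) + \operatorname{diag}(\mathbb{E}X)\), so the sample covariance of the counts carries an extra diagonal term. The Python sketch below illustrates only this diagonal correction on simulated data. It is not the authors' implementation, and the function name, the rank-one intensity model, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def debiased_poisson_covariance(Y):
    """Sample covariance of the counts minus diag(sample mean).

    Assumes Y | X has independent Poisson entries with mean X, so that
    Cov(Y) = Cov(X) + diag(E[X]); subtracting the diagonal of sample
    means then gives an estimate of Cov(X).
    """
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=0)            # per-pixel sample mean
    S = np.cov(Y, rowvar=False)      # p x p sample covariance of the counts
    return S - np.diag(mean)


rng = np.random.default_rng(0)
n, p = 5000, 20

# Hypothetical rank-one intensity model (an assumption for this demo):
# X_i = base + z_i * direction, with entries kept inside [4, 6].
base = np.full(p, 5.0)
direction = np.linspace(-1.0, 1.0, p)
z = rng.uniform(-1.0, 1.0, size=n)
X = base + np.outer(z, direction)

# Photon counts: one Poisson draw per pixel with the clean intensity as mean.
Y = rng.poisson(X)

# Population covariance of the clean intensities; Var(z) = 1/3 for Uniform(-1, 1).
true_cov = (1.0 / 3.0) * np.outer(direction, direction)

# Frobenius-norm errors of the plain and the debiased covariance estimates.
err_plain = np.linalg.norm(np.cov(Y, rowvar=False) - true_cov)
err_debiased = np.linalg.norm(debiased_poisson_covariance(Y) - true_cov)
print(f"plain: {err_plain:.2f}, debiased: {err_debiased:.2f}")
```

In this simulation the plain sample covariance is dominated by the diagonal noise term, whose entries are close to the mean intensity of 5, while the debiased estimate is left with sampling error only. The principal components of the clean images would then be read off from the eigenvectors of the corrected matrix.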

MSC:

62P35 Applications of statistics to physics
62H35 Image analysis in multivariate analysis
