Yamazaki, Ichitaro; Tomov, Stanimire; Dongarra, Jack
Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs. (English) Zbl 1320.65046
SIAM J. Sci. Comput. 37, No. 3, C307-C330 (2015).

Summary: To orthonormalize the columns of a dense matrix, the Cholesky QR (CholQR) factorization requires only one global reduction between the parallel processing units and performs most of its computation using BLAS-3 kernels. As a result, compared to other orthogonalization algorithms, CholQR obtains superior performance on many current computer architectures, where communication is becoming increasingly expensive relative to arithmetic operations. This is especially true when the input matrix is tall and skinny. Unfortunately, the orthogonality error of CholQR depends quadratically on the condition number of the input matrix, so the method is numerically unstable when the matrix is ill-conditioned. To enhance the stability of CholQR, we recently used mixed-precision arithmetic: the input and output matrices are stored in the working precision, but some of the intermediate results are accumulated in doubled precision. In this paper, we analyze the numerical properties of this mixed-precision CholQR. Our analysis shows that by selectively using the doubled precision, the orthogonality error of the mixed-precision CholQR depends only linearly on the condition number of the input matrix. We provide numerical results demonstrating the improved numerical stability of the mixed-precision CholQR in practice. We then study its performance. When the target hardware does not support the desired higher precision, software emulation is needed. For example, using software-emulated double-double precision for the working 64-bit double precision, the mixed-precision CholQR requires about \(8.5\times\) more floating-point instructions than the standard CholQR.
On the other hand, the increase in the communication cost of using double-double precision is less significant, and our performance results on a multicore CPU with different graphics processing units (GPUs) demonstrate that the overhead of the double-double arithmetic is decreasing on newer architectures, where computation is becoming less expensive relative to communication. As a result, on the latest NVIDIA GPU, the mixed-precision CholQR was only \(1.4\times\) slower than the standard CholQR. Finally, we present case studies of using the mixed-precision CholQR within communication-avoiding variants of Krylov subspace projection methods for solving a nonsymmetric linear system of equations and a symmetric eigenvalue problem on a multicore CPU with multiple GPUs. These case studies demonstrate that by using the higher precision for this small but critical segment of the Krylov methods, we can improve not only the overall numerical stability of the solvers but also, in some cases, their performance.

Cited in 20 Documents

MSC:
65F05 Direct numerical methods for linear systems and matrix inversion
65Y05 Parallel numerical computation
65Y10 Numerical algorithms for specific classes of architectures
65F25 Orthogonalization in numerical linear algebra

Keywords: mixed-precision; orthogonalization; GPU computation

Software: CholQR; LBNL
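The CholQR scheme summarized above (one Gram-matrix product, one small Cholesky factorization, one triangular solve) and the effect of accumulating the Gram matrix in a higher precision can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: NumPy has no double-double type, so as an analogue we take float32 as the working precision and float64 as the "doubled" precision; the function name, test matrix, and dimensions are illustrative choices.

```python
import numpy as np

def cholqr(A, gram_dtype=None):
    """Cholesky QR sketch: the Gram-matrix product A^T A corresponds to
    the single global reduction in a parallel setting, and all three
    steps are BLAS-3.  If gram_dtype is a wider type, the Gram matrix is
    accumulated in that precision and rounded back to the working
    precision, mimicking the mixed-precision scheme."""
    wp = A.dtype
    if gram_dtype is None:
        gram_dtype = wp
    Ah = A.astype(gram_dtype)
    B = Ah.T @ Ah                            # Gram matrix (one reduction)
    R = np.linalg.cholesky(B).T.astype(wp)   # B = R^T R, R upper triangular
    Q = np.linalg.solve(R.T, A.T).T          # Q = A R^{-1} (triangular solve)
    return Q, R

# Ill-conditioned tall-skinny test matrix with condition number ~1e3.
rng = np.random.default_rng(0)
m, n = 1000, 20
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = ((U * np.logspace(0, -3, n)) @ V.T).astype(np.float32)

Q1, _ = cholqr(A)                          # standard CholQR
Q2, _ = cholqr(A, gram_dtype=np.float64)   # mixed-precision CholQR

err = lambda Q: np.linalg.norm(Q.T @ Q - np.eye(n))
# Standard CholQR loses orthogonality like kappa(A)^2 * u; the
# mixed-precision variant behaves like kappa(A) * u, so err(Q2) << err(Q1).
print(err(Q1), err(Q2))
```

Under these illustrative settings the gap between the two orthogonality errors grows with the condition number of the test matrix, which is the qualitative behavior the paper's analysis predicts.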