×

Communication lower bounds and optimal algorithms for numerical linear algebra. (English) Zbl 1396.65082

Summary: The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or between parallel processors over a network. In this paper we summarize recent progress in three aspects of this problem. First we describe lower bounds on communication. Some of these generalize known lower bounds for dense classical \((O(n^3))\) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices. We also present lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices. Second, we compare these lower bounds to widely used versions of these algorithms, and note that these widely used algorithms usually communicate asymptotically more than is necessary. Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice.

MSC:

65Fxx Numerical linear algebra
65Y20 Complexity and performance of numerical algorithms
65-02 Research exposition (monographs, survey articles) pertaining to numerical analysis
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Aasen, J. O., On the reduction of a symmetric matrix to tridiagonal form, BIT Numer. Math., 11, 233-242, (1971) · Zbl 0242.65032
[2] Abdelmalek, N. N., Round off error analysis for Gram-Schmidt method and solution of linear least squares problems, BIT Numer. Math., 11, 345-367, (1971) · Zbl 0236.65031
[3] Agarwal, R.; Balle, S.; Gustavson, F.; Joshi, M.; Palkar, P., A three-dimensional approach to parallel matrix multiplication, IBM J. Res. Dev., 39, 575-582, (1995)
[4] Agarwal, R.; Gustavson, F.; Zubair, M., A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication, IBM J. Res. Dev., 38, 673-681, (1994)
[5] Aggarwal, A.; Vitter, J., The input/output complexity of sorting and related problems, Comm. Assoc. Comput. Mach., 31, 1116-1127, (1988)
[6] Aggarwal, A.; Chandra, A. K.; Snir, M., Communication complexity of prams, Theoret. Comput. Sci., 71, 3-28, (1990) · Zbl 0699.68054
[7] Ahmed, N.; Pingali, K., Proc. 6th International Euro-Par Conference on Parallel Processing, Automatic generation of block-recursive codes, 368-378, (2000), Springer
[8] Anderson, E.; Bai, Z.; Bischof, C.; Demmel, J.; Dongarra, J.; Croz, J. D.; Greenbaum, A.; Hammarling, S.; Mckenney, A.; Ostrouchov, S.; Sorensen, D., LAPACK Users’ Guide, (1992), SIAM
[9] Anderson, M.; Ballard, G.; Demmel, J.; Keutzer, K., Proc. 2011 IEEE International Parallel and Distributed Processing Symposium: IPDPS ’11, Communication-avoiding QR decomposition for GPUs, 48-58, (2011)
[10] Arnoldi, W. E., The principle of minimized iterations in the solution of the matrix eigenvalue problem, Quart. Appl. Math., 9, 17-29, (1951) · Zbl 0042.12801
[11] Bai, Z.; Day, D.; Bai, Z.; Demmel, J. W.; Dongarra, J. J.; Ruhe, A.; Van Der Vorst, H., Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, Block Arnoldi method, 196-204, (2000), SIAM
[12] Bai, Z.; Day, D.; Demmel, J.; Dongarra, J., A test matrix collection for non-Hermitian eigenvalue problems, (1997), University of Tennessee
[13] Bai, Z.; Demmel, J.; Gu, M., An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems, Numer. Math., 76, 279-308, (1997) · Zbl 0876.65021
[14] Bai, Z.; Hu, D.; Reichel, L., Proc. Fifth SIAM Conference on Parallel Processing for Scientific Computing, An implementation of the GMRES method using QR factorization, 84-91, (1991)
[15] Bai, Z.; Hu, D.; Reichel, L., A Newton basis GMRES implementation, IMA J. Numer. Anal., 14, 563-581, (1994) · Zbl 0818.65022
[16] Ballard, G., Avoiding communication in dense linear algebra, (2013), EECS Department, UC Berkeley
[17] Ballard, G.; Becker, D.; Demmel, J.; Dongarra, J.; Druinsky, A.; Peled, I.; Schwartz, O.; Toledo, S.; Yamazaki, I., Communication-avoiding symmetric-indefinite factorization. Technical report UCB/EECS-2013-127, (2013), EECS Department, UC Berkeley
[18] Ballard, G.; Becker, D.; Demmel, J.; Dongarra, J.; Druinsky, A.; Peled, I.; Schwartz, O.; Toledo, S.; Yamazaki, I., Proc. 27th IEEE International Parallel Distributed Processing Symposium: IPDPS ’13, Implementing a blocked Aasen’s algorithm with a dynamic scheduler on multicore architectures, 895-907, (2013)
[19] Ballard, G.; Buluç, A.; Demmel, J.; Grigori, L.; Lipshitz, B.; Schwartz, O.; Toledo, S., Proc. 25th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA ’13, Communication optimal parallel multiplication of sparse random matrices, 222-231, (2013), ACM
[20] Ballard, G.; Demmel, J.; Dumitriu, I., Communication-optimal parallel and sequential eigenvalue and singular value algorithms, UC Berkeley
[21] Ballard, G.; Demmel, J.; Gearhart, A., Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA ’11, Brief announcement: Communication bounds for heterogeneous architectures, 257-258, (2011), ACM
[22] Ballard, G.; Demmel, J.; Knight, N., Proc. 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: PPoPP ’12, Communication avoiding successive band reduction, 35-44, (2012), ACM
[23] Ballard, G.; Demmel, J.; Knight, N., Avoiding communication in successive band reduction, (2013), EECS Department, UC Berkeley
[24] Ballard, G.; Demmel, J.; Grigori, L.; Jacquelin, M.; Nguyen, H. D.; Solomonik, E., Proc. 2014 IEEE International Parallel and Distributed Processing Symposium: IPDPS ’14, Reconstructing Householder vectors from Tall-Skinny QR, (2014)
[25] Ballard, G.; Demmel, J.; Holtz, O.; Schwartz, O., Communication-optimal parallel and sequential Cholesky decomposition, SIAM J. Sci. Comput., 32, 3495-3523, (2010) · Zbl 1238.65018
[26] Ballard, G.; Demmel, J.; Holtz, O.; Schwartz, O., Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA ’11, Graph expansion and communication costs of fast matrix multiplication, 1-12, (2011), ACM
[27] Ballard, G.; Demmel, J.; Holtz, O.; Schwartz, O., Minimizing communication in numerical linear algebra, SIAM J. Matrix Anal. Appl., 32, 866-901, (2011) · Zbl 1246.68128
[28] Ballard, G.; Demmel, J.; Holtz, O.; Schwartz, O., Graph expansion and communication costs of fast matrix multiplication, J. Assoc. Comput. Mach., 59, (2012) · Zbl 1281.68241
[29] Ballard, G.; Demmel, J.; Holtz, O.; Schwartz, O., Sequential communication bounds for fast linear algebra, (2012), UC Berkeley
[30] Ballard, G.; Demmel, J.; Holtz, O.; Lipshitz, B.; Schwartz, O., Proc. 24th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA ’12, Brief announcement: Strong scaling of matrix multiplication algorithms and memory independent communication lower bounds, 77-79, (2012), ACM
[31] Ballard, G.; Demmel, J.; Holtz, O.; Lipshitz, B.; Schwartz, O., Proc. 24th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA ’12, Communication-optimal parallel algorithm for Strassen’s matrix multiplication, 193-204, (2012), ACM
[32] Ballard, G.; Demmel, J.; Holtz, O.; Lipshitz, B.; Schwartz, O.; Even, G.; Rawitz, D., Design and Analysis of Algorithms, 7659, Graph expansion analysis for communication costs of fast rectangular matrix multiplication, 13-36, (2012), Springer · Zbl 1385.68057
[33] Ballard, G.; Demmel, J.; Lipshitz, B.; Schwartz, O.; Toledo, S., Proc. 25th ACM Symposium on Parallelism in Algorithms and Architectures: SPAA ’13, Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout, 232-240, (2013), ACM
[34] Barrett, R.; Berry, M.; Chan, T. F.; Demmel, J. W.; Donato, J.; Dongarra, J. J.; Eijkhout, V.; Pozo, R.; Romine, C.; Van Der Vorst, H., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, (1994), SIAM
[35] Bender, M. A.; Brodal, G. S.; Fagerberg, R.; Jacob, R.; Vicari, E., Optimal sparse matrix dense vector multiplication in the I/O-model, Theory Comput. Syst., 47, 934-962, (2010) · Zbl 1213.68069
[36] Bennett, J.; Carbery, A.; Christ, M.; Tao, T., Finite bounds for holder-brascamp-Lieb multilinear inequalities, Math. Res. Lett., 17, 647-666, (2010) · Zbl 1247.26029
[37] Berntsen, J., Communication efficient matrix multiplication on hypercubes, Parallel Comput., 12, 335-342, (1989) · Zbl 0689.65024
[38] Bilardi, G.; Preparata, F. P., Processor-time tradeoffs under bounded-speed message propagation II: lower bounds, Theory Comput. Syst., 32, 531-559, (1999) · Zbl 0951.68003
[39] Bilardi, G.; Pietracaprina, A.; D’Alberto, P.; Brandes, U.; Wagner, D., Graph-Theoretic Concepts in Computer Science: 26th International Workshop, 1928, On the space and access complexity of computation DAGs, 47-58, (2000), Springer
[40] Bischof, C.; Loan, C. Van, The WY representation for products of Householder matrices, SIAM J. Sci. Statist. Comput., 8, 2-13, (1987) · Zbl 0628.65033
[41] Bischof, C. H.; Lang, B.; Sun, X., Algorithm 807: the SBR toolbox, software for successive band reduction, ACM Trans. Math. Softw., 26, 602-616, (2000) · Zbl 1365.65104
[42] Bischof, C.; Lang, B.; Sun, X., A framework for symmetric band reduction, ACM Trans. Math. Softw., 26, 581-601 · Zbl 1365.65103
[43] Björck, A., Solving linear least squares problems by Gram-Schmidt orthogonalization, BIT Numer. Math., 7, 1-21, (1967) · Zbl 0183.17802
[44] Blackford, L. S.; Choi, J.; Cleary, A.; D’Azevedo, E.; Demmel, J.; Dhillon, I.; Dongarra, J.; Hammarling, S.; Henry, G.; Petitet, A.; Stanley, K.; Walker, D.; Whaley, R. C., ScaLAPACK Users’ Guide, (1997), SIAM
[45] Blackford, L. S.; Demmel, J.; Dongarra, J.; Duff, I.; Hammarling, S.; Henry, G.; Heroux, M.; Kaufman, L.; Lumsdaine, A.; Petitet, A.; Pozo, R.; Remington, K.; Whaley, R. C., An updated set of basic linear algebra subroutines (BLAS), J. ACM Trans. Math. Softw., 28, 135-151, (2002) · Zbl 1070.65520
[46] Börm, S.; Grasedyck, L., (2006)
[47] Börm, S.; Grasedyck, L.; Hackbusch, W., (2004)
[48] Braman, K.; Byers, R.; Mathias, R., The multishift QR algorithm I: maintaining well-focused shifts and level 3 performance, SIAM J. Matrix Anal. Appl., 23, 929-947, (2002) · Zbl 1017.65031
[49] Braman, K.; Byers, R.; Mathias, R., The multishift QR algorithm II: aggressive early deflation, SIAM J. Matrix Anal. Appl., 23, 948-973, (2002) · Zbl 1017.65032
[50] Bunch, J.; Kaufman, L., Some stable methods for calculating inertia and solving symmetric linear systems, Math. Comp., 31, 163-179, (1977) · Zbl 0355.65023
[51] Buttari, A.; Langou, J.; Kurzak, J.; Dongarra, J. J., (2007)
[52] Byun, J.; Lin, R.; Yelick, K.; Demmel, J., Autotuning sparse matrix-vector multiplication for multicore, (2012), EECS Department, UC Berkeley
[53] Cannon, L., A cellular computer to implement the Kalman filter algorithm, (1969), Montana State University
[54] Carson, E.; Demmel, J., A residual replacement strategy for improving the maximum attainable accuracy of s-step Krylov subspace methods, SIAM J. Matrix Anal. Appl., 35, 22-43, (2014) · Zbl 1302.65075
[55] Carson, E.; Knight, N.; Demmel, J., Avoiding communication in non-symmetric Lanczos-based Krylov subspace methods, SIAM J. Sci. Comput., 35, S42-S61, (2013) · Zbl 1281.65057
[56] Catalyurek, U. V.; Aykanat, C., Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems, 10, 673-693, (1999)
[57] Catalyiirek, Ü. V.; Aykanat, C., Proc. 15th IEEE International Parallel and Distributed Processing Symposium: IPDPS ’01, A fine-grain hypergraph model for 2D decomposition of sparse matrices, (2001)
[58] Chan, E.; Heimlich, M.; Purkayastha, A.; Van De Geijn, R., Collective communication: theory, practice, and experience, Concurrency and Computation: Pract. Exp., 19, 1749-1783, (2007)
[59] Chow, E., apriori sparsity patterns for parallel sparse approximate inverse preconditioners, SIAM J. Sci. Comput., 21, 1804-1822, (2000) · Zbl 0957.65023
[60] Chow, E., Parallel implementation and practical use of sparse approximate inverses with a priori sparsity patterns, Internat. J. High Perf. Comput. Appl., 15, 56-74, (2001)
[61] Christ, M.; Demmel, J.; Knight, N.; Scanlon, T.; Yelick, K., Communication lower bounds and optimal algorithms for programs that reference arrays, (2013), EECS Department, UC Berkeley
[62] Chronopoulos, A.; Gear, C., On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy, Parallel Comput., 11, 37-53, (1989) · Zbl 0679.65020
[63] Chronopoulos, A.; Gear, C., s-step iterative methods for symmetric linear systems, J. Comput. Appl. Math., 25, 153-168, (1989) · Zbl 0669.65021
[64] Chronopoulos, A.; Swanson, C., Parallel iterative s-step methods for unsymmetric linear systems, Parallel Comput., 22, 623-641, (1996) · Zbl 0873.65019
[65] Cohen, E., Proc. 35th Ann. Symp. Found. Comp. Sci., Estimating the size of the transitive closure in linear time, 190-200, (1994), IEEE
[66] Cohen, E., Size-estimation framework with applications to transitive closure and reachability, J. Comput. System Sci., 55, 441-453, (1997) · Zbl 0897.68075
[67] Frontiers in Massive Data Analysis, (2013), The National Academies Press
[68] Da Cunha, R. D.; Becker, D.; Patterson, J. C., Euro-Par 2002 Parallel Processing, New parallel (rank-revealing) QR factorization algorithms, 677-686, (2002), Springer · Zbl 1068.65504
[69] Datta, K.; Murphy, M.; Volkov, V.; Williams, S.; Carter, J.; Oliker, L.; Patterson, D.; Shalf, J.; Yelick, K., Proc. 2008 ACM/IEEE Conference on Supercomputing, Stencil computation optimization and autotuning on state-of-the-art multicore architectures, 4, (2008), IEEE Press
[70] Davis, T.; Hu, Y., The university of florida sparse matrix collection, ACM Trans. Math. Softw., 38, 1-25, (2011) · Zbl 1365.65123
[71] Dekel, E.; Nassimi, D.; Sahni, S., Parallel matrix and graph algorithms, SIAM J. Comput., 10, 657-675, (1981) · Zbl 0468.68044
[72] Demmel, J., Applied Numerical Linear Algebra, (1997), SIAM · Zbl 0879.65017
[73] Demmel, J., An arithmetic complexity lower bound for computing rational functions, with applications to linear algebra, (2013), EECS Department, UC Berkeley
[74] Demmel, J.; Dumitriu, I.; Holtz, O., Fast linear algebra is stable, Numer. Math., 108, 59-91, (2007) · Zbl 1133.65015
[75] Demmel, J.; Dumitriu, I.; Holtz, O.; Kleinberg, R., Fast matrix multiplication is stable, Numer. Math., 106, 199-224, (2007) · Zbl 1134.65030
[76] Demmel, J.; Eliahu, D.; Fox, A.; Kamil, S.; Lipshitz, B.; Schwartz, O.; Spillinger, O., Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS ’13, Communication-optimal parallel recursive rectangular matrix multiplication, 261-272, (2013)
[77] Demmel, J.; Gearhart, A.; Lipshitz, B.; Schwartz, O., Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS ’13, Perfect strong scaling using no additional energy, 649-660, (2013)
[78] Demmel, J.; Grigori, L.; Gu, M.; Xiang, H., Communication avoiding rank revealing QR factorization with column pivoting, (2013), EECS Department, UC Berkeley · Zbl 1327.65078
[79] Demmel, J.; Grigori, L.; Hoemmen, M.; Langou, J., (2008)
[80] Demmel, J.; Grigori, L.; Hoemmen, M.; Langou, J., Communication-optimal parallel and sequential QR and LU factorizations, SIAM J. Sci. Comput., 34, A206-A239, (2012) · Zbl 1241.65028
[81] Demmel, J.; Hoemmen, M.; Mohiyuddin, M.; Yelick, K., Avoiding communication in computing Krylov subspaces, (2007), EECS Department, UC Berkeley
[82] Demmel, J.; Hoemmen, M.; Mohiyuddin, M.; Yelick, K., Proc. 2008 IEEE International Parallel and Distributed Processing Symposium: IPDPS 2008, Avoiding communication in sparse matrix computations, 1-12, (2008)
[83] Demmel, J.; Marques, O.; Parlett, B.; Vomel, C., Performance and accuracy of LAPACK’s symmetric tridiagonal eigensolvers, SIAM J. Sci. Comput., 30, 1508-1526, (2008) · Zbl 1165.65014
[84] Devine, K. D.; Boman, E. G.; Heaphy, R. T.; Bisseling, R. H.; Catalyurek, U. V., Proc. 20th IEEE International Parallel and Distributed Processing Symposium: IPDPS 2006, Parallel hypergraph partitioning for scientific computing, (2006)
[85] Dongarra, J. J.; Croz, J. D.; Duff, I. S.; Hammarling, S., Algorithm 679: A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Softw., 16, 18-28, (1990) · Zbl 0900.65116
[86] Dongarra, J. J.; Croz, J. D.; Duff, I. S.; Hammarling, S., A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Softw., 16, 1-17, (1990) · Zbl 0900.65115
[87] Dongarra, J. J.; Croz, J. D.; Hammarling, S.; Hanson, R. J., Algorithm 656: an extended set of Fortran basic linear algebra subprograms, ACM Trans. Math. Softw., 14, 18-32, (1988) · Zbl 0639.65017
[88] Dongarra, J. J.; Croz, J. D.; Hammarling, S.; Hanson, R. J., An extended set of Fortran basic linear algebra subprograms, ACM Trans. Math. Softw., 14, 1-17, (1988) · Zbl 0639.65016
[89] Dongarra, J. J.; Moler, C. B.; Bunch, J. R.; Stewart, G. W., LINPACK Users’ Guide, (1979), SIAM · Zbl 0476.68025
[90] Douglas, C. C.; Hu, J.; Kowarschik, M.; Rüde, U.; Weiβ, C., Cache optimization for structured and unstructured grid multigrid, Electron. Trans. Numer. Anal., 10, 21-40, (2000) · Zbl 0949.65099
[91] Driscoll, M.; Georganas, E.; Koanantakool, P.; Solomonik, E.; Yelick, K., Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS ’13, A communication-optimal W-body algorithm for direct interactions, 1075-1084, (2013)
[92] Dursun, H.; Nomura, K.-I.; Peng, L.; Seymour, R.; Wang, W.; Kalia, R. K.; Nakano, A.; Vashishta, P., Euro-Par 2009 Parallel Processing, A multilevel parallelization framework for highorder stencil computations, 642-653, (2009), Springer
[93] Elmroth, E.; Gustavson, F.; Kågström, B., Applied Parallel Computing: Large Scale Scientific and Industrial Problems, 1541, New serial and parallel recursive QR factorization algorithms for SMP systems, 120-128, (1998), Springer
[94] Floyd, R., Algorithm 97: shortest path, Commun. Assoc. Comput. Mach., 5, 345, (1962)
[95] Frens, J. D.; Wise, D. S., QR factorization with morton-ordered quadtree matrices for memory re-use and parallelism, ACM SIGPLAN Notices, 38, 144-154, (2003)
[96] Frigo, M.; Strumpen, V., Proc. 19th Annual International Conference on Supercomputing, Cache oblivious stencil computations, 361-366, (2005), ACM
[97] Frigo, M.; Strumpen, V., The cache complexity of multithreaded cache oblivious algorithms, Theory Comput. Syst., 45, 203-233, (2009) · Zbl 1183.68721
[98] Frigo, M.; Leiserson, C. E.; Prokop, H.; Ramachandran, S., Proc. 40th Annual Symposium on Foundations of Computer Science: FOCS ’99, Cache-oblivious algorithms, 285-297, (1999), IEEE Computer Society
[99] Fuller, S. H.; Millett, L. I., The Future of Computing Performance: Game Over or Next Level?, (2011), The National Academies Press
[100] Gannon, D.; Rosendale, J. Van, On the impact of communication complexity on the design of parallel numerical algorithms, Trans. Comput., 100, 1180-1194, (1984) · Zbl 0546.68028
[101] Georganas, E.; Gonzalez-Dominguez, J.; Solomonik, E.; Zheng, Y.; Tourino, J.; Yelick, K., Proc. International Conference for High Performance Computing, Networking, Storage and Analysis: SC ’12, Communication avoiding and overlapping for numerical linear algebra, 1-11, (2012)
[102] George, A., Nested dissection of a regular finite element mesh, SIAM J. Numer. Anal., 10, 345-363, (1973) · Zbl 0259.65087
[103] Gilbert, J. R.; Tarjan, R. E., The analysis of a nested dissection algorithm, Numer. Math., 377-404, (1987) · Zbl 0645.65012
[104] Giraud, L.; Langou, J., A robust criterion for the modified Gram-Schmidt algorithm with selective reorthogonalization, SIAM J. Sci. Comput., 25, 417-441, (2003) · Zbl 1042.65033
[105] Giraud, L.; Langou, J.; Rozloznik, M., The loss of orthogonality in the Gram-Schmidt orthogonalization process, Comput. Math. Appl., 50, 1069-1075, (2005) · Zbl 1085.65037
[106] Golub, G.; Loan, C. Van, Matrix Computations, (1996), Johns Hopkins University Press · Zbl 0865.65009
[107] Golub, G. H.; Plemmons, R. J.; Sameh, A., High-Speed Computing: Scientific Applications and Algorithm Design, Parallel block schemes for large-scale least-squares computations, 171-179, (1988), University of Illinois Press
[108] Graham, S. L.; Snir, M.; Patterson, C. A., Getting up to Speed: The Future of Supercomputing, (2004), The National Academies Press
[109] Granat, R.; Kågström, B.; Kressner, D.; Shao, M., Parallel library software for the multishift QR algorithm with aggressive early deflation, (2012), Umea University · Zbl 1347.65070
[110] Greenbaum, A., Estimating the attainable accuracy of recursively computed residual methods, SIAM J. Matrix Anal. Appl., 18, 535-551, (1997) · Zbl 0873.65027
[111] Greenbaum, A., Iterative Methods for Solving Linear Systems, (1997), SIAM · Zbl 0883.65022
[112] Greenbaum, A.; Rozložník, M.; Strakoš, Z., Numerical behavior of the modified Gram-Schmidt GMRES implementation, BIT Numer. Math., 37, 706-719, (1997) · Zbl 0891.65031
[113] Greiner, G.; Jacob, R., Mathematical Foundations of Computer Science 2010, Evaluating non-square sparse bilinear forms on multiple vector pairs in the I/O-model, 393-404, (2010), Springer · Zbl 1287.68055
[114] Greiner, G.; Jacob, R., LATIN 2010: Theoretical Informatics, The I/O complexity of sparse matrix dense matrix multiplication, 143-156, (2010), Springer · Zbl 1283.68163
[115] Grigori, L.; Moufawad, S., Communication avoiding ILU(0) preconditioned Research report RR-8266, (2013), INRIA
[116] Grigori, L.; David, P.-Y.; Demmel, J.; Peyronnet, S., Proc. 22nd ACM Symposium on Parallelism in Algorithms and Architectures: SPAA ’10, Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem, 79-81, (2010), ACM
[117] Grigori, L.; Demmel, J.; Xiang, H., CALU: A communication optimal LU factorization algorithm, SIAM J. Matrix Anal. Appl., 32, 1317-1350, (2011) · Zbl 1242.65089
[118] Gu, M.; Eisenstat, S., Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM J. Sci. Comput., 17, 848-869, (1996) · Zbl 0858.65044
[119] Gunter, B. C.; Van De Geijn, R. A., Parallel out-of-core computation and updating of the QR factorization, ACM Trans. Math. Softw., 31, 60-78, (2005) · Zbl 1073.65023
[120] Gustavson, F. G., Recursion leads to automatic variable blocking for dense linear-algebra algorithms, IBM J. Res. Dev., 41, 737-756, (1997)
[121] Gutknecht, M., Acta Numerica, 6, Lanczos-type solvers for nonsymmetric linear systems of equations, 271-398, (1997), Cambridge University Press · Zbl 0888.65030
[122] Gutknecht, M.; Ressel, K., Look-ahead procedures for Lanczos-type product methods based on three-term Lanczos recurrences, SIAM J. Matrix Anal. Appl., 21, 1051-1078, (2000) · Zbl 0961.65025
[123] Gutknecht, M.; Strakoš, Z., Accuracy of two three-term and three two-term recurrences for Krylov space solvers, SIAM J. Matrix Anal. Appl., 22, 213-229, (2000) · Zbl 0976.65030
[124] Hackbusch, W., (2006)
[125] Haidar, A.; Luszczek, P.; Kurzak, J.; Dongarra, J., (2013)
[126] Hestenes, M. R.; Stiefel, E., Methods of conjugate gradients for solving linear systems, J. Res. Nat. Bur. Standards, 49, 409-436, (1952) · Zbl 0048.09901
[127] Hindmarsh, A.; Walker, H., Note on a Householder implementation of the GMRES method, (1986), Lawrence Livermore National Laboratory
[128] Hoemmen, M., Communication-avoiding Krylov subspace methods, (2010), EECS Department, UC Berkeley
[129] Hoffman, A. J.; Martin, M. S.; Rose, D. J., Complexity bounds for regular finite difference and finite element grids, SIAM J. Numer. Anal., 10, 364-369, (1973) · Zbl 0261.65026
[130] Hong, J. W.; Kung, H. T., Proc. 13th Annual ACM Symposium on Theory of Computing: STOC ’81, I/O complexity: The red-blue pebble game, 326-333, (1981), ACM
[131] Howell, G. W.; Demmel, J.; Fulton, C. T.; Hammarling, S.; Marmol, K., Cache efficient bidiagonalization using BLAS 2.5 operators, ACM Trans. Math. Softw., 34, (2008) · Zbl 1190.65056
[132] Hupp, P.; Jacob, R.; Dehne, F.; Solis-Oba, R.; Sack, J.-R., Algorithms and Data Structures, 8037, Tight bounds for low dimensional star stencils in the external memory model, 415-426, (2013), Springer · Zbl 1391.68010
[133] Irigoin, F.; Triolet, R., Proc. 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Supernode partitioning, 319-329, (1988), ACM
[134] Irony, D.; Toledo, S.; Tiskin, A., Communication lower bounds for distributed-memory matrix multiplication, J. Parallel Distrib. Comput., 64, 1017-1026, (2004) · Zbl 1114.68081
[135] Jalby, W.; Philippe, B., Stability analysis and improvement of the block Gram-Schmidt algorithm, SIAM J. Sci. Statist. Comput., 12, 1058-1073, (1991) · Zbl 0734.65034
[136] Johnsson, S. L., Parallel Comput, Minimizing the communication time for matrix multiplication on multiprocessors, (1992)
[137] Joubert, W.; Carey, G., Parallelizable restarted iterative methods for nonsymmetric linear systems I: theory, Internat. J. Comput. Math., 44, 243-267, (1992) · Zbl 0759.65008
[138] Kågström, B.; Kressner, D.; Shao, M., Applied Parallel and Scientific Computing, On aggressive early deflation in parallel variants of the QR algorithm, 1-10, (2012), Springer
[139] Karlsson, L.; Kågström, B., Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures, Parallel Comput., 37, 771-782, (2011) · Zbl 1248.65043
[140] Kepner, J.; Gilbert, J., Graph Algorithms in the Language of Linear Algebra, 22, (2011), SIAM
[141] Kielbasinski, A., Roczniki Polskiego Towarzystwa Matematycznego, Seria III: Matematyka Stosowana II, Numerical analysis of the Gram-Schmidt orthogonalization algorithm (analiza numeryczna algorytmu ortogonalizacji Grama-Schmidta), 15-35, (1974)
[142] Kim, S.; Chronopoulos, A., An efficient nonsymmetric Lanczos method on parallel vector computers, J. Comput. Appl. Math., 42, 357-374, (1992) · Zbl 0756.65057
[143] Knight, N.; Carson, E.; Demmel, J., Proc. PPAM ’13, 8384, Exploiting data sparsity in parallel matrix powers computations, (2014), Springer
[144] Kolda, T. G.; Bader, B. W., Tensor decompositions and applications, SIAM Review, 51, 455-500, (2009) · Zbl 1173.65029
[145] Lamielle, A.; Strout, M., Enabling code generation within the sparse polyhedral framework, (2010), Colorado State University
[146] Lanczos, C., An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators, (1950), United States Government Press Office · Zbl 0067.33703
[147] Lawson, C. L.; Hanson, R. J.; Kincaid, D.; Krogh, F. T., Basic linear algebra subprograms for Fortran usage, ACM Trans. Math. Softw., 5, 308-323, (1979) · Zbl 0412.65022
[148] Leiserson, C. E.; Rao, S.; Toledo, S., Efficient out-of-core algorithms for linear relaxation using blocking covers, J. Comput. System Sci., 54, 332-344, (1997) · Zbl 0877.68063
[149] Lev, G.; Valiant, L., Size bounds for superconcentrators, Theor. Comput. Sci., 22, 233-251, (1983) · Zbl 0497.68022
[150] Lipshitz, B., Communication-avoiding parallel recursive algorithms for matrix multiplication, (2013), EECS Department, UC Berkeley
[151] Lipshitz, B.; Ballard, G.; Demmel, J.; Schwartz, O., Proc. International Conference on High Performance Computing, Networking, Storage and Analysis: SC ’12, Communication-avoiding parallel Strassen: Implementation and performance, (2012)
[152] Loomis, L. H.; Whitney, H., An inequality related to the isoperimetric inequality, Bull. Amer. Math. Soc., 55, 961-962, (1949) · Zbl 0035.38302
[153] Mccoll, W.; Tiskin, A., Memory-efficient matrix multiplication in the BSP model, Algorithmica, 24, 287-297, (1999) · Zbl 0943.68066
[154] Meurant, G., The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations, (2006), SIAM · Zbl 1110.65029
[155] Meurant, G.; Strakos, Z., Acta Numerica, 15, The Lanczos and conjugate gradient algorithms in finite precision arithmetic, 471-542, (2006), Cambridge University Press · Zbl 1113.65032
[156] Mohanty, S.; Gopalan, S., 2012 19th International Conference on High Performance Computing: HiPC, I/O efficient QR and QZ algorithms, 1-9, (2012)
[157] Mohiyuddin, M., Tuning hardware and software for multiprocessors, (2012), EECS Department, UC Berkeley
[158] Mohiyuddin, M.; Hoemmen, M.; Demmel, J.; Yelick, K., Proc. International Conference on High Performance Computing Networking, Storage and Analysis: SC ’09, Minimizing communication in sparse matrix solvers, (2009)
[159] Nakatsukasa, Y.; Higham, N., Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD, (2012), University of Manchester
[160] Nishtala, R.; Vuduc, R. W.; Demmel, J. W.; Yelick, K. A., When cache blocking of sparse matrix vector multiply works and why, Applicable Algebra in Engineering, Communication and Computing, 18, 297-311, (2007) · Zbl 1122.65043
[161] Park, J.-S.; Penner, M.; Prasanna, V., Optimizing graph algorithms for improved cache performance, IEEE Trans. Parallel Distributed Systems, 15, 769-782, (2004)
[162] Parlett, B., Acta Numerica, 4, The new QD algorithms, 459-491, (1995), Cambridge University Press · Zbl 0835.65059
[163] Parlett, B.; Reid, J., On the solution of a system of linear equations whose matrix is symmetric but not definite, BIT Numer. Math., 10, 386-397, (1970)
[164] Parlett, B.; Taylor, D.; Liu, Z., A look-ahead Lanczos algorithm for unsymmetric matrices, Math. Comp., 44, 105-124, (1985) · Zbl 0564.65022
[165] Pfeifer, C., Data flow and storage allocation for the PDQ-5 program on the Philco-2000, (1963), Pittsburgh
[166] Philippe, B.; Reichel, L., On the generation of Krylov subspace bases, Appl. Numer. Math., 62, 1171-1186, (2012) · Zbl 1253.65049
[167] Puglisi, C., Modification of the Householder method based on compact WY representation, SIAM J. Sci. Statist. Comput., 13, 723-726, (1992) · Zbl 0756.65040
[168] Reichel, L., Newton interpolation at Leja points, BIT, 30, 332-346, (1990) · Zbl 0702.65012
[169] Rozloznik, M.; Shklarski, G.; Toledo, S., Partitioned triangular tridiagonalization, ACM Trans. Math. Softw., 37, (2011) · Zbl 1365.65074
[170] Saad, Y., Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput., 6, 865-881, (1985) · Zbl 0601.65019
[171] Saad, Y., Iterative Methods for Sparse Linear Systems, (2003), SIAM · Zbl 1002.65042
[172] Saad, Y.; Schultz, M. H., GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Statist. Comput., 7, 856-869, (1986) · Zbl 0599.65018
[173] Saad, Y.; Yeung, M.; Erhel, J., A deflated version of the conjugate gradient algorithm, SIAM J. Sci. Comput., 21, 1909-1926, (2000) · Zbl 0955.65021
[174] Savage, J. E., Computing and Combinatorics, 959, Extending the Hong-Kung model to memory hierarchies, 270-281, (1995), Springer
[175] Schatz, M.; Poulson, J.; Van De Geijn, R., Scalable universal matrix multiplication algorithms: 2D and 3D variations on a theme, (2013), University of Texas
[176] Schreiber, R.; Loan, C. Van, A storage-efficient WY representation for products of Householder transformations, SIAM J. Sci. Statist. Comput., 10, 53-57, (1989) · Zbl 0664.65025
[177] Schwarz, H. A., Über einen grenzübergang durch alternierendes verfahren, Vierteljahrsschrift der Naturforschenden Gesellschaft in Zürich, 15, 272-286, (1870)
[178] Scquizzato, M.; Silvestri, F.; Mayr, E. W.; Portier, N., 31st International Symposium on Theoretical Aspects of Computer Science: STACS 2014, 25, Communication lower bounds for distributed-memory computations, 627-638, (2014) · Zbl 1359.68027
[179] Sleijpen, G.; Van Der Vorst, H., Reliable updated residuals in hybrid bi-CG methods, Computing, 56, 141-163, (1996) · Zbl 0842.65018
[180] Smith, B. T.; Boyle, J. M.; Dongarra, J. J.; Garbow, B. S.; Ikebe, Y.; Klema, V. C.; Moler, C. B., Matrix Eigensystem Routines: EISPACK Guide, (1976), Springer · Zbl 0325.65016
[181] Smoktunowicz, A.; Barlow, J. L.; Langou, J., A note on the error analysis of classical Gram-Schmidt, Numer. Math., 105, 299-313, (2006) · Zbl 1108.65021
[182] Solomonik, E.; Demmel, J.; Jeannot, E.; Namyst, R.; Roman, J., Euro-Par 2011 Parallel Processing, 6853, Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms, 90-109, (2011), Springer
[183] Solomonik, E.; Buluc, A.; Demmel, J., Proc. 27th IEEE International Parallel Distributed Processing Symposium: IPDPS ’13, Minimizing communication in all-pairs shortest-paths, 548-559, (2013)
[184] Solomonik, E.; Carson, E.; Knight, N.; Demmel, J., Tradeoffs between synchronization, communication, and work in parallel linear algebra computations, (2014), EECS Department, UC Berkeley
[185] Solomonik, E.; Matthews, D.; Hammond, J.; Demmel, J., Proc. 27th IEEE International Parallel and Distributed Processing Symposium: IPDPS ’13, Cyclops tensor framework: Reducing communication and eliminating load imbalance in massively parallel contractions, 813-824, (2013)
[186] Sorensen, D., Analysis of pairwise pivoting in Gaussian elimination, IEEE Trans. Computers, 34, 274-278, (1985) · Zbl 0551.65014
[187] Sorensen, D., Implicit application of polynomial filters in a k-step Arnoldi method, SIAM J. Matrix Anal. Appl., 13, 357-385, (1992) · Zbl 0763.65025
[188] Stathopoulos, A.; Wu, K., A block orthogonalization procedure with constant synchronization requirements, SIAM J. Sci. Comput., 23, 2165-2182, (2002) · Zbl 1018.65050
[189] Stewart, G., Block Gram-Schmidt orthogonalization, SIAM J. Sci. Comput., 31, 761-775, (2008) · Zbl 1185.65069
[190] Strassen, V., Gaussian elimination is not optimal, Numer. Math., 13, 354-356, (1969) · Zbl 0185.40101
[191] Strout, M. M.; Carter, L.; Ferrante, J., Computational Science: ICCS 2001, Rescheduling for locality in sparse matrix computations, 137-146, (2001), Springer
[192] Sturler, E., A performance model for Krylov subspace methods on meshbased parallel computers, Parallel Comput., 22, 57-74, (1996) · Zbl 0873.65017
[193] Tang, Y.; Chowdhury, R. A.; Kuszmaul, B. C.; Luk, C.-K.; Leiserson, C. E., Proc. 23rd ACM Symposium on Parallelism in Algorithms and Architectures, The Pochoir stencil compiler, 117-128, (2011), ACM
[194] Thakur, R.; Rabenseifner, R.; Gropp, W., Optimization of collective communication operations in MPICH, Internat. J. High Performance Comput. Appl., 19, 49-66, (2005)
[195] Tiskin, A., Bulk-synchronous parallel Gaussian elimination, J. Math. Sci., 108, 977-991, (2002) · Zbl 0997.65047
[196] Tiskin, A., Communication-efficient parallel generic pairwise elimination, Future Generation Computer Systems, 23, 179-188, (2007)
[197] Toledo, S., Quantitative performance modeling of scientific computations and creating locality in numerical algorithms, (1995), MIT
[198] Toledo, S., Locality of reference in LU decomposition with partial pivoting, SIAM J. Matrix Anal. Appl., 18, 1065-1081, (1997) · Zbl 0890.65025
[199] Tong, C.; Ye, Q., Analysis of the finite precision bi-conjugate gradient algorithm for nonsymmetric linear systems, Math. Comp., 69, 1559-1576, (2000) · Zbl 0953.65017
[200] Trefethen, L.; Schreiber, R., Average-case stability of Gaussian elimination, SIAM J. Matrix Anal. Appl., 11, 335-360, (1990) · Zbl 0703.65015
[201] Van De Geijn, R. A.; Watts, J., SUMMA: scalable universal matrix multiplication algorithm, Concurrency: Pract. Exp., 9, 255-274, (1997)
[202] Der Vorst, H. Van; Ye, Q., Residual replacement strategies for Krylov subspace iterative methods for the convergence of true residuals, SIAM J. Sci. Comput., 22, 835-852, (1999) · Zbl 0983.65039
[203] Rosendale, J. Van, Minimizing inner product data dependencies in conjugate gradient iteration, (1983), ICASE-NASA
[204] Vanderstraeten, D., Euro-Par99 Parallel Processing, A stable and efficient parallel block Gram-Schmidt algorithm, 1128-1135, (1999), Springer
[205] Vuduc, R.; Demmel, J.; Yelick, K., Proc. of SciDAC 2005, J. of Physics Conference Series, OSKI: A library of automatically tuned sparse matrix kernels, (2005), Institute of Physics
[206] Vuduc, R. W., Automatic performance tuning of sparse matrix kernels, (2003), EECS Department, UC Berkeley
[207] Walker, H., Implementation of the GMRES method using Householder transformations, SIAM J. Sci. Statist. Comput., 9, 152-163, (1988) · Zbl 0698.65021
[208] Warshall, S., A theorem on Boolean matrices, J. Assoc. Comput. Mach., 9, 11-12, (1962) · Zbl 0118.33104
[209] Williams, S.; Lijewski, M.; Almgren, A.; Straalen, B. Van; Carson, E.; Knight, N.; Demmel, J., Proc. 2014 IEEE International Parallel and Distributed Processing Symposium: IPDPS ’14, s-step Krylov subspace methods as bottom solvers for geometric multigrid, (2014)
[210] Williams, S.; Oliker, L.; Vuduc, R.; Shalf, J.; Yelick, K.; Demmel, J., Optimization of sparse matrix-vector multiplication on emerging multicore platforms, Parallel Comput., 35, 178-194, (2009)
[211] Williams, V., Proc. 44th Annual Symposium on Theory of Computing: STOC 12, Multiplying matrices faster than Coppersmith-Winograd, 887-898, (2012), ACM · Zbl 1286.65056
[212] Wise, D.; Bode, A.; Ludwig, T.; Karl, W.; Wismüoller, R., Euro-Par 2000 Parallel Processing, 1900, Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free, 774-783, (2000), Springer
[213] Wolf, M. M.; Boman, E. G.; Hendrickson, B., PARA08, Trondheim, Norway, Optimizing parallel sparse matrix-vector multiplication by corner partitioning, (2008)
[214] Xia, J.; Chandrasekaran, S.; Gu, M.; Li, X. S., Fast algorithms for hierarchically semiseparable matrices, Numer. Linear Algebra Appl., 17, 953-976, (2010) · Zbl 1240.65087
[215] Yotov, K.; Roeder, T.; Pingali, K.; Gunnels, J.; Gustavson, F., Proc. 19th Annual ACM Symposium on Parallel Algorithms and Architectures: SPAA ’07, An experimental comparison of cache-oblivious and cache-conscious programs, 93-104, (2007), ACM
[216] Yzelman, A.; Bisseling, R. H., Two-dimensional cache-oblivious sparse matrix-vector multiplication, Parallel Comput., 37, 806-819, (2011)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.