zbMATH — the first resource for mathematics

Parallel matrix multiplication: a systematic journey. (English) Zbl 1355.65060
65F30 Other matrix algorithms (MSC2010)
65Y05 Parallel numerical computation
65Y20 Complexity and performance of numerical algorithms
Full Text: DOI
[1] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar, A three-dimensional approach to parallel matrix multiplication, IBM J. Res. Develop., 39 (1995), pp. 575–582.
[2] R. C. Agarwal, F. Gustavson, and M. Zubair, A high-performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication, IBM J. Res. Develop., 38 (1994), pp. 673–681.
[3] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users’ Guide, 3rd ed., Software Environments Tools 9, SIAM, Philadelphia, 1999, .
[4] G. Ballard, Avoiding Communication in Dense Linear Algebra, Ph.D. thesis, EECS Department, University of California, Berkeley, 2013.
[5] G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz, Communication lower bounds and optimal algorithms for numerical linear algebra, Acta Numer., 23 (2014), pp. 1–155.
[6] R. H. Bisseling, Parallel iterative solution of sparse linear systems on a transputer network, in Parallel Computation, The Institute of Mathematics and Its Applications Conference 46, A. E. Fincham and B. Ford, eds., Oxford University Press, Oxford, UK, 1993, pp. 253–271. · Zbl 1253.65209
[7] R. H. Bisseling and W. F. McColl, Scientific computing on bulk synchronous parallel architectures, in Technology and Foundations: Information Processing ’94, Vol. I, IFIP Transactions A 51, B. Pehrson and I. Simon, eds., Elsevier Science Publishers, Amsterdam, 1994, pp. 509–514.
[8] J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D. Weathersby, Efficient algorithms for all-to-all communications in multi-port message-passing systems, IEEE Trans. Parallel Distrib. Syst., 8 (1997), pp. 1143-1156.
[9] L. E. Cannon, A Cellular Computer to Implement the Kalman Filter Algorithm, Ph.D. thesis, Montana State University, Bozeman, MT, 1969.
[10] E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn, Collective communication: Theory, practice, and experience, Concurrency and Computation: Practice and Experience, 19 (2007), pp. 1749–1783.
[11] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker, ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers, in Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, IEEE Press, Piscataway, NJ, 1992, pp. 120–127.
[12] J. Choi, D. W. Walker, and J. J. Dongarra, PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers, Concurrency and Computation: Practice and Experience, 6 (1994), pp. 543–570.
[13] M. Christ, J. Demmel, N. Knight, T. Scanlon, and K. Yelick, Communication Lower Bounds and Optimal Algorithms for Programs That Reference Arrays—Part \textup1, preprint, 2013, .
[14] C. Edwards, P. Geng, A. Patra, and R. van de Geijn, Parallel Matrix Distributions: Have We Been Doing It All Wrong?, Technical Report TR-95-40, Department of Computer Sciences, The University of Texas at Austin, Austin, TX, 1995.
[15] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors. Vol. \textupI, Prentice Hall, Englewood Cliffs, NJ, 1988.
[16] K. Goto and R. A. van de Geijn, Anatomy of high-performance matrix multiplication, ACM Trans. Math. Softw., 34 (2008), 12. · Zbl 1190.65064
[17] J. Gunnels, C. Lin, G. Morrow, and R. van de Geijn, A flexible class of parallel matrix multiplication algorithms, in Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing (IPPS/SPDP 1998), IEEE Press, Piscataway, NJ, 1998, pp. 110–116.
[18] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn, FLAME: Formal linear algebra methods environment, ACM Trans. Math. Softw., 27 (2001), pp. 422–455. · Zbl 1070.65522
[19] B. Hendrickson, R. Leland, and S. Plimpton, An efficient parallel algorithm for matrix-vector multiplication, Int. J. High Speed Comput., 7 (1995), pp. 73–88.
[20] B. A. Hendrickson and D. E. Womble, The torus-wrap mapping for dense matrix calculations on massively parallel computers, SIAM J. Sci. Comput., 15 (1994), pp. 1201–1226, . · Zbl 0808.65148
[21] S. Huss-Lederman, E. Jacobson, and A. Tsao, Comparison of scalable parallel matrix multiplication libraries, in Proceedings of the Scalable Parallel Libraries Conference, IEEE Press, Piscataway, NJ, 1993, pp. 142–149.
[22] S. Huss-Lederman, E. Jacobson, A. Tsao, and G. Zhang, Matrix multiplication on the Intel Touchstone DELTA, Concurrency and Computation: Practice and Experience, 6 (1994), pp. 571–594.
[23] D. Irony, S. Toledo, and A. Tiskin, Communication lower bounds for distributed-memory matrix multiplication, J. Parallel Distrib. Comput., 64 (2004), pp. 1017–1026. · Zbl 1114.68081
[24] J. G. Lewis and R. A. van de Geijn, Implementing matrix-vector multiplication and conjugate gradient algorithms on distributed memory multicomputers, in Proceedings of the Scalable High-Performance Computing Conference, IEEE Press, Piscataway, NJ, 1994, pp. 542–550.
[25] J. Li, A. Skjellum, and R. D. Falgout, A poly-algorithm for parallel dense matrix multiplication on two-dimensional process grid topologies, Concurrency and Computation: Practice and Experience, 9 (1997), pp. 345–389.
[26] L. H. Loomis and H. Whitney, An inequality related to the isoperimetric inequality, Bull. Amer. Math. Soc., 55 (1949), pp. 961–962. · Zbl 0035.38302
[27] W. F. McColl and A. Tiskin, Memory-efficient matrix multiplication in the BSP model, Algorithmica, 24 (1999), pp. 287–297. · Zbl 0943.68066
[28] J. Poulson, B. Marker, R. A. van de Geijn, J. R. Hammond, and N. A. Romero, Elemental: A new framework for distributed memory dense matrix computations, ACM Trans. Math. Softw., 39 (2013), 13. · Zbl 1295.65137
[29] J. Poulson, R. van de Geijn, and J. Bennighof, Parallel Algorithms for Reducing the Generalized Hermitian-Definite Eigenvalue Problem, FLAME Working Note 56, Technical Report TR-11-05, Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX, 2011.
[30] M. D. Schatz, Anatomy of Parallel Computation with Tensors, Technical Report TR-13-21, Department of Computer Science, The University of Texas at Austin, Austin, TX, 2013.
[31] E. Solomonik and J. Demmel, Communication-optimal parallel \textup2.5d matrix multiplication and lu factorization algorithms, in Proceedings of the 17th International Conference on Parallel Processing (Euro-Par 2011), Part II, Lecture Notes in Comput. Sci. 6853, Springer, Berlin, 2011, pp. 90–109.
[32] G. W. Stewart, Communication and matrix computations on large message passing systems, Parallel Comput., 16 (1990), pp. 27–40. · Zbl 0713.65106
[33] R. van de Geijn and J. Watts, SUMMA: Scalable universal matrix multiplication algorithm, Concurrency and Computation: Practice and Experience, 9 (1997), pp. 255–274.
[34] R. A. van de Geijn, Using PLAPACK: Parallel Linear Algebra Package, The MIT Press, Cambridge, MA, 1997.
[35] F. G. Van Zee, \textitlibflame: The Complete Reference, Lulu Press, Raleigh, NC, 2009.
[36] F. G. Van Zee, E. Chan, R. A. van de Geijn, E. S. Quintana-Ortí, and G. Quintana-Ortí, The libflame library for dense matrix computations, IEEE Comput. Sci. Engrg., 11 (2009), pp. 56–63.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.