
zbMATH — the first resource for mathematics

Implementing high-performance complex matrix multiplication via the 1M method. (English) Zbl 07271860
MSC:
65Y04 Numerical algorithms for computer arithmetic, etc.
65-04 Software, source code, etc. for problems pertaining to numerical analysis
References:
[1] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software, 16 (1990), pp. 1-17. · Zbl 0900.65115
[2] G. Frison, D. Kouzoupis, T. Sartor, A. Zanelli, and M. Diehl, BLASFEO: Basic linear algebra subroutines for embedded optimization, ACM Trans. Math. Software, 44 (2018), 42, https://doi.org/10.1145/3210754. · Zbl 07003067
[3] K. Goto and R. A. van de Geijn, Anatomy of high-performance matrix multiplication, ACM Trans. Math. Software, 34 (2008), 12, https://doi.org/10.1145/1356052.1356053. · Zbl 1190.65064
[4] K. Goto and R. A. van de Geijn, High-performance implementation of the level-3 BLAS, ACM Trans. Math. Software, 35 (2008), 4, https://doi.org/10.1145/1377603.1377607.
[5] J. A. Gunnels, G. M. Henry, and R. A. van de Geijn, A family of high-performance matrix multiplication algorithms, in Proceedings of the International Conference on Computational Sciences-Part I (ICCS ’01), Springer-Verlag, Berlin, Heidelberg, 2001, pp. 51-60, https://doi.org/10.1007/3-540-45545-0_15. · Zbl 0982.68505
[6] N. J. Higham, Stability of a method for multiplying complex matrices with three real matrix multiplications, SIAM J. Matrix Anal. Appl., 13 (1992), pp. 681-687, https://doi.org/10.1137/0613043. · Zbl 0777.65027
[7] J. Huang, Practical Fast Matrix Multiplication Algorithms, Ph.D. thesis, The University of Texas at Austin, Austin, TX, 2018.
[8] J. Huang, D. A. Matthews, and R. A. van de Geijn, Strassen’s algorithm for tensor contraction, SIAM J. Sci. Comput., 40 (2018), pp. C305-C326, https://doi.org/10.1137/17M1135578. · Zbl 1416.65117
[9] J. Huang, L. Rice, D. A. Matthews, and R. A. van de Geijn, Generating families of practical fast matrix multiplication algorithms, in Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), 2017, pp. 656-667, https://doi.org/10.1109/IPDPS.2017.56.
[10] J. Huang, T. M. Smith, G. M. Henry, and R. A. van de Geijn, Strassen’s algorithm reloaded, in Proceedings of the IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’16), Piscataway, NJ, 2016, 59, https://doi.org/10.1109/SC.2016.58.
[11] J. Huang, C. D. Yu, and R. A. van de Geijn, Implementing Strassen’s algorithm with CUTLASS on NVIDIA Volta GPUs, FLAME Working Note #88, TR-18-08, Department of Computer Science, The University of Texas at Austin, Austin, TX, 2018, https://apps.cs.utexas.edu/apps/sites/default/files/tech_reports/GPUStrassen.pdf.
[12] Intel, Math Kernel Library, https://software.intel.com/en-us/mkl, 2019.
[13] Intel Corporation, Intel 64 and IA-32 Architectures Optimization Reference Manual, no. 248966-033, June 2016.
[14] Intel Corporation, Intel Xeon Processor E5 v3 Product Family: Processor Specification Update, no. 330785-010US, September 2016.
[15] T. M. Low, F. D. Igual, T. M. Smith, and E. S. Quintana-Ortí, Analytical modeling is enough for high-performance BLIS, ACM Trans. Math. Software, 43 (2016), 12, https://doi.org/10.1145/2925987. · Zbl 1369.65200
[16] OpenBLAS, http://xianyi.github.com/OpenBLAS/, 2019.
[17] D. T. Popovici, F. Franchetti, and T. M. Low, Mixed data layout kernels for vectorized complex arithmetic, in Proceedings of the 2017 IEEE High Performance Extreme Computing Conference (HPEC), 2017, pp. 1-7, https://doi.org/10.1109/HPEC.2017.8091024.
[18] T. M. Smith, R. A. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee, Anatomy of high-performance many-threaded matrix multiplication, in Proceedings of the 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS’14), Washington, D.C., 2014, pp. 1049-1059, https://doi.org/10.1109/IPDPS.2014.110.
[19] F. G. Van Zee, Inducing Complex Matrix Multiplication via the 1M Method, FLAME Working Note #85, TR-17-03, Department of Computer Science, The University of Texas at Austin, Austin, TX, 2017.
[20] F. G. Van Zee, T. Smith, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. Gunnels, T. M. Low, B. Marker, L. Killough, and R. A. van de Geijn, The BLIS framework: Experiments in portability, ACM Trans. Math. Software, 42 (2016), 12, https://doi.org/10.1145/2755561.
[21] F. G. Van Zee and T. M. Smith, Implementing high-performance complex matrix multiplication via the 3M and 4M methods, ACM Trans. Math. Software, 44 (2017), 7. · Zbl 06920069
[22] F. G. Van Zee and R. A. van de Geijn, BLIS: A framework for rapidly instantiating BLAS functionality, ACM Trans. Math. Software, 41 (2015), 14, https://doi.org/10.1145/2764454. · Zbl 1347.65054
[23] R. C. Whaley, A. Petitet, and J. J. Dongarra, Automated empirical optimization of software and the ATLAS project, Parallel Comput., 27 (2001), pp. 3-35, https://doi.org/10.1016/S0167-8191(00)00087-9. · Zbl 0971.68033
[24] C. D. Yu, J. Huang, W. Austin, B. Xiao, and G. Biros, Performance optimization for the k-nearest neighbors kernel on x86 architectures, in Proceedings of the ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15), New York, NY, 2015, 7, https://doi.org/10.1145/2807591.2807601.