
High-performance tensor contraction without transposition. (English) Zbl 1379.65024


MSC:

65F30 Other matrix algorithms (MSC2010)
15A69 Multilinear algebra, tensor calculus
65Y20 Complexity and performance of numerical algorithms

References:

[1] E. Aprà, M. Klemm, and K. Kowalski, Efficient implementation of many-body quantum chemical methods on the Intel® Xeon Phi coprocessor, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’14), IEEE Press, Piscataway, NJ, 2014, pp. 674–684.
[2] B. W. Bader and T. G. Kolda, Algorithm 862: MATLAB tensor classes for fast algorithm prototyping, ACM Trans. Math. Software, 32 (2006), pp. 635–653. · Zbl 1230.65054
[3] R. J. Bartlett and M. Musiał, Coupled-cluster theory in quantum chemistry, Rev. Mod. Phys., 79 (2007), pp. 291–352.
[4] G. Belter, E. R. Jessup, I. Karlin, and J. G. Siek, Automating the generation of composed linear algebra kernels, in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC ’09), Association for Computing Machinery, New York, 2009, 59.
[5] J. A. Calvin, C. A. Lewis, and E. F. Valeev, Scalable task-based algorithm for multiplication of block-rank-sparse matrices, in Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (IA\(^3\) ’15), Association for Computing Machinery, New York, 2015, 4.
[6] J. A. Calvin and E. F. Valeev, Task-Based Algorithm for Matrix Multiplication: A Step Towards Block-Sparse Tensor Computing, preprint, [cs.DC], 2015.
[7] E. Di Napoli, D. Fabregat-Traver, G. Quintana-Orti, and P. Bientinesi, Towards an efficient use of the BLAS library for multilinear tensor contractions, Appl. Math. Comput., 235 (2014), pp. 454–468. · Zbl 1336.65076
[8] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software, 16 (1990), pp. 1–17. · Zbl 0900.65115
[9] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, An extended set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Software, 14 (1988), pp. 1–17. · Zbl 0639.65016
[10] E. Epifanovsky, M. Wormit, T. Kus, A. Landau, D. Zuev, K. Khistyaev, P. Manohar, I. Kaliman, A. Dreuw, and A. I. Krylov, New implementation of high-level correlated methods using a general block-tensor library for high-performance electronic structure calculations, J. Comput. Chem., 34 (2013), pp. 2293–2309.
[11] K. Goto and R. A. van de Geijn, Anatomy of high-performance matrix multiplication, ACM Trans. Math. Software, 34 (2008), 12. · Zbl 1190.65064
[12] K. Goto and R. A. van de Geijn, High-performance implementation of the level-3 BLAS, ACM Trans. Math. Software, 35 (2008), 4.
[13] G. Guennebaud, B. Jacob, et al., Eigen v3, 2010.
[14] M. Hanrath and A. Engels-Putzka, An efficient matrix-matrix multiplication based antisymmetric tensor contraction engine for general order coupled cluster, J. Chem. Phys., 133 (2010), 064108.
[15] A. Hartono, Q. Lu, T. Henretty, S. Krishnamoorthy, H. Zhang, G. Baumgartner, D. E. Bernholdt, M. Nooijen, R. Pitzer, J. Ramanujam, and P. Sadayappan, Performance optimization of tensor contraction expressions for many-body methods in quantum chemistry, J. Phys. Chem. A, 113 (2009), pp. 12715–12723.
[16] S. Hirata, Tensor contraction engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories, J. Phys. Chem. A, 107 (2003), pp. 9887–9897.
[17] J. Huang, T. M. Smith, G. M. Henry, and R. A. van de Geijn, Strassen’s algorithm reloaded, in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC ’16), Association for Computing Machinery, New York, 2016, 59.
[18] T. Kolda and B. Bader, Tensor decompositions and applications, SIAM Rev., 51 (2009), pp. 455–500.
[19] P. M. Kroonenberg, Applied Multiway Data Analysis, John Wiley & Sons, Hoboken, NJ, 2008. · Zbl 1160.62002
[20] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic linear algebra subprograms for FORTRAN usage, ACM Trans. Math. Software, 5 (1979), pp. 308–323. · Zbl 0412.65022
[21] J. Li, C. Battaglino, I. Perros, J. Sun, and R. Vuduc, An input-adaptive and in-place approach to dense tensor-times-matrix multiply, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’15), Association for Computing Machinery, New York, 2015, 76.
[22] D. I. Lyakh, An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU, Comput. Phys. Comm., 189 (2015), pp. 84–91.
[23] W. Ma, S. Krishnamoorthy, O. Villa, K. Kowalski, and G. Agrawal, Optimizing tensor contraction expressions for hybrid CPU-GPU execution, Cluster Comput., 16 (2011), pp. 131–155.
[24] B. Marker, D. Batory, and R. A. van de Geijn, A case study in mechanically deriving dense linear algebra code, Int. J. High Perform. Comput. Appl., 27 (2013), pp. 440–453.
[25] D. A. Matthews and J. F. Stanton, Non-orthogonal spin-adaptation of coupled cluster methods: A new implementation of methods including quadruple excitations, J. Chem. Phys., 142 (2015), 064108.
[26] E. Peise, D. Fabregat-Traver, and P. Bientinesi, On the performance prediction of BLAS-based tensor contractions, in High Performance Computing Systems: Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science 8966, S. A. Jarvis, S. A. Wright, and S. D. Hammond, eds., Springer, Cham, 2014, pp. 193–212.
[27] K. Raghavachari, G. W. Trucks, J. A. Pople, and M. Head-Gordon, A fifth-order perturbation comparison of electron correlation theories, Chem. Phys. Lett., 157 (1989), pp. 479–483.
[28] A. Smilde, R. Bro, and P. Geladi, Multi-way Analysis: Applications in the Chemical Sciences, John Wiley & Sons, Chichester, UK, 2005.
[29] T. M. Smith, R. A. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee, Anatomy of high-performance many-threaded matrix multiplication, in Proceedings of the 28th IEEE Parallel and Distributed Processing Symposium, 2014, pp. 1049–1059.
[30] E. Solomonik, D. Matthews, J. R. Hammond, J. F. Stanton, and J. Demmel, A massively parallel tensor contraction framework for coupled-cluster computations, J. Parallel Distrib. Comput., 74 (2014), pp. 3176–3190.
[31] P. Springer and P. Bientinesi, Tensor Contraction Benchmark v0.1, 2016.
[32] P. Springer and P. Bientinesi, Design of a high-performance GEMM-like tensor-tensor multiplication, preprint, [cs.MS], 2016. · Zbl 1484.65092
[33] P. Springer, J. R. Hammond, and P. Bientinesi, TTC: A high-performance compiler for tensor transpositions, ACM Trans. Math. Software, 44 (2017), 15. · Zbl 1484.68045
[34] J. F. Stanton, J. Gauss, J. D. Watts, and R. J. Bartlett, A direct product decomposition approach for symmetry exploitation in many-body methods. I. Energy calculations, J. Chem. Phys., 94 (1991), pp. 4334–4345.
[35] S. van der Walt, S. C. Colbert, and G. Varoquaux, The NumPy array: A structure for efficient numerical computation, Comput. Sci. Eng., 13 (2011), pp. 22–30.
[36] F. G. Van Zee, T. M. Smith, B. Marker, T. M. Low, R. A. van de Geijn, F. D. Igual, M. Smelyanskiy, X. Zhang, V. A. Kistler, J. A. Gunnels, and L. Killough, The BLIS framework: Experiments in portability, ACM Trans. Math. Software, 42 (2016), 12.
[37] F. G. Van Zee and R. A. van de Geijn, BLIS: A framework for rapidly instantiating BLAS functionality, ACM Trans. Math. Software, 41 (2015), 14. · Zbl 1347.65054
[38] M. A. O. Vasilescu and D. Terzopoulos, Multilinear analysis of image ensembles: TensorFaces, in Computer Vision—ECCV 2002, Lecture Notes in Computer Science 2350, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, eds., Springer, Berlin, 2002, pp. 447–460. · Zbl 1034.68693
[39] T. L. Veldhuizen, Arrays in Blitz++, in Computing in Object-Oriented Parallel Environments, Lecture Notes in Computer Science 1505, D. Caromel, R. R. Oldehoeft, and M. Tholburn, eds., Springer, Berlin, 1998, pp. 223–230.
[40] Q. Wang, X. Zhang, Y. Zhang, and Q. Yi, AUGEM: Automatically generate high performance dense linear algebra kernels on x86 CPUs, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’13), Association for Computing Machinery, New York, 2013, 25.
[41] C. D. Yu, J. Huang, W. Austin, B. Xiao, and G. Biros, Performance optimization for the k-nearest neighbors kernel on x86 architectures, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’15), Association for Computing Machinery, New York, 2015, 7.
[42] X. Zhang, Q. Wang, and Y. Zhang, Model-driven level 3 BLAS performance optimization on Loongson 3A processor, in Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2012, pp. 684–691.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.