
Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs. (English) Zbl 1493.65282

Summary: This paper presents the implementation, performance, and energy consumption of accurate and mixed-precision linear algebra kernels on graphics processing units (GPUs): the inner product (DOT), dense matrix-vector multiplication (GEMV), dense matrix multiplication (GEMM), and sparse matrix-vector multiplication (SpMV) in the compressed sparse row (CSR) format (CSRMV). Our implementation follows a mixed-precision design in which internal floating-point operations are carried out with at least twice the precision of the input and output data: binary32 data are processed in binary64, and binary64 data are processed at twice the working precision using the accurate inner-product algorithm Dot2. Our highly optimized implementations achieve performance close to the theoretical upper bound. From an evaluation on the Titan V, a Volta-architecture GPU, we make the following observations. Because Dot2 requires 11 times as many binary64 instructions as a standard dot product, GEMM incurs a correspondingly large overhead in both execution time and energy consumption compared with the standard binary64 implementation. In contrast, the accuracy of DOT, GEMV, and CSRMV is improved with very little execution-time overhead and at most roughly 30% additional energy consumption.
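
The Dot2 algorithm of Ogita, Rump and Oishi, used here for binary64 inputs, evaluates a dot product as if in roughly twice the working precision by tracking the rounding error of every product and partial sum with error-free transformations (an FMA-based TwoProd and Knuth's TwoSum). The following minimal C sketch illustrates the idea; the helper names and the toy data are illustrative assumptions, not the paper's GPU code.

/* Sketch of Dot2 [Ogita, Rump, Oishi 2005] in plain C99. */
#include <math.h>
#include <stdio.h>

/* Error-free product: a*b = *p + *e exactly (requires a hardware FMA). */
static void two_prod(double a, double b, double *p, double *e) {
    *p = a * b;
    *e = fma(a, b, -*p);            /* exact rounding error of the product */
}

/* Error-free sum (Knuth): a+b = *s + *e exactly, for any a, b. */
static void two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double v = *s - a;
    *e = (a - v) + (b - (*s - v));  /* exact rounding error of the sum */
}

/* Dot2: dot product evaluated in ~2-fold precision, rounded once at the end. */
double dot2(const double *x, const double *y, int n) {
    double p, s, h, r, q;
    two_prod(x[0], y[0], &p, &s);
    for (int i = 1; i < n; i++) {
        two_prod(x[i], y[i], &h, &r);
        two_sum(p, h, &p, &q);
        s += q + r;                 /* accumulate the error terms */
    }
    return p + s;                   /* single final rounding to binary64 */
}

int main(void) {
    /* Ill-conditioned toy data: a naive left-to-right binary64 dot
       product returns 0.0 here, while Dot2 recovers the exact value 1. */
    double x[3] = {1e16, 1.0, -1e16};
    double y[3] = {1.0, 1.0, 1.0};
    printf("dot2 = %.17g\n", dot2(x, y, 3));
    return 0;
}

Per element, this scheme costs one TwoProd (2 instructions), one TwoSum (6 instructions), and two extra additions in place of the single FMA of a standard dot product, which is consistent with the roughly 11-fold binary64 instruction count cited in the summary.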

MSC:

65Y10 Numerical algorithms for specific classes of architectures
65Fxx Numerical linear algebra
