ViennaCL-linear algebra library for multi- and many-core architectures. (English) Zbl 1349.65740


65Y15 Packaged methods for numerical algorithms
65F10 Iterative numerical methods for linear systems
65F50 Computational methods for sparse matrices
65Y10 Numerical algorithms for specific classes of architectures


[1] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects, J. Phys. Conf. Ser., 180 (2009), 012037.
[2] J. I. Aliaga, J. Pérez, E. S. Quintana-Ortí, and H. Anzt, Reformulated conjugate gradient for the energy-aware solution of linear systems on GPUs, in Proceedings of the International Conference on Parallel Processing, IEEE, Piscataway, NJ, 2013, pp. 320–329.
[3] J. I. Aliaga, J. Pérez, and E. S. Quintana-Ortí, Systematic fusion of CUDA kernels for iterative sparse linear system solvers, in Euro-Par 2015: Parallel Processing, Lecture Notes in Comput. Sci. 9233, Springer, Heidelberg, 2015, pp. 675–686.
[4] H. Anzt, W. Sawyer, S. Tomov, P. Luszczek, I. Yamazaki, and J. Dongarra, Optimizing Krylov subspace solvers on graphics processing units, in IEEE International Parallel and Distributed Processing Symposium Workshops, IEEE, Piscataway, NJ, 2014, pp. 941–949.
[5] A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, and P. Sadayappan, Fast sparse matrix-vector multiplication on GPUs for graph applications, in International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, Piscataway, NJ, 2014, pp. 781–792.
[6] W. Bangerth and T. Heister, What makes computational open source software libraries successful?, Comput. Sci. Discov., 6 (2013), 015010/1.
[7] W. Bangerth and T. Heister, Quo Vadis, Scientific Software?, SIAM News, 47 (2014), 1.
[8] M. M. Baskaran and R. Bordawekar, Optimizing sparse matrix-vector multiplication on GPUs, Technical report RC24704, IBM 2008.
[9] N. Bell, S. Dalton, and L. N. Olson, Exposing fine-grained parallelism in algebraic multigrid methods, SIAM J. Sci. Comput., 34 (2012), pp. C123–C152. · Zbl 1253.65041
[10] N. Bell and M. Garland, Implementing sparse matrix-vector multiplication on throughput-oriented processors, in International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, New York, 2009, 18.
[11] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, Cilk: An efficient multithreaded runtime system, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, New York, 1995, pp. 207–216.
[12] J. Brown, M. G. Knepley, and B. F. Smith, Run-time extensibility and librarization of simulation software, Comput. Sci. Eng., 17 (2015), pp. 38–45.
[13] J. Brown, HPGMG: Benchmarking computers using multigrid, Copper Mountain Multigrid Conference 2015, (2015).
[14] A. Buluç and J. R. Gilbert, Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments, SIAM J. Sci. Comput., 34 (2012), pp. C170–C191.
[15] E. Chow, H. Anzt, and J. Dongarra, Asynchronous iterative algorithm for computing incomplete factorizations on GPUs, in High Performance Computing, Lecture Notes in Comput. Sci. 9137, Springer, Cham, 2015, pp. 1–16.
[16] E. Chow and A. Patel, Fine-grained parallel incomplete LU factorization, SIAM J. Sci. Comput., 37 (2015), pp. C169–C193. · Zbl 1320.65048
[17] L. Dagum and R. Menon, OpenMP: An industry standard API for shared-memory programming, IEEE Comput. Sci. Eng., 5 (1998), pp. 46–55.
[18] S. Dalton, N. Bell, L. Olson, and M. Garland, Cusp: Generic parallel algorithms for sparse matrix and graph computations, Version 0.5.1, (2014).
[19] T. A. Davis and Y. Hu, The University of Florida sparse matrix collection, ACM Trans. Math. Software, 38 (2011), pp. 1:1–1:25. · Zbl 1365.65123
[20] D. Demidov, K. Ahnert, K. Rupp, and P. Gottschling, Programming CUDA and OpenCL: A case study using modern C++ libraries, SIAM J. Sci. Comput., 35 (2013), pp. C453–C472. · Zbl 1311.65179
[21] Message Passing Forum, MPI: A Message-Passing Interface Standard, Technical report, University of Tennessee, Knoxville, TN, 1994.
[22] R. Gandham, K. Esler, and Y. Zhang, A GPU accelerated aggregation algebraic multigrid method, Comp. Math. Appl., 68 (2014), pp. 1151–1160. · Zbl 1367.65049
[23] P. Ghysels, T. J. Ashby, K. Meerbergen, and W. Vanroose, Hiding global communication latency in the GMRES algorithm on massively parallel machines, SIAM J. Sci. Comput., 35 (2013), pp. C48–C71. · Zbl 1273.65050
[24] P. Ghysels and W. Vanroose, Hiding global synchronization latency in the preconditioned conjugate gradient algorithm, Parallel Comput., 40 (2014), pp. 224–238.
[25] J. R. Gilbert, V. B. Shah, and S. Reinhardt, A unified framework for numerical and combinatorial computing, Comput. Sci. Eng., 10 (2008), pp. 20–25.
[26] P. Gottschling and C. Steinhardt, Meta-Tuning in MTL4, International Conference of Numerical Analysis and Applied Mathematics, AIP Conf. Proc. 1281 (2010), pp. 778–782.
[27] J. L. Greathouse and M. Daga, Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format, in International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, Piscataway, NJ, 2014, pp. 769–780.
[28] F. Gremse, A. Höfter, L. O. Schwen, F. Kiessling, and U. Naumann, GPU-accelerated sparse matrix-matrix multiplication by iterative row merging, SIAM J. Sci. Comput., 37 (2015), pp. C54–C71. · Zbl 1327.65090
[30] V. Heuveline, D. Lukarski, and J.-Ph. Weiss, Enhanced parallel ILU(p)-based preconditioners for multi-core CPUs and GPUs – The power(q)-pattern method, Preprint Series of the Engineering Mathematics and Computing Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany, 2011.
[31] J. Hoberock and N. Bell, Thrust: A Parallel Template Library, (2010).
[32] T. Hoefler, J. Dinan, D. Buntinas, P. Balaji, B. Barrett, R. Brightwell, W. Gropp, V. Kale, and R. Thakur, MPI + MPI: A new hybrid approach to parallel programming with MPI plus shared memory, Computing, 95 (2013), pp. 1121–1136.
[33] J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation, Addison-Wesley, Upper Saddle River, NJ, 2010.
[34] K. Iglberger, G. Hager, J. Treibig, and U. Rüde, Expression templates revisited: A performance analysis of current methodologies, SIAM J. Sci. Comput., 34 (2012), pp. C42–C69.
[35] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units, SIAM J. Sci. Comput., 36 (2014), pp. C401–C423. · Zbl 1307.65055
[36] X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey, Efficient sparse matrix-vector multiplication on x86-based many-core processors, in International Conference on Supercomputing, ACM, New York, 2013, pp. 273–282.
[37] R. Li and Y. Saad, GPU-accelerated preconditioned iterative linear solvers, J. Supercomput., 63 (2013), pp. 443–466.
[38] J. D. McCalpin, Memory bandwidth and machine balance in current high performance computers, Computer Society Technical Committee on Computer Architecture Newsletter, (1995), pp. 19–25.
[39] J. Nickolls, I. Buck, M. Garland, and K. Skadron, Scalable parallel programming with CUDA, Queue, 6 (2008), pp. 40–53.
[40] G. Penn, Efficient transitive closure of sparse matrices over closed semirings, Theoret. Comput. Sci., 354 (2006), pp. 72–81. · Zbl 1088.68042
[41] J. Reinders, Intel Threading Building Blocks, O’Reilly, Beijing, 2007.
[42] K. Rupp, F. Rudolf, J. Weinbub, A. Morhammer, T. Grasser, and A. Jüngel, Optimized Sparse Matrix-Matrix Multiplication for Multi-Core CPUs, GPUs, and Xeon Phi, manuscript.
[43] K. Rupp, F. Rudolf, and J. Weinbub, ViennaCL - A high level linear algebra library for GPUs and multi-core CPUs, in International Workshop on GPUs and Scientific Applications, 2010, pp. 51–56.
[44] K. Rupp, Ph. Tillet, F. Rudolf, J. Weinbub, T. Grasser, and A. Jüngel, Performance portability study of linear algebra kernels in OpenCL, in Proceedings of the International Workshop on OpenCL 2013-2014, IWOCL ’14, ACM, New York, 2014, p. 8.
[45] K. Rupp, J. Weinbub, T. Grasser, and A. Jüngel, Pipelined iterative solvers with kernel fusion for graphics processing units, ACM Trans. Math. Software, 43 (2016), 11. · Zbl 1369.65055
[46] K. Rupp, J. Weinbub, F. Rudolf, A. Morhammer, T. Grasser, and A. Jüngel, A Performance Comparison of Algebraic Multigrid Preconditioners on CPUs, GPUs, and Xeon Phis, manuscript.
[47] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., SIAM, Philadelphia, 2003. · Zbl 1031.65046
[48] C. Sanderson, Armadillo: An Open Source C++ Linear Algebra Library for Fast Prototyping and Computationally Intensive Experiments, Technical report, NICTA, 2010.
[49] E. Saule, K. Kaya, and Ü. Çatalyürek, Performance evaluation of sparse matrix multiplication kernels on Intel Xeon Phi, in Parallel Processing and Applied Mathematics, Lecture Notes in Comput. Sci. 8384, Springer, Heidelberg, 2014, pp. 559–570.
[50] B. Schäling, The Boost C++ Libraries, XML Press, Laguna Hills, CA, 2011.
[51] J. Schöberl, NETGEN an advancing front 2D/3D-mesh generator based on abstract rules, Comput. Vis. Sci., 1 (1997), pp. 41–52.
[52] J. E. Stone, D. Gohara, and G. Shi, OpenCL: A parallel programming standard for heterogeneous computing systems, IEEE Des. Test, 12 (2010), pp. 66–73.
[53] B.-Y. Su and K. Keutzer, clSpMV: A cross-platform OpenCL SpMV framework on GPUs, in Proceedings of the ACM International Conference on Supercomputing, ICS ’12, ACM, New York, 2012, pp. 353–364.
[54] Ph. Tillet, K. Rupp, S. Selberherr, and C.-T. Lin, Towards performance-portable, scalable, and convenient linear algebra, in 5th USENIX Workshop on Hot Topics in Parallelism (HotPar’13), USENIX, Berkeley, CA 2013.
[55] N. Trost, J. Jiménez, D. Lukarski, and V. Sanchez, Accelerating COBAYA3 on multi-core CPU and GPU systems using PARALUTION, Ann. Nucl. Energy, 82 (2015), pp. 252–259.
[56] U. Trottenberg, C. W. Oosterlee, and A. Schüller, Multigrid, Academic, San Diego, CA, 2001.
[57] D. Vandevoorde and N. M. Josuttis, C++ Templates, Addison-Wesley, Berlin, 2002.
[58] S. Van Dongen, Graph clustering via a discrete uncoupling process, SIAM J. Matrix Anal. Appl., 30 (2008), pp. 121–141. · Zbl 1161.68041
[59] T. Veldhuizen, Expression Templates, C++ Rep., 7 (1995), pp. 26–31.
[60] S. Yan, Ch. Li, Y. Zhang, and H. Zhou, yaSpMV: Yet another SpMV framework on GPUs, in Proc. ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, PPoPP ’14, ACM, New York, 2014, pp. 107–118.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.