×

Accelerated Thomas solver for (quasi-)block-tridiagonal linear algebraic equation systems, using SSE/AVX instruction sets for vectorizing dense block operations. (English) Zbl 1359.65037

Summary: Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) are additional processor instruction sets available in contemporary personal computers, designed for vectorized floating point calculations. Unfortunately, in order to utilize the advantages of these instructions, one cannot rely on automatic options of high level language compilers. Instead, handwritten assembly language or intrinsic function call insertions are necessary. By using this idea an accelerated C++ code is devised, for solving (quasi-) block-tridiagonal linear algebraic equation systems by means of an extended Thomas algorithm. Speedups reaching 3.5 and 3 (relative to C++ without using SSE/AVX) are demonstrated for single and double precision calculations, respectively.

MSC:

65F05 Direct numerical methods for linear systems and matrix inversion
65Y05 Parallel numerical computation

Software:

SPIKE; Emmerald
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Aberdeen, D. and Baxter, J. [2001] ” Emmerald: A fast matrix-matrix multiply using Intel’s SSE instructions,” Concurr. Comput.: Pract. Exp.13, 103-119. genRefLink(16, ’S021987621750027XBIB001’, ’10.1002 · Zbl 1008.65504
[2] Anderson, D. V., Fry, A. R., Gruber, R. and Roy, A. [1989] ” Gigaflop speed algorithm for the direct solution of large block-tridiagonal systems in 3-D physics applications,” Comput. Phys.3, 33-41. genRefLink(16, ’S021987621750027XBIB002’, ’10.1063
[3] Ascher, U. M., Mattheij, R. M. M. and Russell, R. D. [1995] Numerical Solution of Boundary Value Problems for Ordinary Differential Equations (SIAM, Philadelphia). genRefLink(16, ’S021987621750027XBIB003’, ’10.1137
[4] Baronas, R., Ivanauskas, F. and Kulys, J. [2010] Mathematical Modeling of Biosensors (Springer, Dordrecht, The Netherlands). genRefLink(16, ’S021987621750027XBIB004’, ’10.1007
[5] Benkert, K. and Fischer, R. [2007] ”An efficient implementation of the Thomas-algorithm for block penta-diagonal systems on vector computers,” Lect. Notes Comput. Sci. Vol. 4487, pp. 144-151, in: Y. Shi, G. D. van Albada, J. Dongarra and P. M. A. Sloot (eds.), Computational Science, Proc. of the 7th International Conference ICCS 2007, Beijing, Part 1, Springer-Berlin.
[6] Bieniasz, L. K. [2001] ” Extension of the Thomas algorithm to a class of algebraic linear equation systems involving quasi-block-tridiagonal matrices with isolated block-pentadiagonal rows, assuming variable block dimensions,” Computing67, 269-285. genRefLink(16, ’S021987621750027XBIB006’, ’10.1007
[7] Bieniasz, L. K. [2003a] ” High order accurate, one-sided finite-difference approximations to concentration gradients at the boundaries, for the simulation of electrochemical reaction-diffusion problems in one-dimensional space geometry,” Comput. Biol. Chem.27, 315-325. genRefLink(16, ’S021987621750027XBIB007’, ’10.1016 · Zbl 1048.92037
[8] Bieniasz, L. K. [2003b] ” Erratum to: Extension of the Thomas algorithm to a class of algebraic linear equation systems involving quasi-block-tridiagonal matrices with isolated block-pentadiagonal rows, assuming variable block dimensions,” Computing70, 275. genRefLink(128, ’S021987621750027XBIB008’, ’000183778500005’); · Zbl 1027.65037
[9] Bieniasz, L. K. [2007] ” A fourth-order accurate, three-point compact approximation of the boundary gradient, for electrochemical kinetic simulations by the extended Numerov method,” Electrochim. Acta52, 2203-2209. genRefLink(16, ’S021987621750027XBIB009’, ’10.1016
[10] Britz, D. [1987] ” Investigation of the relative merits of some n-point current approximations in digital simulation. Application to an improved implicit algorithm for quasi-reversible systems,” Anal. Chim. Acta193, 277-285. genRefLink(16, ’S021987621750027XBIB010’, ’10.1016
[11] Britz, D. [2005] Digital Simulation in Electrochemistry (Springer, Berlin). genRefLink(16, ’S021987621750027XBIB011’, ’10.1007
[12] Bryant, R. E. and O’Halloran, D. R. [2011] Computer Systems, A Programmer’s Perspective (Prentice Hall, Boston).
[13] Bryant, R. E. and O’Halloran, D. R. [2012] ”CS: APP2e web aside MEM:BLOCKING: Using blocking to increase temporal locality,” Available at http://csapp.cs.cmu.edu/2e/waside.html [accessed on 5 January 2016].
[14] Gentzsch, W. [1984] Vectorization of Computer Programs With Applications to Computational Fluid Dynamics (Vieweg, Braunschweig). genRefLink(16, ’S021987621750027XBIB014’, ’10.1007
[15] Intel Corporation [1999] ”Application note AP-931, Streaming SIMD Extensions – LU decomposition,” Available at http://download.intel.com/design/PentiumIII/sml/24504601.pdf [accessed on 5 January 2016].
[16] Intel Corporation [2015] ”Intel intrinsic guide,” Available at https://software.intel.com/sites/landingpage/IntrinsicsGuide/, [accessed on 5 January 2016].
[17] Jain, M. K. [1984] Numerical Solution of Differential Equations (Wiley, New Delhi). · Zbl 0536.65004
[18] Lee, J. and Wright, J. C. [2014] ” A block-tridiagonal solver with two-level parallelization for finite element-spectral codes,” Comput. Phys. Commun.185, 2598-2608. genRefLink(16, ’S021987621750027XBIB018’, ’10.1016
[19] Park, N., Liu, W., Prasanna, V. K. and Raghavendra, C. [2000] ” Efficient matrix multiplication using cache conscious data layouts,” HPCMO User Group Conf. 2000. Available at: [halcyon.usc.edu/ k/prasannawebsite/papers/parkHPCMO00.pdf]
[20] Petersen, D. E., Sørensen, H. H. B., Hansen, P. C., Skelboe, S. and Stokbro, K. [2008] ” Block tridiagonal matrix inversion and fast transmission calculations,” J. Comput. Phys.227, 3174-3190. genRefLink(16, ’S021987621750027XBIB020’, ’10.1016 · Zbl 1135.65318
[21] Polizzi, E. and Sameh, A. H. [2006] ” A parallel hybrid banded system solver: The SPIKE algorithm,” Parallel Comput.32, 177-194. genRefLink(16, ’S021987621750027XBIB021’, ’10.1016
[22] Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. [1996] Numerical Recipes in C: The Art of Scientific Computing (Cambridge University Press, Cambridge). · Zbl 0892.65001
[23] Raman, S. K., Pentkovski, V. and Keshava, J. [2000] ” Implementing Streaming SIMD Extensions on the Pentium III processor,” IEEE MICRO20, 47-57. genRefLink(16, ’S021987621750027XBIB023’, ’10.1109
[24] Smith, G. D. [1985] Numerical Solution of Partial Differential Equations, Finite Difference Methods (Clarendon Press, Oxford). · Zbl 0576.65089
[25] Takahashi, A., Soliman, M. and Sedukhin, S. [2003] ”Parallel LU-decomposition on Pentium Streaming SIMD Extensions,” Lect. Notes Comput. Sci., Vol. 2858, pp. 423-430, in: A. Veidenbaum, K. Joe, H. Amano and H. Aiso (eds.), High Performance Computing, Proc. of the 5th International Symposium, ISHPC 2003, Springer, Berlin.
[26] Thomas, L. H. [1949] Elliptic Problems in Linear Difference Equations over a Network, Watson Scientific Computing Laboratory Report, Columbia University, New York.
[27] Wang, Y., Baboulin, M., Dongarra, J., Falcou, J., Fraigneau, Y. and Le Maǐtre, O. [2013] ” A parallel solver for incompressible fluid flows,” Procedia Comput. Sci.18, 439-448. genRefLink(16, ’S021987621750027XBIB027’, ’10.1016
[28] Yukhin, K. [2014] ”Intel Advanced Vector Extensions, 2015/2016 support in GNU compiler collection,” GNU Tools Cauldron, Available at https://gcc.gnu.org/wiki/cauldron2014?action=AttachFile&do=get&target=Cauldron14_AVX-512_Vector_ISA_Kir ill_Yukhin_20140711.pdf [accessed on 5 January 2016].
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.