zbMATH — the first resource for mathematics

Fast multipole methods on graphics processors. (English) Zbl 1147.65012
Summary: The fast multipole method (FMM) allows the rapid approximate evaluation of sums of radial basis functions. For a specified accuracy, \(\varepsilon\), the method scales as \(O(N)\) in both time and memory, compared to the \(O(N^{2})\) complexity of the direct method, which allows larger problems to be solved with given resources. Graphics processing units (GPUs) are increasingly viewed as data-parallel compute coprocessors that can provide significant computational performance at low cost. We describe acceleration of the FMM using the data-parallel GPU architecture.
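To make the \(O(N^{2})\) baseline concrete, here is a minimal direct-summation sketch for a Coulomb-type (3D Laplace) kernel, the kind of radial-basis sum the FMM accelerates. The function name and the tiny example are illustrative, not from the paper:

```python
import math

def direct_sum(sources, targets, charges):
    """O(N*M) direct evaluation of phi(y) = sum_i q_i / |y - x_i|
    for the 3D Laplace (Coulomb) kernel; this is the quadratic-cost
    baseline that the FMM reduces to O(N) for a prescribed accuracy."""
    potentials = []
    for y in targets:
        acc = 0.0
        for x, q in zip(sources, charges):
            r = math.dist(x, y)   # Euclidean distance |y - x|
            if r > 0.0:           # skip coincident (self) interactions
                acc += q / r
        potentials.append(acc)
    return potentials

# Two unit charges on the x-axis, one target point:
phi = direct_sum([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                 [(2.0, 0.0, 0.0)],
                 [1.0, 1.0])
# phi[0] = 1/2 + 1/1 = 1.5
```

The doubly nested loop is exactly the \(N^{2}\) pair count that makes the direct method infeasible for the million-source problems discussed below.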
The FMM has a complex hierarchical (adaptive) structure, which is not easily implemented on data-parallel processors. We describe strategies for parallelizing all components of the FMM, develop a model that explains the performance of the algorithm on the GPU architecture, and determine optimal settings for the FMM on the GPU. These optimal settings differ from those on conventional CPUs. Some innovations in the FMM algorithm, including the use of modified stencils, real polynomial basis functions for the Laplace kernel, and decompositions of the translation operators, are also described.
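The accuracy parameter \(\varepsilon\) enters through truncated multipole-type expansions of the Laplace kernel. A one-dimensional (collinear) special case, where the Legendre factors \(P_n(\cos\gamma)\) all equal 1, can be sketched as follows; the function name and the sample radii are illustrative assumptions, not taken from the paper:

```python
def far_field_approx(r, R, p):
    """Truncated expansion 1/(R - r) = sum_{n>=0} r**n / R**(n+1),
    valid for r < R. This is the collinear special case of the
    Laplace-kernel multipole expansion; truncating at order p leaves
    an error on the order of (r/R)**(p+1)."""
    return sum(r**n / R**(n + 1) for n in range(p + 1))

# Source at r = 1, evaluation point at R = 5 on the same axis.
exact = 1.0 / (5.0 - 1.0)
errors = [abs(far_field_approx(1.0, 5.0, p) - exact) for p in (2, 5, 10)]
# errors shrink geometrically, roughly by a factor of r/R = 1/5 per term
```

Choosing the truncation order \(p\) to meet a prescribed \(\varepsilon\) is what makes the FMM an approximate, accuracy-controlled method rather than an exact one.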
We obtain speedups of the Laplace kernel FMM on a single NVIDIA GeForce 8800 GTX GPU in the range 30-60 relative to a serial CPU FMM implementation. For a problem with a million sources, the summations involved are performed in approximately one second. This performance is equivalent to solving the same problem at a 43 teraflop rate via straightforward summation.
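The quoted effective rate can be sanity-checked with back-of-the-envelope arithmetic. The per-pair flop count below is an assumption chosen to make the figures consistent; the summary itself does not state it:

```python
N = 1_000_000          # number of sources (and targets)
pairs = N * N          # interactions in straightforward O(N^2) summation
flops_per_pair = 43    # assumed cost of one kernel evaluation (illustrative)
seconds = 1.0          # reported FMM run time on the GPU

# Effective rate the direct method would need to match the FMM's one-second run:
effective_rate = pairs * flops_per_pair / seconds  # flops per second
print(effective_rate / 1e12)  # → 43.0  (teraflop/s)
```

In other words, a \(10^{12}\)-pair direct summation at a few tens of flops per interaction, compressed into one second, corresponds to the stated tens-of-teraflops equivalent rate.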

65D15 Algorithms for approximation of functions
65B10 Numerical summation of series
65N38 Boundary element methods for boundary value problems involving PDEs
65Y05 Parallel numerical computation
65Y15 Packaged methods for numerical algorithms
65Y20 Complexity and performance of numerical algorithms
CUDA; Algorithm 719
Full Text: DOI
[1] NVIDIA CUDA Compute Unified Device Architecture Programming Guide, V. 1.0, 06/01/2007. <http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf>.
[2] Abramowitz, M.; Stegun, I.A., Handbook of mathematical functions, (1965), Dover Publications · Zbl 0515.33001
[3] Cheng, H.; Greengard, L.; Rokhlin, V., A fast adaptive multipole algorithm in three dimensions, J. comput. phys., 155, 468-498, (1999) · Zbl 0937.65126
[4] Dongarra, J.J.; Sullivan, F., The top 10 algorithms, Comput. sci. eng., 2, 22-23, (2000)
[5] Epton, M.A.; Dembart, B., Multipole translation theory for the three-dimensional Laplace and Helmholtz equations, SIAM J. sci. comput., 16, 4, 865-897, (1995) · Zbl 0852.31006
[6] D. Goeddeke, R. Strzodka, S. Turek, Accelerating double precision FEM simulations with GPUs, in: F. Hülsemann, M. Kowarschik, U. Rüde (Eds.), Frontiers in Simulation (Proceedings of ASIM 2005), SCS Publishing House, 2005, pp. 139-144.
[7] Greengard, L.; Rokhlin, V., A fast algorithm for particle simulations, J. comput. phys., 73, 325-348, (1987) · Zbl 0629.65005
[8] Greengard, L.; Gropp, W.D., A parallel version of the fast multipole method, Comput. math. app., 20, (1990) · Zbl 0715.65015
[9] Greengard, L.; Rokhlin, V., A new version of the fast multipole method for the Laplace equation in three dimensions, Acta numer., 6, 229-269, (1997) · Zbl 0889.65115
[10] Gumerov, N.A.; Duraiswami, R., Fast multipole methods for the Helmholtz equation in three dimensions, (2005), Elsevier Oxford, UK
[11] Gumerov, N.A.; Duraiswami, R., Fast multipole method for the biharmonic equation in three dimensions, J. comput. phys., 215, 1, 363-383, (2006) · Zbl 1103.65122
[12] N.A. Gumerov, R. Duraiswami, Comparison of the efficiency of translation operators used in the fast multipole method for the 3D Laplace equation, University of Maryland Technical Report UMIACS-TR-2003-28. (<http://www.cs.umd.edu/Library/TRs/CS-TR-4701/CS-TR-4701.pdf>).
[13] N.A. Gumerov, R. Duraiswami, W. Dorland, Middleware for programming NVIDIA GPUs from Fortran 9X, poster presentation at Supercomputing 2007. <http://www.umiacs.umd.edu/~ramani/pubs/Gumerov_SC07_poster.pdf>.
[14] Fukushige, T.; Makino, J.; Kawai, A., GRAPE-6A: A single-card GRAPE-6 for parallel PC-GRAPE cluster systems, PASJ: publ. astron. soc. jpn., 57, 1009-1021, (2005)
[15] T. Hamada, T. Iitaka, The Chamomile scheme: an optimized algorithm for N-body simulations on programmable graphics processing units (posted on arXiv: astro-ph/0703100v1, 6 March 2007).
[16] J.F. Leathrum, J.A. Board, The parallel fast multipole algorithm in three dimensions, Technical report, Duke University, April 1992.
[17] Makino, J.; Fukushige, T.; Koga, M.; Namura, K., GRAPE-6: massively-parallel special-purpose computer for astrophysical particle simulations, Publ. astronom. soc. jpn., 55, 1163-1187, (2001)
[18] L. Nyland, M. Harris, J. Prins, Fast N-body simulations with CUDA, in: Hubert Nguyen (Ed.), GPU Gems 3, Addison-Wesley, August 2007, ch. 31, pp. 677-695.
[19] J.P. Singh, C. Holt, J.L. Hennessy, A. Gupta, A parallel adaptive fast multipole method, in: Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, December 1993.
[20] Singh, J.P.; Hennessy, J.L.; Gupta, A., Implications of hierarchical N-body methods for multiprocessor architectures, ACM trans. comput. syst., 13, 2, 141-202, (1995)
[21] Singh, J.P.; Holt, C.; Totsuka, T.; Gupta, A.; Hennessy, J.L., Load balancing and data locality in adaptive hierarchical N-body methods: B-H, FMM, and radiosity, J. parallel distrib. comput., 27, 118-141, (1995) · Zbl 0852.70009
[22] Stone, J.E.; Phillips, J.C.; Freddolino, P.L.; Hardy, D.J.; Trabuco, L.G.; Schulten, K., Accelerating molecular modeling applications with graphics processors, J. comput. chem., 28, 2618-2640, (2007)
[23] Owens, John D.; Luebke, David; Govindaraju, Naga; Harris, Mark; Krüger, Jens; Lefohn, Aaron E.; Purcell, Timothy J., A survey of general-purpose computation on graphics hardware, Comput. graphics forum, 26, 80-113, (2007)
[24] Teng, S.-H., Provably good partitioning and load balancing algorithms for parallel adaptive N-body simulation, SIAM J. sci. comput., 19, 2, 635-656, (1998) · Zbl 0907.68219
[25] M.S. Warren, J.K. Salmon, D.J. Becker, M.P. Goda, T. Sterling, Pentium Pro inside: I. A treecode at 430 Gigaflops on ASCI Red, II. Price/performance of $50/Mflop on Loki and Hyglac, in: Proceedings of Supercomputing '97, IEEE, Los Alamitos, CA, 1997 (CD-ROM).
[26] White, C.A.; Head-Gordon, M., Rotation around the quartic angular momentum barrier in fast multipole method calculations, J. chem. phys., 105, 12, 5061-5067, (1996)
[27] M.J. Stock, A. Gharakhani. Toward efficient GPU-accelerated N-body simulations, in: 46th AIAA Aerospace Sciences Meeting and Exhibit, AIAA 2008-608, January 2008, Reno, Nevada.
[28] Portegies Zwart, S.F.; Belleman, R.G.; Geldof, P.M., High-performance direct gravitational N-body simulations on graphics processing units, New astron., 12, 8, 641-650, (2007)