
An efficient time-step-based self-adaptive algorithm for predictor-corrector methods of Runge-Kutta type. (English) Zbl 1228.65112

Summary: Finding an efficient implementation variant for the numerical solution of problems from computational science and engineering involves many implementation decisions that are strongly influenced by the specific hardware architecture. The complexity of these architectures makes it difficult to find the best implementation variant by manual tuning. For numerical solution methods from linear algebra, auto-tuning techniques based on a global search engine, such as those used for ATLAS or FFTW, can be applied successfully. These techniques generate different implementation variants at installation time and select one of them either at installation time or at runtime, before the computation starts. For some numerical methods, however, auto-tuning at installation time cannot be applied directly, since the best implementation variant may depend strongly on the specific numerical problem to be solved.
One example is solution methods for initial value problems (IVPs) of ordinary differential equations (ODEs), where the coupling structure of the ODE system to be solved has a large influence on how efficiently the memory hierarchy of the hardware architecture can be used. In this context, it is important to apply auto-tuning techniques at runtime, which is possible because of the time-stepping nature of ODE solvers.
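The role of time stepping in runtime auto-tuning can be illustrated with a minimal sketch. It is not the solver described in this record: the explicit Euler step stands in for the actual numerical scheme, and all names (`euler_step_loop`, `euler_step_zip`, `self_adaptive_solve`, the `timer` parameter) are hypothetical. The idea shown is only that each candidate variant is timed on one of the first time steps while still advancing the solution, after which the fastest measured variant is used for the remaining steps.

```python
import time


def euler_step_loop(f, t, y, h):
    """Variant A (placeholder scheme): index-based loop over the components."""
    dy = f(t, y)
    return [y[i] + h * dy[i] for i in range(len(y))]


def euler_step_zip(f, t, y, h):
    """Variant B: same arithmetic as variant A, different loop structure."""
    return [yi + h * di for yi, di in zip(y, f(t, y))]


def self_adaptive_solve(f, t0, y0, h, n_steps, variants, timer=time.perf_counter):
    """Advance n_steps; time one variant per early step, then keep the fastest.

    `variants` is a list of (name, step_function) pairs. The timed steps are
    ordinary time steps, so the tuning phase also advances the solution.
    """
    lookup = dict(variants)
    t, y = t0, list(y0)
    timings = {}
    chosen = None
    for step in range(n_steps):
        if step < len(variants):
            # Tuning phase: run the next untested variant and record its time.
            name, step_fn = variants[step]
            start = timer()
            y = step_fn(f, t, y, h)
            timings[name] = timer() - start
        else:
            # Production phase: fix the fastest measured variant once.
            if chosen is None:
                chosen = min(timings, key=timings.get)
            y = lookup[chosen](f, t, y, h)
        t += h
    if chosen is None and timings:
        chosen = min(timings, key=timings.get)
    return y, timings, chosen


if __name__ == "__main__":
    decay = lambda t, y: [-v for v in y]  # y' = -y, an assumed test problem
    pool = [("loop", euler_step_loop), ("zip", euler_step_zip)]
    y_end, measured, best = self_adaptive_solve(decay, 0.0, [1.0], 0.1, 10, pool)
    print(y_end, best)
```

A single timed step per variant is a noisy measurement; a more careful version would presumably average over several steps per variant before committing.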
In this article, we present a sequential self-adaptive ODE solver that selects the best implementation variant from a candidate pool at runtime during the first time steps, i.e., the auto-tuning phase already contributes to the progress of the computation. The implementation variants differ in the loop structure and the data structures used to realize the numerical algorithm; a predictor-corrector (PC) iteration scheme with a Runge-Kutta (RK) corrector is considered here as an example. For those implementation variants in the candidate pool that use loop tiling to exploit the memory hierarchy of a given hardware platform, we investigate the selection of tile sizes. The self-adaptive ODE solver combines empirical search with a model-based approach in order to reduce the search space of possible tile sizes. Runtime experiments demonstrate the efficiency of the self-adaptive solver for different IVPs across a range of problem sizes and on different hardware architectures.
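The combination of a model-based restriction with empirical search for tile sizes can be sketched as follows. The summary does not state the model used in the article, so the working-set rule below (a tile must fit into a cache of a given size) is an assumption, as are the power-of-two candidates and all names (`candidate_tile_sizes`, `tiled_axpy`, `pick_tile_size`). The tiled vector update is a placeholder kernel, not the PC/RK iteration itself.

```python
import time


def candidate_tile_sizes(n, cache_bytes, vectors_touched, bytes_per_value=8, smallest=16):
    """Assumed model: keep power-of-two tile sizes whose working set fits the cache.

    The working set of one tile is estimated as
    tile * vectors_touched * bytes_per_value bytes.
    """
    limit = min(n, cache_bytes // (vectors_touched * bytes_per_value))
    sizes = []
    size = smallest
    while size <= limit:
        sizes.append(size)
        size *= 2
    # Fall back to a single candidate if even the smallest tile is too large.
    return sizes or [max(1, limit)]


def tiled_axpy(a, x, y, tile):
    """Placeholder tiled kernel: out[i] = a * x[i] + y[i], processed tile by tile."""
    out = [0.0] * len(x)
    for start in range(0, len(x), tile):
        for i in range(start, min(start + tile, len(x))):
            out[i] = a * x[i] + y[i]
    return out


def pick_tile_size(kernel, sizes, timer=time.perf_counter):
    """Empirical search over the reduced candidate set: time each size once."""
    best, best_time = None, float("inf")
    for tile in sizes:
        start = timer()
        kernel(tile)
        elapsed = timer() - start
        if elapsed < best_time:
            best, best_time = tile, elapsed
    return best


if __name__ == "__main__":
    n = 100_000
    x, y = [1.0] * n, [2.0] * n
    sizes = candidate_tile_sizes(n, cache_bytes=32 * 1024, vectors_touched=3)
    print(sizes, pick_tile_size(lambda tile: tiled_axpy(0.5, x, y, tile), sizes))
```

In pure Python the interpreter overhead hides most cache effects, so the timings here only demonstrate the control flow; the pruning step is what keeps the number of timed candidates small.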

MSC:

65L06 Multistep, Runge-Kutta and extrapolation methods for ordinary differential equations
65L05 Numerical methods for initial value problems involving ordinary differential equations
34A34 Nonlinear ordinary differential equations and systems
65Y05 Parallel numerical computation
65Y15 Packaged methods for numerical algorithms
