Code modernization strategies to 3-D stencil-based applications on Intel Xeon Phi: KNC and KNL. (English) Zbl 1398.65370
Summary: Partial differential equations (PDEs) are widely used to simulate many scenarios in science and engineering and are usually solved through iterative techniques (e.g., Jacobi, Gauss-Seidel). These methods produce an approximate solution to the problem based on stencil patterns of computation. The complexity, granularity and dimensionality of the problem require substantial computational resources that are beyond the reach of regular CPU-based architectures.
Emerging massively data-parallel architectures, such as Intel Xeon Phi, offer a great opportunity to address challenging problems based on PDEs. However, migrating code to these architectures is not straightforward. To carry out this code modernization cycle, it is essential to identify the key issues in the code that will determine performance in future hardware evolutions. In this paper we look for (1) scalability with core count, (2) data-parallelism exposure to exploit vectorization capabilities, and (3) data-locality aware techniques. These techniques lead to a performance gain of up to 15x for the first generation of Xeon Phi, Knights Corner (KNC), and an additional average 2.5x improvement for Knights Landing (KNL).
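This record does not reproduce the paper's kernels. As a minimal sketch of the kind of computation the summary describes, assume a 7-point Jacobi stencil on a cubic n x n x n grid stored in a flat list. The function names, the grid layout and the block sizes `by` and `bz` are illustrative assumptions, not taken from the paper. The first function is a plain sweep; the second applies loop blocking over y and z, one common data-locality technique, while keeping the unit-stride x loop innermost, which is the loop a vectorizing compiler would target in a C or Fortran version.

```python
def make_grid(n):
    """Hypothetical helper: deterministic test data for an n*n*n grid."""
    return [float((i * 7) % 13) for i in range(n ** 3)]


def jacobi_sweep(src, dst, n):
    """One 7-point Jacobi sweep over the interior points of the grid."""
    plane = n * n
    for z in range(1, n - 1):
        for y in range(1, n - 1):
            base = (z * n + y) * n
            for x in range(1, n - 1):  # unit-stride innermost loop
                i = base + x
                dst[i] = (src[i - 1] + src[i + 1] +
                          src[i - n] + src[i + n] +
                          src[i - plane] + src[i + plane]) / 6.0


def jacobi_sweep_blocked(src, dst, n, by=4, bz=4):
    """Same sweep, tiled over y and z so each tile reuses nearby planes."""
    plane = n * n
    for zz in range(1, n - 1, bz):
        for yy in range(1, n - 1, by):
            for z in range(zz, min(zz + bz, n - 1)):
                for y in range(yy, min(yy + by, n - 1)):
                    base = (z * n + y) * n
                    for x in range(1, n - 1):  # left untiled on purpose
                        i = base + x
                        dst[i] = (src[i - 1] + src[i + 1] +
                                  src[i - n] + src[i + n] +
                                  src[i - plane] + src[i + plane]) / 6.0
```

Both functions read only from `src` and write only to `dst`, so every interior point can be updated independently. That independence is what allows the outer loops to be distributed across cores and the inner loop to be vectorized. Pure Python only illustrates the loop structure; it says nothing about the speedups reported in the summary.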
65Y05 Parallel numerical computation
68M07 Mathematical problems of computer architecture
[1] Yaqoob, I.; Chang, V.; Gani, A.; Mokhtar, S.; Hashem, I. A.T.; Ahmed, E.; Anuar, N. B.; Khan, S. U., Information fusion in social big data: foundations, state-of-the-art, applications, challenges, and future research directions, Int. J. Inf. Manage., (2016)
[2] Sodani, A.; Gramunt, R.; Corbal, J.; Kim, H.-S.; Vinod, K.; Chinthamani, S.; Hutsell, S.; Agarwal, R.; Liu, Y.-C., Knights Landing: second-generation Intel Xeon Phi product, IEEE Micro, 36, 2, 34-46, (2016)
[3] Pagani, S.; Khdr, H.; Kriebel, F.; Rehman, S.; Shafique, M., Towards performance and reliability-efficient computing in the dark silicon era, (2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), (2016), IEEE), 1-6
[4] Cecilia, J. M.; García, J. M.; Nisbet, A.; Amos, M.; Ujaldón, M., Enhancing data parallelism for ant colony optimization on GPUs, J. Parallel Distrib. Comput., 73, 1, 42-51, (2013)
[5] M. Pearce, What is Code Modernization?, 2015, https://Software.Intel.Com/En-Us/Articles/What-Is-Code-Modernization.
[6] Cecilia, J. M.; Abellán, J. L.; Fernández, J.; Acacio, M. E.; García, J. M.; Ujaldón, M., Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE, J. Supercomput., 62, 2, 787-803, (2012)
[7] Jeffers, J.; Reinders, J., Intel Xeon Phi Coprocessor High Performance Programming, (2013), Morgan Kaufmann Publishers Inc Boston, MA, USA
[8] Keckler, S. W.; Dally, W. J.; Khailany, B.; Garland, M.; Glasco, D., GPUs and the future of parallel computing, IEEE Micro, 31, 5, 7-17, (2011)
[9] Reinders, J.; Jeffers, J., (High Performance Parallelism Pearls, Multicore and Many-core Programming Approaches, (2014), Morgan Kaufmann), 377-396
[10] Trottenberg, U.; Oosterlee, C. W.; Schuller, A., Multigrid, (2000), Academic Press
[11] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, K. Yelick, Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, in: Proceedings of the ACM/IEEE Conference on Supercomputing, SC’08, 2008, pp. 4:1-4:12.
[12] Frigo, M.; Strumpen, V., Cache oblivious stencil computations, (Proc. of the 19th Annual International Conference on Supercomputing, ICS’05, (2005), ACM New York, USA), 361-366
[13] Zhukov, V.; Krasnov, M.; Novikova, N.; Feodoritova, O., Multigrid effectiveness on modern computing architectures, Program. Comput. Softw., 41, 1, 14-22, (2015)
[14] Komatitsch, D.; Erlebacher, G.; Göddeke, D.; Michéa, D., High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster, J. Comput. Phys., 229, 20, 7692-7714, (2010) · Zbl 1194.86019
[15] Williams, S.; Waterman, A.; Patterson, D., Roofline: an insightful visual performance model for multicore architectures, Commun. ACM, 52, 4, 65-76, (2009)
[16] Dufort, E. C.; Frankel, S. P., Stability conditions in the numerical treatment of parabolic differential equations, (Mathematical Tables and Other Aids to Computation, Vol. 7, (1953)), 135-152 · Zbl 0053.26401
[17] Navarro, J. M.; Escolano, J.; López, J. J., Implementation and evaluation of a diffusion equation model based on finite difference schemes for sound field prediction in rooms, Appl. Acoust., 73, 6, 659-665, (2012)
[18] Golub, G.; Ortega, J. M., Scientific Computing: An Introduction with Parallel Computing, (1993), Academic Press, Inc. · Zbl 0790.65001
[19] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, S. Weeratunga, The NAS Parallel Benchmarks, Technical report, RNR-94-007, NASA Advanced Supercomputing (NAS) Division, 1994.
[20] Weiss, R. M.; Shragge, J., Solving 3D anisotropic elastic wave equations on parallel GPU devices, Geophysics, 78, 2, F7-F15, (2013)
[21] Kamil, S.; Husbands, P.; Oliker, L.; Shalf, J.; Yelick, K., Impact of modern memory subsystems on cache optimizations for stencil computations, (Proceedings of the 2005 Workshop on Memory System Performance, MSP’05, (2005), ACM New York, NY, USA), 36-43
[22] Rahman, S. M.F.; Yi, Q.; Qasem, A., Understanding stencil code performance on multicore architectures, (Proceedings of the 8th ACM International Conference on Computing Frontiers, CF’11, (2011), ACM New York, NY, USA), 30:1-30:10
[23] R. Strzodka, M. Shaheen, D. Pajak, W. Pomeranian, Impact of System and Cache Bandwidth on Stencil Computations Across Multiple Processor Generations, in: Proceedings of the Workshop on Applications for Multi- and Many-Core Processors (A4MMC) at ISCA, 2011.
[24] McCool, M.; Robison, A. D.; Reinders, J., Chapter 7 - stencil and recurrence, (Structured Parallel Programming: Patterns for Efficient Computation, (2012), Morgan Kaufmann Publishers Inc. Boston, MA, USA), 199-207
[25] J. Fang, A.L. Varbanescu, H. Sips, L. Zhang, Y. Che, C. Xu, An Empirical Study of Intel Xeon Phi, 2013, arXiv preprint arXiv:1310.5842.
[26] Hernández, M.; Imbernón, B.; Navarro, J. M.; García, J. M.; Cebrián, J. M.; Cecilia, J. M., Evaluation of the 3-D finite difference implementation of the acoustic diffusion equation model on massively parallel architectures, Comput. Electr. Eng., 46, 190-201, (2015)
[27] Kamil, S.; Datta, K.; Williams, S.; Oliker, L.; Shalf, J.; Yelick, K., Implicit and explicit optimizations for stencil computations, (Proc. of the Workshop on Memory System Performance and Correctness, MSPC’06, (2006), ACM New York, USA), 51-60
[28] de la Cruz, R.; Araya-Polo, M., Modeling stencil computations on modern HPC architectures, (5th Int. Workshop (PMBS14) Held As Part of SC14, LNCS, (2014), Springer)
[29] Stengel, H.; Treibig, J.; Hager, G.; Wellein, G., Quantifying performance bottlenecks of stencil computations using the execution-cache-memory model, (Proceedings of the 29th ACM on International Conference on Supercomputing, ICS’15, (2015), ACM New York, NY, USA), 207-216
[30] Henretty, T.; Stock, K.; Pouchet, L.-N.; Franchetti, F.; Ramanujam, J.; Sadayappan, P., Data layout transformation for stencil computations on short-vector SIMD architectures, (Compiler Construction, (2011), Springer), 225-245
[31] Treibig, J.; Hager, G., Introducing a performance model for bandwidth-limited loop kernels, (Parallel Processing and Applied Mathematics, (2010), Springer), 615-624
[32] Datta, K.; Kamil, S.; Williams, S.; Oliker, L.; Shalf, J.; Yelick, K., Optimization and performance modeling of stencil computations on modern microprocessors, SIAM Rev., 51, 1, 129-159, (2009) · Zbl 1160.65359
[33] McCool, M.; Robison, A.; Reinders, J., Structured parallel programming: patterns for efficient computation, (2012), Morgan Kaufmann Publishers Inc. Boston, MA, USA
[34] (Reinders, J.; Jeffers, J., High performance parallelism pearls: multicore and many-core programming approaches, Vol. 1, (2015), Morgan Kaufmann Publishers Inc. Boston, MA, USA), 600
[35] (Reinders, J.; Jeffers, J., High performance parallelism pearls: multicore and many-core programming approaches, Vol. 2, (2015), Morgan Kaufmann Publishers Inc. Boston, MA, USA), 592
[36] Rahman, R., Intel Xeon Phi coprocessor architecture and tools: the guide for application developers, (2013), Apress Berkeley, CA, USA
[37] (Vladimirov, A.; Asai, R.; Karpusenko, V., Parallel programming and optimization with Intel Xeon Phi coprocessors, Vol. 1, (2015), Colfax International CA, USA), 520
[38] Jeffers, J.; Reinders, J., Chapter 4 - driving around town: optimizing a real-world code example, (Intel Xeon Phi Coprocessor High Performance Programming, (2013), Morgan Kaufmann Publishers Inc. Boston, MA, USA), 83-106
[39] Andreolli, C.; Thierry, P.; Borges, L.; Skinner, G.; Yount, C., Chapter 23 - characterization and optimization methodology applied to stencil computations, (Reinders, J.; Jeffers, J., High Performance Parallelism Pearls: Multicore and Many-Core Programming Approaches, Vol. 1, (2015), Morgan Kaufmann Boston, MA, USA), 377-396
[40] Mucci, P. J.; Browne, S.; Deane, C.; Ho, G., PAPI: A portable interface to hardware performance counters, (Proc. of HPCMP Users Group Conf., 1999, (1999)), 7-10
[41] J.M. Cebrian, L. Natvig, Temperature effects on on-chip energy measurements, in: Proceedings IGCC 2013, Arlington, USA, 2013, pp. 1-6.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.