Reliable generation of high-performance matrix algebra. (English) Zbl 1347.68066


68N20 Theory of compilers and interpreters
65-04 Software, source code, etc. for problems pertaining to numerical analysis
65Fxx Numerical linear algebra
65Y15 Packaged methods for numerical algorithms
65Y20 Complexity and performance of numerical algorithms
Full Text: DOI arXiv


[1] G. Almasi, L. D. Rose, J. Moreira, and D. Padua. 2003. Programming for locality and parallelism with hierarchically tiled arrays. In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing. 162–176. · Zbl 1099.68549
[2] S. Amarasinghe, D. Campbell, W. Carlson, et al. 2009. Exascale software study: Software challenges in extreme scale systems. Tech. Rep., DARPA IPTO, Air Force Research Labs.
[3] W. K. Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. 1999. Achieving high sustained performance in an unstructured mesh CFD application. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (CDROM) (Supercomputing’99). ACM, New York.
[4] P. Balaprakash, S. Wild, and P. Hovland. 2011. Can search algorithms save large-scale automatic performance tuning? Procedia Comput. Sci. CS 4, 2136–2145.
[5] G. Baumgartner, A. Auer, D. E. Bernholdt, et al. 2005. Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models. Proc. IEEE 93, 2, 276–292.
[6] G. Belter, E. R. Jessup, I. Karlin, and J. G. Siek. 2009. Automating the generation of composed linear algebra kernels. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC’09). ACM, New York, 1–12.
[7] G. Belter, J. G. Siek, I. Karlin, and E. R. Jessup. 2010. Automatic generation of tiled and parallel linear algebra routines. In Proceedings of the 5th International Workshop on Automatic Performance Tuning (iWAPT’10), 1–15.
[8] J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. 1997. Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology. In Proceedings of the 11th International Conference on Supercomputing (ICS’97). ACM, New York, 340–347.
[9] L. S. Blackford, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Remington, and R. C. Whaley. 2002. An updated set of Basic Linear Algebra Subprograms (BLAS). ACM Trans. Math. Softw. 28, 2, 135–151. · Zbl 1070.65520
[10] U. Bondhugula, O. Gunluk, S. Dash, and L. Renganarayanan. 2010. A model for fusion and code motion in an automatic parallelizing compiler. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10). ACM, New York, 343–352.
[11] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008. Pluto: A practical and fully automatic polyhedral program optimization system. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’08), 101–113.
[12] C. Chen, J. Chame, and M. Hall. 2008. CHiLL: A framework for composing high-level loop transformations. Tech. Rep. 08-897, Department of Computer Science, University of Southern California.
[13] L. Dagum and R. Menon. 1998. OpenMP: An industry-standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5, 1, 46–55. · Zbl 05091767
[14] A. Darte and G. Huard. 2000. Loop shifting for loop parallelization. Tech. Rep. 2000-22, Ecole Normale Superieure de Lyon. · Zbl 1054.68537
[15] J. J. Dongarra, J. D. Croz, S. Hammarling, and I. Duff. 1990. A set of level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Softw. 16, 1, 1–17. · Zbl 0900.65115
[16] J. J. Dongarra, J. D. Croz, S. Hammarling, and R. J. Hanson. 1988. An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Softw. 14, 1, 1–17. · Zbl 0639.65016
[17] K. Goto and R. Van De Geijn. 2008. High-performance implementation of the level-3 BLAS. ACM Trans. Math. Softw. (TOMS) 35, 1, 4.
[18] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. 2001. FLAME: Formal linear algebra methods environment. ACM Trans. Math. Softw. 27, 4, 422–455. · Zbl 1070.65522
[19] A. Hartono, B. Norris, and P. Sadayappan. 2009. Annotation-based empirical performance tuning using Orio. In Proceedings of the IEEE International Symposium on Parallel & Distributed Processing (IPDPS’09). IEEE Computer Society, Los Alamitos, CA, 1–11. (Also available as Preprint ANL/MCS-P1556-1008).
[20] G. W. Howell, J. W. Demmel, C. T. Fulton, S. Hammarling, and K. Marmol. 2008. Cache efficient bidiagonalization using BLAS 2.5 operators. ACM Trans. Math. Softw. 34, 14:1–14:33. · Zbl 1190.65056
[21] Intel. 2012. Intel Composer. http://software.intel.com/en-us/articles/intel-compilers.
[22] I. Karlin, E. Jessup, G. Belter, and J. G. Siek. 2011. Parallel memory prediction for fused linear algebra kernels. SIGMETRICS Perform. Eval. Rev. 38, 43–49. · Zbl 05891894
[23] I. Karlin, E. Jessup, and E. Silkensen. 2012. Modeling the memory and performance impacts of loop fusion. J. Computat. Sci. 3, 3, 120–126.
[24] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. 1979. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw. 5, 3, 308–323. · Zbl 0412.65022
[25] B. Marker, D. Batory, and R. van de Geijn. 2013. A case study in mechanically deriving dense linear algebra code. Int. J. High Perf. Comput. Appl. 27, 4, 439–452.
[26] N. Megiddo and V. Sarkar. 1997. Optimal weighted loop fusion for parallel programs. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’97). ACM, New York, 282–291.
[27] M. Mitchell. 1998. An Introduction to Genetic Algorithms. The MIT Press. · Zbl 0906.68113
[28] F. Mueller. 1999. Pthreads library interface. Tech. rep., Florida State University.
[29] Netlib BLAS. 2013. Netlib BLAS. http://www.netlib.org/blas/.
[30] Portland Group. 2012. Portland group compiler. http://www.pgroup.com.
[31] L.-N. Pouchet, U. Bondhugula, C. Bastoul, A. Cohen, J. Ramanujam, and P. Sadayappan. 2010. Combined iterative and model-driven optimization in an automatic parallelization framework. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10). IEEE Computer Society, Los Alamitos, CA, 1–11.
[32] M. Pueschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. 2005. SPIRAL: Code generation for DSP transforms. Proc. IEEE, (special issue on Program Generation, Optimization, and Adaptation ) 93, 2, 232–275.
[33] A. Qasem, G. Jin, and J. Mellor-Crummey. 2003. Improving performance with integrated program transformations. Tech. Rep. TR03-419, Department of Computer Science, Rice University. October.
[34] A. Qasem, K. Kennedy, and J. Mellor-Crummey. 2006. Automatic tuning of whole applications using direct search and a performance-based transformation system. J. Supercomput: (Special Issue on Computer Science Research Supporting High-Performance Applications) 36, 9, 183–196.
[35] J. G. Siek. 1999. A modern framework for portable high performance numerical linear algebra. M.S. thesis, University of Notre Dame. · Zbl 0943.65037
[36] J. G. Siek, I. Karlin, and E. R. Jessup. 2008. Build to order linear algebra kernels. In Proceedings of the Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL’08).
[37] A. Tiwari, C. Chen, J. Chame, M. Hall, and J. K. Hollingsworth. 2009. A scalable autotuning framework for compiler optimization. In Proceedings of the 23rd IEEE International Parallel & Distributed Processing Symposium.
[38] F. G. Van Zee, R. A. van de Geijn, G. Quintana-Ortí, and G. J. Elizondo. 2012. Families of algorithms for reducing a matrix to condensed form. ACM Trans. Math. Softw. 39, 1, 2:1–2:32. · Zbl 1295.65052
[39] R. Vuduc, J. W. Demmel, and J. A. Bilmes. 2004. Statistical models for empirical search-based performance tuning. Int. J. High Perf. Comput. Appl. 18, 1, 65–94. · Zbl 0982.68776
[40] R. C. Whaley and J. J. Dongarra. 1998. Automatically tuned linear algebra software. In Proceedings of the ACM/IEEE Conference on Supercomputing (CDROM) (Supercomputing’98). IEEE Computer Society, Los Alamitos, CA, 1–27.
[41] Q. Yi and A. Qasem. 2008. Exploring the optimization space of dense linear algebra kernels. In Proceedings of the 21st International Workshop on Languages and Compilers for Parallel Computing (LCPC’08). Springer-Verlag, Berlin, 343–355. · Zbl 05380689
[42] Q. Yi, K. Seymour, H. You, R. Vuduc, and D. Quinlan 2007. POET: Parameterized optimizations for empirical tuning. In Proceedings of the Parallel and Distributed Processing Symposium, 2007. IEEE, 1–8.
[43] Y. Zhao, Q. Yi, K. Kennedy, D. Quinlan, and R. Vuduc. 2005. Parameterizing loop fusion for automated empirical tuning. Tech. Rep. UCRL-TR-217808, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.