zbMATH — the first resource for mathematics

GPU parameter tuning for tall and skinny dense linear least squares problems. (English) Zbl 1445.90107
Summary: Linear least squares problems (LLSPs) routinely arise in many scientific and engineering problems. One of the fastest ways to solve LLSPs involves performing calculations in parallel on graphics processing units (GPUs). However, GPU algorithms are typically designed for one GPU architecture and may be suboptimal or unusable on another GPU. To design optimal algorithms for any GPU with little need for modifying code, tuneable parameters can simplify the transition of GPU algorithms to different GPU architectures. In this paper, we investigate the benefits of using derivative-free optimization (DFO) and simulation optimization (SO) to systematically optimize tuneable parameters for a GPU or hybrid CPU/GPU LLSP solvers. Computational experiments show that both DFO and SO can be effective tools for determining optimal tuning parameters that can speed up the performance of the popular LLSP solver MAGMA by about 1.8x, compared to MAGMA’s default parameters for large tall and skinny matrices. Using DFO solvers, we were able to identify optimal parameters after enumerating an order of magnitude fewer parameter combinations than with direct enumeration. Additionally, the proposed approach is faster than a state-of-the-art autotuner and provides better tuning parameters.
90C30 Nonlinear programming
90C56 Derivative-free methods and methods using generalized derivatives
Full Text: DOI
[1] Abalenkovs, M.; Abdelfattah, A.; Dongarra, J.; Gates, M.; Haidar, A.; Kurzak, J.; Luszczek, P.; Tomov, S.; Yamazaki, I.; Yarkhan, A., Parallel programming models for dense linear algebra on heterogeneous systems, Supercomputing Front. Innov., 2, 67-86 (2015)
[2] Abramson, M.A., Audet, C., Couture, G., Dennis Jr, J.E., and Le Digabel, S., The Nomad project. Available at http://www.gerad.ca/nomad/.
[3] Adams, B.M., Bohnhoff, W.J., Dalbey, K.R., Eddy, J.P., Eldred, M.S., Gay, D.M., Haskell, K., Hough, P.D., and Swiler, L.P., DAKOTA, A Multilevel Parallel Object-Oriented Framework for Design Optimization, Parameter Estimation, Uncertainty Quantification, and Sensitivity Analysis: Version 5.2 User’s Manual. Sandia National Laboratories, Albuquerque, NM and Livermore, CA, 2011
[4] Agullo, E., Augonnet, C., Dongarra, J., Faverge, M., Ltaief, H., Thibault, S., and Tomov, S., QR factorization on a multicore node enhanced with multiple GPU accelerators. in Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, 2011, pp. 932-943.
[5] Agullo, E., Dongarra, J., Nath, R., and Tomov, S., A fully empirical autotuned dense QR factorization for multicore architectures. in European Conference on Parallel Processing, 2011, pp. 194-205.
[6] Amaran, S.; Sahinidis, N. V.; Sharda, B.; Bury, S. J., Simulation optimization: A review of algorithms and applications, Ann. Oper. Res, 240, 351-380 (2016) · Zbl 1342.90113
[7] Anderson, E.; Bai, Z.; Bischof, C.; Blackford, L. S.; Demmel, J.; Dongarra, J.; Du Croz, J.; Hammarling, S.; Greenbaum, A.; McKenney, A.; Sorensen, D., LAPACK Users’ Guide (1999), Society for Industrial and Applied Mathematics: Society for Industrial and Applied Mathematics, Philadelphia, PA, USA
[8] Anderson, M., Ballard, G., Demmel, J., and Keutzer, K., Communication-avoiding QR decomposition for GPUs. in Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, 2011, pp. 48-58.
[9] Ansel, J., Kamil, S., Veeramachaneni, K., Ragan-Kelly, J., Bosboom, J., O’Reilly, U.-M., and Amarasinghe, S., Opentuner: An extensible framework for program autotuning. 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT), 2014, pp. 303-315.
[10] Audet, C.; Dennis Jr, J. E., Mesh adaptive direct search algorithms for constrained optimization, SIAM. J. Optim., 17, 188-217 (2006) · Zbl 1112.90078
[11] Audet, C.; Orban, D., Finding optimal algorithmic parameters using derivative-free optimization, Soc. Ind. Appl. Math., 17, 642-664 (2006) · Zbl 1128.90060
[12] Barton, R. R., A State of the art Review. Proceedings of the 1994 Winter Simulation Conference, 1994, pp. 237-244.
[13] Bélisle, C. J.; Romeijn, H. E.; Smith, R. L., Hit-and-run algorithms for generating multivariate distributions, Math. Oper. Res., 18, 255-266 (1993) · Zbl 0771.60052
[14] Blackford, L. S.; Petitet, A.; Pozo, R.; Remington, K.; Whaley, R. C.; Demmel, J.; Dongarra, J.; Duff, I.; Hammarling, S.; Henry, G., An updated set of basic linear algebra subprograms (BLAS), ACM Trans. Math. Softw., 28, 135-151 (2002) · Zbl 1070.65520
[15] Booker, A. J.; Dennis Jr, J. E.; Frank, P. D.; Serafini, D. B.; Torczon, V. J.; Trosset, M. W., A rigorous framework for optimization of expensive functions by surrogates, Struct. Optim., 17, 1-13 (1999)
[16] Brent, R. P., Algorithms for Minimization without Derivatives (1973), Prentice-Hall: Prentice-Hall, Englewood Cliffs, NJ · Zbl 0245.65032
[17] Cadzow, J., Least squares, modelling, and signal processing, Digit. Signal. Process., 4, 2-20 (1994)
[18] Chang, K.-H., Stochastic nelder-mead simplex method-a new globally convergent direct search method for simulation optimization, Eur. J. Oper. Res., 220, 684-694 (2012) · Zbl 1253.90178
[19] Chang, K.-H.; Hong, L. J.; Wan, H., Stochastic trust-region response-surface method (STRONG)—A new response-surface framework for simulation optimization, INFORMS J. Comput., 25, 2, 230-243 (2013)
[20] Chen, R.; Tsai, Y.; Wang, W., Adaptive block size for dense QR factorization in hybrid CPU-GPU systems via statistical modelling, Parallel. Comput., 40, 70-85 (2014)
[21] COIN-OR Project. Derivative Free Optimization. Available at http://projects.coin-or.org/Dfo.
[22] Cozad, A.; Sahinidis, N. V.; Miller, D. C., Automatic learning of algebraic models for optimization, AIChE J., 60, 2211-2227 (2014)
[23] Csendes, T.; Pál, L.; Sendín, J. O.H.; Banga, J. R., The GLOBAL optimization method revisited, Optim. Lett., 2, 445-454 (2008) · Zbl 1160.90660
[24] Custódio, A.L. and Vicente, L.N., SID-PSM: A pattern search method guided by simplex derivatives for use in derivative-free optimization. Departamento de Matemática, Universidade de Coimbra, Coimbra, Portugal, 2008.
[25] Demmel, J., Applied Numerical Linear Algebra (1997), SIAM: SIAM, Philadelphia, PA · Zbl 0879.65017
[26] Demmel, J.; Grigori, L.; Hoemmen, M.; Langou, J., Communication-optimal parallel and sequential QR and LU factorizations, SIAM J. Sci. Comput., 34, A206-A239 (2012) · Zbl 1241.65028
[27] Eberhart, R. and Kennedy, J., A new optimizer using particle swarm theory. in Proceedings of the Sixth International Symposium on Micro Machine and Human Science, Nagoya, Japan, 1995, pp. 39-43.
[28] Friedman, J.; Hastie, T.; Tibshirani, R., Regularization paths for generalized linear models via coordinate descent, J. Stat. Softw., 33 (2010)
[29] Gilmore, P.; Kelley, C. T., An implicit filtering algorithm for optimization of functions with many local minima, SIAM J. Optim., 5, 269-285 (1995) · Zbl 0828.65064
[30] Golub, G.; Van Loan, C., Matrix Computations (2012), JHU Press: JHU Press, Baltimore, MD
[31] Hadri, B., Ltaief, H., Agullo, E., and Dongarra, J., Tile QR factorization with parallel panel processing for multicore architectures. in Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, 2010, pp. 1-10.
[32] Hansen, N., The CMA Evolution Strategy: A tutorial. Available at http://www.lri.fr/∼hansen/cmaesintro.html.
[33] Holland, J. H., Adaptation in Natural and Artificial Systems (1975), The University of Michigan Press: The University of Michigan Press, Ann Arbor, MI
[34] Holmström, K., Göran, A.O., and Edvall, M.M., User’s Guide for TOMLAB 7. Tomlab Optimization, 2010. Available at http://tomopt.com.
[35] Hooke, R.; Jeeves, T. A., Direct search solution of numerical and statistical problems, J. Assoc. Comput. Machinery, 8, 212-219 (1961) · Zbl 0111.12501
[36] Huyer, W.; Neumaier, A., Global optimization by multilevel coordinate search, J. Glob. Optim., 14, 331-355 (1999) · Zbl 0956.90045
[37] Huyer, W.; Neumaier, A., SNOBFIT-Stable noisy optimization by branch and fit, ACM Trans. Math. Softw., 35, 1-25 (2008)
[38] ICL. MAGMA, Current as of 6 July, 2017, Available at http://icl.cs.utk.edu/projectsfiles/magma/doxygen/index.html.
[39] Ingber, L., Adaptive simulated annealing (ASA): Lessons learned, Control Cybern., 25, 33-54 (1996) · Zbl 0860.93035
[40] Jia, W., Shaw, K., and Martonosi, M., Stargazer: Automated regression-based GPU design space exploration. in Performance Analysis of Systems and Software (ISPASS), 2012 IEEE International Symposium on, 2012, pp. 2-13.
[41] Jones, D. R.; Perttunen, C. D.; Stuckman, B. E., Lipschitzian optimization without the Lipschitz constant, J. Optim. Theory Appl., 79, 157-181 (1993) · Zbl 0796.49032
[42] Kelley, C.T., Users Guide for IMFIL version 1.0. Available at http://www4.ncsu.edu/∼ctk/imfil.html.
[43] Lagarias, J. C.; Reeds, J. A.; Wright, M. H.; Wright, P. E., Convergence properties of the Nelder-Mead simplex method in low dimensions, SIAM J. Optim., 9, 112-147 (1998) · Zbl 1005.90056
[44] Li, Y., Dongarra, J., and Tomov, S., A note on auto-tuning GEMM for GPUs. in International Conference on Computational Science, 2009, pp. 884-892.
[45] Luo, Y.; Duraiswami, R., Efficient parallel nonnegative least squares on multicore architectures, SIAM J. Sci. Comput., 33, 2848-2863 (2011) · Zbl 1232.65194
[46] Madougou, S.; Varbanescu, A.; de Laat, C.; van Nieuwpoort, R., The landscape of GPGPU performance modelling tools, Parallel. Comput., 56, 18-33 (2016)
[47] Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; Teller, E., Equation of state calculations by fast computing machines, J. Chem. Phys., 21, 1087-1092 (1953) · Zbl 1431.65006
[48] Nelder, J. A.; Mead, R., A simplex method for function minimization, Comput. J., 7, 308-313 (1965) · Zbl 0229.65053
[49] Neumaier, A., MCS: Global Optimization by Multilevel Coordinate Search. Available at http://www.mat.univie.ac.at/∼neum/software/mcs/. · Zbl 0956.90045
[50] NVIDIA Corporation. cuBLAS, Current as of 28 December, 2016. http://docs.nvidia.com/cuda/cublas/#axzz4SZ3ssQJO.
[51] NVIDIA Corporation. cuSolver, Current as of 28 December, 2016. Available at http://docs.nvidia.com/cuda/cusolver/#axzz4SZ3ssQJO.
[52] Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A., and Purcell, T., A survey of general-purpose computation on graphics hardware. in Computer graphics forum, (2007), pp. 80-113.
[53] Pintér, J.D., Holmström, K., Göran, A.O., and Edvall, M.M., User’s Guide for TOMLAB/LGO. Tomlab Optimization, 2006, Available at http://tomopt.com.
[54] Plantenga, T.D., HOPSPACK 2.0 User Manual, Technical Report SAND2009-6265, Sandia National Laboratories, Albuquerque, NM and Livermore, CA, 2009, Available at https://software.sandia.gov/trac/hopspack/.
[55] Powell, M.J.D., Recent research at Cambridge on radial basis functions, Technical report, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, 1998. · Zbl 0958.41501
[56] Powell, M. J.D., UOBYQA: Unconstrained optimization by quadratic approximation, Math. Program., 92, 555-582 (2002) · Zbl 1014.65050
[57] Powell, M. J.D., Developments of NEWUOA for minimization without derivatives, IMA J. Numer. Anal., 28, 649-664 (2008) · Zbl 1154.65049
[58] Powell, M.J.D., The BOBYQA algorithm for bound constrained optimization without derivatives, Technical report, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, 2009.
[59] Rios, L. M.; Sahinidis, N. V., Derivative-free optimization A review of algorithms and comparison of software implementations, J. Glob. Optim., 56, 1247-1293 (2013) · Zbl 1272.90116
[60] Schreiber, R.; Van Loan, C., A storage-efficient WY representation for products of Householder transformations, SIAM J. Sci. Comput., 10, 53-57 (1989) · Zbl 0664.65025
[61] Shubert, B. O., A sequential method seeking the global maximum of a function, SIAM J. Numer. Anal., 9, 379-388 (1972) · Zbl 0251.65052
[62] Smith, R. L., Efficient Monte Carlo procedures for generating points uniformly distributed over bounded regions, Oper. Res., 32, 1296-1308 (1984) · Zbl 0552.65004
[63] Spall, J.C., Chapter 7: Simultaneous Perturbation Stochastic Approximation. in Introduction to stochastic search and optimization: Estimation, simulation, and control. Wiley-Interscience, 2003.
[64] Tomov, S.; Dongarra, J.; Baboulin, M., Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel. Comput., 36, 232-240 (2010) · Zbl 1204.68268
[65] Torczon, V. J., On the convergence of multidirectional search algorithms, SIAM J. Optim., 1, 123-145 (1991) · Zbl 0752.90076
[66] Torczon, V. J., On the convergence of pattern search algorithms, SIAM J. Optim., 7, 1-25 (1997) · Zbl 0884.65053
[67] Vaz, A.I.F., PSwarm Home Page. Available at http://www.norg.uminho.pt/aivaz/pswarm/.
[68] Volkov, V. and Demmel, J., Benchmarking GPUs to tune dense linear algebra. in 2008 SC-International Conference for High Performance Computing, Networking, Storage and Analysis, 2008, pp. 1-11.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.