
Heterogeneous programming with single operation multiple data. (English) Zbl 1410.68075

Summary: Heterogeneity is omnipresent in today’s commodity computing systems, which comprise at least one Central Processing Unit (CPU) and one Graphics Processing Unit (GPU). Nonetheless, all this computing power goes largely unharnessed in mainstream computing, because programming these systems requires mastering many details of the underlying architectures and execution models. Current research on parallel programming addresses these issues, but the system’s heterogeneity is still exposed at the language level. This paper proposes a uniform framework, grounded on the Single Operation Multiple Data (SOMD) model, for programming such heterogeneous systems. We designed a simple extension of the Java programming language that embodies the model, and developed a compiler that generates code for both multi-core CPUs and GPUs. A performance evaluation attests that, despite being based on a simple programming model, the approach delivers performance gains on par with hand-tuned data-parallel multi-threaded Java applications.
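The paper's concrete SOMD syntax is not reproduced in this summary, so the following is only a rough illustration of the underlying idea, i.e. a single method invocation applied in parallel across the elements of its input data, written here with standard Java parallel streams rather than the authors' language extension (the class and method names are hypothetical):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Illustrative sketch only: SOMD lets one method call be executed in
// parallel over partitions of its data. Here the "single operation"
// is scale(), and scaleAll() plays the role of the SOMD-style
// invocation, using the JDK's fork/join-backed parallel streams.
public class SomdSketch {
    // The single operation applied to each data element.
    static int scale(int x) {
        return 2 * x;
    }

    // Apply the operation across the whole data set in parallel.
    static int[] scaleAll(int[] data) {
        return IntStream.range(0, data.length)
                        .parallel()
                        .map(i -> scale(data[i]))
                        .toArray();
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3, 4};
        System.out.println(Arrays.toString(scaleAll(in))); // prints [2, 4, 6, 8]
    }
}
```

In the paper's actual framework, the compiler would additionally be able to target a GPU backend for such a call; the stream version above only exercises the multi-core CPU path.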

MSC:

68N15 Theory of programming languages
68N19 Other programming paradigms (object-oriented, sequential, concurrent, automatic, etc.)
68N20 Theory of compilers and interpreters

References:

[1] Thompson, C. J.; Hahn, S.; Oskin, M., Using modern graphics architectures for general-purpose computing: a framework and analysis, (Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 2002, (2002), ACM/IEEE), 306-317
[2] Cunningham, D.; Bordawekar, R.; Saraswat, V., GPU programming in a high level language: compiling X10 to CUDA, (Proceedings of the 2011 ACM SIGPLAN X10 Workshop, X10 ’11, (2011), ACM), 8:1-8:10
[3] Sidelnik, A.; Maleki, S.; Chamberlain, B. L.; Garzarán, M. J.; Padua, D. A., Performance portability with the Chapel language, (26th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, (2012), IEEE Computer Society), 582-594
[4] Huynh, H. P.; Hagiescu, A.; Wong, W.-F.; Goh, R. S.M., Scalable framework for mapping streaming applications onto multi-GPU systems, (Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP’12, (2012), ACM), 1-10
[5] Dubach, C.; Cheng, P.; Rabbah, R. M.; Bacon, D. F.; Fink, S. J., Compiling a high-level language for GPUs (via language support for architectures and compilers), (Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’12, (2012), ACM), 1-12
[6] Marques, E.; Paulino, H., Single operation multiple data: data parallelism at method level, (14th IEEE International Conference on High Performance Computing & Communication, HPCC 2012, (2012), IEEE Computer Society), 254-261
[7] Dagum, L.; Menon, R., OpenMP: an industry-standard API for shared-memory programming, Comput. Sci. Eng., 5, 1, 46-55, (1998)
[8] OpenACC, The OpenACC application programming interface (version 1.0), (2011)
[9] Blumofe, R. D.; Joerg, C. F.; Kuszmaul, B. C.; Leiserson, C. E.; Randall, K. H.; Zhou, Y., Cilk: an efficient multithreaded runtime system, (Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 1995, (1995), ACM), 207-216
[10] Intel® Corporation; Intel® Cilk™ Plus, (July 2013)
[11] Reinders, J., Intel threading building blocks, (2007), O’Reilly & Associates, Inc.
[12] Silva, F.; Paulino, H.; Lopes, L., di_pSystem: a parallel programming system for distributed memory architectures, (Proceedings of the 6th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, (1999), Springer-Verlag), 525-532
[13] UPC Consortium, UPC language specifications, v1.2, (2005), Lawrence Berkeley National Lab, Tech. Rep. LBNL-59208
[14] Numrich, R. W.; Reid, J., Co-array Fortran for parallel programming, SIGPLAN Fortran Forum, 17, 2, 1-31, (1998)
[15] Charles, P.; Grothoff, C.; Saraswat, V.; Donawa, C.; Kielstra, A.; Ebcioglu, K.; von Praun, C.; Sarkar, V., X10: an object-oriented approach to non-uniform cluster computing, SIGPLAN Not., 40, 10, 519-538, (2005)
[16] Callahan, D.; Chamberlain, B. L.; Zima, H. P., The cascade high productivity language, (9th International Workshop on High-Level Programming Models and Supportive Environments, HIPS 2004, Santa Fe, NM, USA, 26 April 2004, (2004), IEEE Computer Society), 52-60
[17] Chamberlain, B. L.; Choi, S.-E.; Lewis, E. C.; Snyder, L.; Weathersby, W. D.; Lin, C., The case for high-level parallel programming in ZPL, Comput. Sci. Eng., 5, 3, 76-86, (1998)
[18] Dean, J.; Ghemawat, S., MapReduce: simplified data processing on large clusters, Commun. ACM, 51, 1, 107-113, (2008)
[19] Ranger, C.; Raghuraman, R.; Penmetsa, A.; Bradski, G. R.; Kozyrakis, C., Evaluating MapReduce for multi-core and multiprocessor systems, (13th International Conference on High-Performance Computer Architecture, HPCA-13 2007, Phoenix, AZ, USA, 10-14 February 2007, (2007), IEEE Computer Society), 13-24
[20] Fang, W.; He, B.; Luo, Q.; Govindaraju, N. K., Mars: accelerating MapReduce with graphics processors, IEEE Trans. Parallel Distrib. Syst., 22, 4, 608-620, (2011)
[21] Zaharia, M.; Konwinski, A.; Joseph, A. D.; Katz, R. H.; Stoica, I., Improving MapReduce performance in heterogeneous environments, (Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, San Diego, CA, USA, December 8-10, 2008, (2008), USENIX Association), 29-42
[22] Munshi, A., The OpenCL specification, (2009), Khronos OpenCL Working Group
[23] NVIDIA Corporation; NVIDIA CUDA, (July 2013)
[24] Kegel, P.; Steuwer, M.; Gorlatch, S., dOpenCL: towards a uniform programming approach for distributed heterogeneous multi-/many-core systems, (26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, IPDPS 2012, Shanghai, China, May 21-25, 2012, (2012), IEEE Computer Society), 174-186
[25] Smith, L. A.; Bull, J. M.; Obdrzálek, J., A parallel Java grande benchmark suite, (Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, Denver, CO, USA, November 10-16, 2001, (2001), ACM), 8
[26] Quintin, J.-N.; Wagner, F., Hierarchical work-stealing, (D’Ambra, P.; Guarracino, M. R.; Talia, D., Euro-Par 2010 - Parallel Processing, Proceedings of the 16th International Euro-Par Conference, Part I, Ischia, Italy, August 31 - September 3, 2010, Lect. Notes Comput. Sci., vol. 6271, (2010), Springer), 217-229
[27] Fatahalian, K.; Horn, D. R.; Knight, T. J.; Leem, L.; Houston, M.; Park, J. Y.; Erez, M.; Ren, M.; Aiken, A.; Dally, W. J.; Hanrahan, P., Sequoia: programming the memory hierarchy, (Proceedings of the 2006 ACM/IEEE conference on Supercomputing, SC ’06, (2006), ACM New York, NY, USA)
[28] Bonachea, D., GASNet specification, v1.1, (2002), Tech. Rep., Berkeley, CA, USA
[29] Saramago, J.; Mourão, D.; Paulino, H., Towards an adaptable middleware for parallel computing in heterogeneous environments, (2012 IEEE International Conference on Cluster Computing Workshops, CLUSTER Workshops 2012, Beijing, China, September 24-28, 2012, (2012), IEEE), 143-151
[30] Spafford, K.; Meredith, J. S.; Vetter, J. S., Maestro: data orchestration and tuning for OpenCL devices, (Proceedings of the 16th International Euro-Par Conference on Parallel Processing, Euro-Par’10, Ischia, Italy, August 31 - September 3, 2010, (2010), Springer), 275-286
[31] Nystrom, N.; Clarkson, M. R.; Myers, A. C., Polyglot: an extensible compiler framework for Java, (Compiler Construction, Proceedings of the 12th International Conference, CC 2003, Warsaw, Poland, April 7-11, 2003, Lect. Notes Comput. Sci., vol. 2622, (2003), Springer), 138-152 · Zbl 1032.68925
[32] Bastoul, C., Code generation in the polyhedral model is easier than you think, (13th International Conference on Parallel Architectures and Compilation Techniques, PACT 2004, Antibes Juan-les-Pins, France, 29 September - 3 October 2004, (2004), IEEE Computer Society), 7-16
[33] Aparapi, API for data-parallel Java, (2013)
[34] Pratt-Szeliga, P.; Fawcett, J.; Welch, R., Rootbeer: seamlessly using GPUs from Java, (14th IEEE International Conference on High Performance Computing & Communication, HPCC 2012, Liverpool, UK, June 25-27, 2012, (2012), IEEE Computer Society), 375-380
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.