A bridging model for multi-core computing. (English) Zbl 1210.68134

Summary: Writing software for one parallel system is a feasible, though arduous, task. Reusing the substantial intellectual effort so expended for programming a second system has proved much more challenging. In sequential computing, algorithms textbooks and portable software are resources that enable software systems to be written that are efficiently portable across changing hardware platforms. These resources are currently lacking in the area of multi-core architectures, where a programmer seeking high performance has no comparable opportunity to build on the intellectual efforts of others. To address this problem we propose a bridging model aimed at capturing the most basic resource parameters of multi-core architectures. We suggest that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully expended in designing portable algorithms, once and for all, for such a bridging model. Portable algorithms would contain efficient designs for all reasonable combinations of the basic resource parameters and input sizes, and would form the basis for implementation or compilation for particular machines. Our Multi-BSP model is a multi-level model with explicit parameters for processor numbers, memory/cache sizes, communication costs, and synchronization costs. The lowest level corresponds to shared memory, or the PRAM, acknowledging the relevance of that model for whatever limitations on memory and processor numbers it may be efficacious to emulate. We propose parameter-aware portable algorithms that run efficiently on all relevant architectures, with any number of levels and any combination of parameters. For these algorithms we define a parameter-free notion of optimality.
We show that for several fundamental problems, including standard matrix multiplication, the fast Fourier transform, and comparison sorting, there exist optimal portable algorithms in this sense, for all combinations of machine parameters. Thus some algorithmic generality and elegance can be found in this many-parameter setting.
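The per-level parameters the summary describes lend themselves to a simple concrete representation. Below is a minimal Python sketch, assuming a machine is described as a list of per-level tuples (p, g, L, m) for component count, communication cost, synchronization cost, and memory/cache size at that level; the function names, the example machine, and the tile-size heuristic are illustrative only, not the paper's algorithms.

```python
from functools import reduce

# Hypothetical two-level machine description: each level j is a tuple
# (p_j, g_j, L_j, m_j) -- number of components, cost per word
# communicated, synchronization cost, and memory/cache size in words.
MACHINE = [
    (4, 1, 0, 32 * 1024),          # level 1: cores sharing a cache
    (8, 4, 64, 8 * 1024 * 1024),   # level 2: chips sharing main memory
]

def total_processors(machine):
    """Raw parallelism: the product of the p_j over all levels."""
    return reduce(lambda acc, level: acc * level[0], machine, 1)

def block_size(n, m):
    """Largest tile of an n x n matrix problem such that three
    operand tiles (for A, B and C) fit in a memory of m words."""
    b = int((m // 3) ** 0.5)
    return min(n, max(1, b))

def tile_plan(machine, n):
    """Choose, per level, a tile size that fits that level's memory:
    the kind of parameter-aware decision a portable algorithm makes
    from the machine description rather than per-machine tuning."""
    plan = []
    size = n
    for (p, g, L, m) in reversed(machine):   # outermost level first
        size = block_size(size, m)
        plan.append(size)
    return list(reversed(plan))              # innermost level first
```

The `tile_plan` sketch mirrors, in miniature, the parameter-aware design the summary describes: blocking decisions are derived from the machine's memory parameters at every level, so the same algorithm text adapts to any number of levels and any combination of parameter values.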


68W10 Parallel algorithms in computer science
68M99 Computer system organization




[1] Aggarwal, A.; Alpern, B.; Chandra, A.; Snir, M., A model for hierarchical memory, (), 305-314
[2] Aggarwal, A.; Chandra, A.K.; Snir, M., Communication complexity of PRAMs, Theoret. comput. sci., 71, 1, 3-28, (1990) · Zbl 0699.68054
[3] Aggarwal, A.; Vitter, J.S., The input/output complexity of sorting and related problems, Comm. ACM, 31, 9, 1116-1127, (1988)
[4] Alpern, B.; Carter, L.; Feig, E.; Selker, T., A uniform memory hierarchy model of computation, Algorithmica, 12, 72-109, (1994) · Zbl 0938.68638
[5] Arge, L.; Goodrich, M.; Nelson, M.; Sitchinava, N., Fundamental parallel algorithms for private-cache chip multiprocessors, (), 197-206
[6] G. Bilardi, C. Fantozzi, A. Pietracaprina, G. Pucci, On the effectiveness of D-BSP as a bridging model of parallel computation, in: International Conference on Computational Science, 2001, pp. 579-588. · Zbl 0983.68681
[7] Bilardi, G.; Pietracaprina, A.; Pucci, G.; Herley, K.T.; Spirakis, P., BSP versus LogP, Algorithmica, 24, 405-422, (1999) · Zbl 0943.68067
[8] Bilardi, G.; Pietracaprina, A.; Pucci, G.; Silvestri, F., Network-oblivious algorithms, (), 1-10
[9] Blelloch, G.; Chowdhury, R.; Gibbons, P.; Ramachandran, V.; Chen, S.; Kozuch, M., Provably good multicore cache performance for divide and conquer algorithms, (), 501-510 · Zbl 1192.68026
[10] Bonorden, O., The Paderborn university BSP (PUB) library, Parallel comput., 29, 2, 187-207, (2003)
[11] Chowdhury, R.; Ramachandran, V., Cache-efficient dynamic programming algorithms for multicores, (), 207-216
[12] Culler, D.E.; Karp, R.M.; Patterson, D.; Sahay, A.; Santos, E.E.; Schauser, K.E.; Subramonian, R.; von Eicken, T., LogP: A practical model of parallel computation, Comm. ACM, 39, 11, 78-85, (1996)
[13] de la Torre, P.; Kruskal, C.P., Submachine locality in the bulk synchronous setting (extended abstract), (), 352-358
[14] Dehne, F.; Ferreira, A.; Caceres, E.; Song, S.W.; Roncato, A., Efficient parallel graph algorithms for coarse-grained multicomputers and BSP, Algorithmica, 33, 183-200, (2002) · Zbl 0994.68177
[15] Frigo, M.; Leiserson, C.E.; Prokop, H.; Ramachandran, S., Cache-oblivious algorithms, (), 285-298
[16] Fortune, S.; Wyllie, J., Parallelism in random access machines, (), 114-118 · Zbl 1282.68104
[17] Gerbessiotis, A.V.; Siniolakis, C.J., Efficient deterministic sorting on the BSP model, Parallel process. lett., 9, 1, 69-79, (1999)
[18] Gerbessiotis, A.V.; Valiant, L.G., Direct bulk-synchronous parallel algorithms, J. parallel distributed comput., 22, 251-267, (1994)
[19] Goudreau, M.; Lang, K.; Rao, S.; Suel, T.; Tsantilas, T., Towards efficiency and portability: programming with the BSP model, (), 1-12
[20] Goudreau, M.W.; Lang, K.; Narlikar, G.; Rao, S.B., BOS is boss: A case for bulk-synchronous object systems, (), 115-125
[21] Hill, J.M.D., BSPlib: the BSP programming library, Parallel comput., 24, 14, 1947-1980, (1998)
[22] Hill, M.D.; Marty, M.R., Amdahl’s law in the multicore era, IEEE comput., 41, 33-38, (July 2008)
[23] Hong, J.; Kung, H., I/O-complexity: the red-blue pebble game, (), 326-333
[24] Hou, Q.; Zhou, K.; Guo, B., BSGP: bulk-synchronous GPU programming, ACM trans. graphics, 27, 3, 19:1-19:3, (2008)
[25] Irony, D.; Toledo, S.; Tiskin, A., Communication lower bounds for distributed-memory matrix multiplication, J. parallel distributed comput., 64, 9, 1017-1026, (2004) · Zbl 1114.68081
[26] Karp, R.M.; Ramachandran, V., Parallel algorithms for shared memory machines, (), 869-941 · Zbl 0900.68267
[27] Kumar, R.; Tullsen, D.; Jouppi, N.; Ranganathan, P., Heterogeneous chip multiprocessors, IEEE comput., 32-38, (November 2005)
[28] McColl, W.F.; Tiskin, A., Memory-efficient matrix computations in the BSP model, Algorithmica, 24, 3-4, 287-297, (1999) · Zbl 0943.68066
[29] M.H. Nodine, J.S. Vitter, Optimal deterministic sorting on parallel processors and parallel memory hierarchies, manuscript, 1993. Also: Deterministic distribution sort in shared and distributed memory multiprocessors, in: Proc. ACM Symp. on Parallel Algorithms and Architectures, 1993, pp. 120-129.
[30] V. Ramachandran, Note: Dynamic programming and hierarchical divide and conquer for hierarchical multi-level memories, personal communication, July 2008.
[31] Sampson, J.; Gonzalez, R.; Collard, J.-F.; Jouppi, N.P.; Schlansker, M.S., Fast synchronization for chip multiprocessors, SIGARCH comput. architecture news, 33, 4, 64-69, (2005)
[32] Savage, J.E., Extending the Hong-Kung model to memory hierarchies, (), 270-281
[33] Savage, J.E., Models of computation, (1998), Addison-Wesley Reading, MA
[34] Savage, J.E.; Zubair, M., A unified model of multicore architectures, ()
[35] Shi, H.; Schaeffer, J., Parallel sorting by regular sampling, J. parallel distributed comput., 14, 362-372, (1992) · Zbl 0763.68030
[36] Tiskin, A., The bulk-synchronous parallel random access machine, Theoret. comput. sci., 196, 1-2, 109-130, (1998) · Zbl 0902.68072
[37] Valiant, L.G., A bridging model for parallel computation, Comm. ACM, 33, 8, 103-111, (1990)
[38] Vishkin, U.; Dascal, S.; Berkovich, E.; Nuzman, J., Explicit multi-threading (XMT) bridging models for instruction parallelism, (), 140-151
[39] Vitter, J.S.; Shriver, E.A.M., Algorithms for parallel memory II: hierarchical multilevel memories, Algorithmica, 12, 2/3, 148-169, (1994) · Zbl 0917.68086