
Emmerald: a fast matrix-matrix multiply using Intel’s SSE instructions. (English) Zbl 1008.65504

Summary: Generalized matrix-matrix multiplication forms the kernel of many mathematical algorithms, so a faster matrix-matrix multiply immediately benefits those algorithms. In this paper we implement efficient multiplication of large matrices using the single instruction, multiple data (SIMD) floating-point architecture of the Intel Pentium III. The main difficulty on the Pentium and other commodity processors is using the cache hierarchy efficiently, particularly given the growing gap between main-memory and CPU clock speeds. We give a detailed description of the register-allocation and Level 1 and Level 2 cache-blocking strategies that yield the best performance on the Pentium III family. Our results show an average speedup of 2.09 over the leading public-domain matrix-matrix multiply routines, and performance comparable to Intel's own SIMD small-matrix multiply routines.
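The kernel the summary describes is, in outline, a cache-blocked SGEMM with a SIMD inner loop. The sketch below illustrates that idea in C with SSE intrinsics; it is not the Emmerald code itself, and the block size NB, the loop order, and the assumption that the matrix dimension divides evenly are illustrative choices rather than the paper's tuned values.

/* A minimal cache-blocked SGEMM sketch using SSE intrinsics.
   Illustrative only: NB and the loop nest are assumptions for
   clarity, not Emmerald's tuned parameters. */
#include <xmmintrin.h>   /* SSE (Pentium III) intrinsics */

#define NB 64            /* L1 block size: an illustrative choice */

/* C += A * B for n-by-n row-major single-precision matrices.
   Assumes n is a multiple of NB and NB is a multiple of 4, so the
   j loop can run four lanes at a time. */
void sgemm_blocked(int n, const float *A, const float *B, float *C)
{
    for (int jj = 0; jj < n; jj += NB)       /* block over columns of C */
        for (int kk = 0; kk < n; kk += NB)   /* block over the k dimension */
            for (int i = 0; i < n; i++)      /* rows of C */
                for (int k = kk; k < kk + NB; k++) {
                    /* broadcast A[i][k] into all four SSE lanes */
                    __m128 a = _mm_set1_ps(A[i*n + k]);
                    for (int j = jj; j < jj + NB; j += 4) {
                        /* C[i][j..j+3] += A[i][k] * B[k][j..j+3] */
                        __m128 b = _mm_loadu_ps(&B[k*n + j]);
                        __m128 c = _mm_loadu_ps(&C[i*n + j]);
                        c = _mm_add_ps(c, _mm_mul_ps(a, b));
                        _mm_storeu_ps(&C[i*n + j], c);
                    }
                }
}

The two outer blocked loops correspond to the Level 1 blocking the summary mentions; the paper's actual kernel additionally tiles for the register file and adds a Level 2 blocking layer, which this sketch omits.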

MSC:

65Y10 Numerical algorithms for specific classes of architectures
65F30 Other matrix algorithms (MSC2010)

Software:

PHiPAC; Emmerald; ATLAS
