Homework 1 (9/5 - 9/20): Matrix Multiply [ Results ]

[ Groups ] [ Brief Summary ] [ Overview ] [ Details ] [ Submission ] [ Grading ]
[ Processors ] [ References ]

[ Millennium Essentials ]

Groups

The groups were chosen rather arbitrarily (and somewhat in alphabetical order) to combine people from different departments and to ensure there is someone familiar with application development or coding in the group. In future assignments, you will choose your own groups, but for now, we would like to shuffle people together. Grading will take into account group sizes. Please send your information if you have not done so yet. Sorry, in advance, for any lost emails.

Barad, Cheung, Gonzalez

Edelson, Gies, Griem

Hafeman, Kim

Lam, Patra, Vanroose

Moussa, Wu, Zhou

Wemhoff, Zhuang, Zargar

W. Chen, Liu, Zhu

Yue, Parekh, Nedevschi

K. Chen?, Nie

Brief Summary

Optimize sequential code for square matrix multiply (aka dgemm: C = C + A B where A, B, and C have dimension m by m) on the Millennium Katmai P3 550 MHz processor and run the tuned code on a different machine of your choice. Explain the optimizations you use and the performance characteristics (e.g. cache sizes, latencies) of the Katmai processor. What happens when you run the code on a different machine? Why?

The class performance results for the Katmai processor will be compared based on chosen values of m (not disclosed until after deadline).

Overview

Matrix multiplication is a basic building block in many scientific computations; and since it is an O(n3) algorithm, these codes often spend much of their time in matrix multiplication. The most naive code to multiply matrices is short, sweet, simple, and very slow:

  for i = 1 to m
    for j = 1 to m
      for k = 1 to m
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      end
    end
  end 

We're providing implementations of the trivial unoptimized matrix multiply code in C and Fortran, and a very simple blocked implementation in C. We're also providing a wrapper for the BLAS (Basic Linear Algebra Subroutines) library. The BLAS is a standard interface for building blocks like matrix multiplication, and many vendors provide optimized versions of the BLAS for their platforms. You may want to compare your routines to codes that others have optimized; if you do, think about the relative difficulty of using someone else's library to writing and optimizing your own routine. Knowing when and how to make use of standard libraries is an important skill in building fast programs!

The unblocked routines we provide do as poorly as you'd expect, and the blocked routine provided doesn't do much better.

The performance of the two routines is shown above. The naive unblocked routine is the "B=1" curve, and is just the standard three nested loops. The blocked code (block size of 30, selected as the "best" in an exhaustive search of square tile sizes up to B=256) is shown in green. The peak performance of the machine used is 667 MFlop/s, and both implementations achieve less than a quarter of that performance. However, the unblocked routine's performance completely dies on matrices larger than 255 x 255, while the blocked routine gives more consistent performance for all the tested sizes. Note the performance dips at powers of 2 (cache conflicts). The machine used for this diagram has a 2 Mbyte external cache, and 3 matrices * 256x256 entries * 8 bytes/entry is 1.5 Mbytes. Obviously, there's overhead in both operation count and memory usage.

Details

The code is distributed in tgz format and directly.

You need to write a dgemm.c that contains a function with the following C signature:

      void
      square_dgemm (const unsigned M, 
                    const double *A, const double *B, double *C)
If you would prefer, you can also write a Fortran function in a file fdgemm.f:
        subroutine sdgemm(M, A, B, C)
c
c       .. Parameters ..
        integer M
        double precision A(M,M)
        double precision B(M,M)
        double precision C(M,M)
Note that the matrices are stored in column-major order. Also, your program will actually be doing a multiply and add operation:
    C := C + A*B

We will be testing on the 550 MHz Pentium 3 (Katmai) nodes. To restrict what servers are used by gexec, you can set the GEXEC_SVRS environment variable. For instance, to use mm61 through mm65, I would set GEXEC_SVRS to "mm61 mm62 mm63 mm64 mm65". See a brief intro to the Millennium system here.

The y-axis in the gnuplot script plots is labeled MFlop/s, but really it should be "MFlop/s assuming you were using the standard algorithm." If you use something like Strassen's, we will still measure time/(2n3).

You will get a lot of improvement from tiling, but you may want to play with other optimizations, too. Strassen-like algorithms, copy optimizations, recursive data layout, ... you can get some pretty elaborate optimization strategies. We'll cover some of the possibilities in discussion, but you might want to do some independent reading, too.

Submission

Email your group's tarred write-up and code, or create a webpage that has them. The write-up should contain:

To show the results of your optimizations, include a graph comparing your dgemm.c with the included basic_dgemm.c. For the last requirement, try your tuned dgemm.c on another hardware platform (like the NoW machines or Millennium or your home machine) and explain why it performs poorly. Your explanations should rely heavily on knowledge of the memory hierarchy. (Benchmark graphs help.)

Grading

Grading will be based on the analysis and description of different algorithms attempted and the performance comparisons done (graphs would be helpful) as well as the performance that the code achieves on the Katmai processors.

Millennium Katmai 550 MHz Processors

mm42 mm43 mm45 mm46 mm48 mm61 mm62 mm63 mm64 mm65 mm66 mm67 mm68 mm69 mm70 mm71 mm73 mm76 mm77 mm80

References