Homework 3 (10/22 - 11/15): Parallel Preconditioned Conjugate Gradients
[ Intro
] [ Details
] [ Submit ] [ Grades ] [ References
]
[ Discussion ] [ FAQ ]
[ Millennium Essentials ] [ NERSC Essentials ]
The method of conjugate gradients (CG) is an iterative technique for solving symmetric positive-definite linear systems. The conjugate gradient algorithm, popular in practice, is similar in structure to many other linear and nonlinear optimization and equation-solving algorithms, and is relatively simple to code. All these points make CG an attractive benchmark kernel. Indeed, CG appears in both the NAS parallel benchmarks and the SPEC floating point benchmark suite.
Read discussion notes for more details of the assignment and description of the mathematics.
You may work in groups of up to 3.
The implementations we give you are capable of solving a simple model problem (the 1-d Poisson equation) or more general sparse matrix problem. The code can read sparse matrix files in the Matrix Market format; for instancecg gr_30_30.mtxwill solve the 30-by-30 two-dimensional Poisson problem from the Matrix Market file gr_30_30.mtx.We provide a serial implementation in C and Titanium (readme). Examine and profile one of the serial codes on a platform of your choice (does not have to be Millennium or seaborg). If you choose C for this portion, you may still choose Titanium for your parallel version (or visa versa). Familiarize yourself with the implementation of the CG algorithm. What operation(s) seem to take a long time? What kind of pattern do the memory accesses follow? What might these observations tell you about how to parallelize the code? Try downloading some other files from the Matrix Market. How are the memory access patterns changed with a different matrix structure?
- serial C
- serial Titanium (links to a readme on the titanium code)
Select either Millennium or seaborg (or your own if you are using UPC) as the parallel machine on which you want to optimize CG.
We provide a parallel implementation in UPC (warning) and Titanium (readme). Select either UPC (warning) or Titanium as the language in which you wish to parallelize CG. Our expectations for UPC optimizations and extensions will be less than our expectations for Titanium optimizations and extensions. We provide initial Titanium and UPC code. If you are feeling ambitious, choose a different parallel language and implement your own CG algorithm.
- parallel UPC (warning)
- parallel Titanium (links to a readme on the titanium code)
Run, time, and evaluate your chosen CG code on your chosen platform. Are there obvious bottlenecks? What do your timings tell you about where to target your improvements?
Do some optimizations. Possible targets of optimization include the serial performance of the sparse matrix-vector multiply (or other parts of the code) and the communication patterns. Experiment with different partitioning schemes as discussed in lecture.Extend the program. Possible extensions include optimizations targeted toward matrices with a specific structure or implementation of a differently-organized version of CG. For re-ordering the matrix, you may try to use the RCM (reverse cuthill McKee) algorithm to improve locality for the matrix vector multiplication. Pick something that looks interesting, and if you are unsure about it, just ask. If you are ambitious, you may try implementing a better preconditioner for some problem of interest. How does changing the preconditioner change the iteration count, the time each iteration takes, and the overall processing time?
If you have time, try compiling your code on the other platform and see if it has the same performance characteristics. Why might they be different/similar?
UPC Warning: There does not seem to be support for UPC currently on either Millennium or seaborg. If you have access to a parallel system that can compile UPC code, you may use that system.
Titanium: If you want more discussion on Titanium, possibly see last year's slides.
Your group needs to submit your code, along with a write-up describing what you did and why you did it. Tell us about the modifications you made to the code; but more importantly, tell us about any profiling, experiments, models, or other considerations that lead you to make your changes.
A web page for your write-up would be preferable, but we will accept other formats, too. When you are ready, e-mail your submission (or a pointer to it) to the TA.
Grades will be based on your explanations and problem-solving attempts. The optimization and extension guidelines are flexible. If you mainly focus on interesting extensions, then less expectation will be given to optimizations.
- Lenny Oliker in his lecture presented a couple of suggestions on how to improve the performance of PCG. See his papers.
- UPC at George Mason University is another place to look for UPC tutorials and information.
- LAPACK working Note #56 discusses how to reorganize CG to eliminate reductions. (Fewer global communication steps is a good thing.)
- Note that in csr_problem.c the macros MAX_N and MAX_NNZ determine the maximum allowable number of rows and the maximum number of nonzeros. If you get a failed assertion that says something about MAX_N or MAX_NNZ, you probably need to make the values of those macros larger.
- The UPC language reference and the Titanium page will be valuable references.
- Templates for the Solution of Linear Systems is a great source of information on implementing and optimizing CG, choosing a preconditioner, etc. There is also source (albeit source written in Fortran with a reverse communication interface).
- Parallel Numerical Linear Algebra by Demmel, Heath, and van der Vorst. Additional information on parallel implementation of CG.
- An Introduction to the Conjugate Gradient Method Without the Agonizing Pain is a very popular (and very gentle) introduction to the mathematics of CG. If you have vague recollections of eigenthingies from your distant past, you can probably grok this explanation.
- The PETSc toolkit includes CG and several of its relatives, along with a variety of preconditioners. You might also be interested in ITPACK, a package of iterative solvers that includes CG.
- The Matrix Market is a great source for test cases. Some of the suggested test cases from previous years included gr_30_30, nos1, nos2, nos3, nos4, nos5, nos6, nos7 (all smallish), and bcsstk16, bcsstk17, bcsstk18 (larger). These look pretty arbitrarily chosen; other SPD problems from the Matrix Market could be interesting, too. Note that you want the Matrix Market format files, not Harwell-Boeing format.
- Sparsity is a package by Eun-Jin Im for optimizing sparse matrix-vector multiply. It focuses on unsymmetric problems; you may be able to get some additional benefit from the symmetry.
- Something like METIS could be helpful for partitioning your problem across processors. I know METIS and ParMETIS are installed on the SP; we'll have to find out if there exist interfaces for Titanium.
- The C and UPC implementations call LAPACK (or equivalently CLAPACK on the Millennium) to do the Cholesky factorization and solves for the block Jacobi preconditioner. LAPACK relies on the BLAS to do most of its work. If you are running an RPM-friendly Linux distribution, you can grab RPM packages. Note that the reference BLAS library provided with LAPACK gets crappy performance. You really want something like ATLAS if you plan to get good performance from LAPACK and the BLAS.
- You may notice that for slowly-converging cases, the behavior is different for different numbers of processors even when the code appears mathematically identical. Look at the residual histories when this happens. Do they start of the same, and then start to drift away from each other? If so, you may be seeing certain round-off effects, probably due to differences in ordering of sums in the dot product.
[ Assignment ] [ Compilation ] [ Execution ] [ Miscellaneous ]
UPC?
UPC is not ready on Seaborg or Millennium, so this option is unsupported. If you know of a machine that runs UPC and can do performance testing on that machine, then those results will be counted. Remember that you can choose to implement CG in a different parallel language.
Fresh hand-picked matrices?
http://math.nist.gov/MatrixMarket/data/misc/xlatmr/matgen.html - up to size 250. -- thanks to Scott Hafeman
Reducing bandwidth of sparse matrices? Try ParMETIS:
http://www-users.cs.umn.edu/~karypis/metis/parmetis/ -- thanks to Sukun Kim
Other software you may want to try:
TI-PETSC - http://www.cs.berkeley.edu/~liblit/ti-petsc/ -- thanks to Wim Vanroose (it unfortunately does not work on seaborg or Millennium)
The ACTs toolkit on seaborg provides PETSc. Have a look at http://acts.nersc.gov/petsc/main.html .
CSR.ti has an incorrect implementation of toString -- Wim Vanroose
public String toString () {
String result = "";
int n = rowStarts.domain().size()-1;
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
int flag=-1;
for(int curr = rowStarts[i];
curr < rowStarts[i+1]; curr++) {
if ( rowIndex[curr] == j) {
flag = curr;
}
}
if ( flag != -1) {
result += data[flag] + "\t";
}
else {
result += "0\t";
}
}
result += "\n";
}
return result;
}