The NERSC MADBench README File
MADBench: Microwave Anisotropy Dataset Computational Analysis Package Benchmark
The benchmark code MADBench is a "stripped-down" version of MADCAP, a Microwave Anisotropy Dataset Computational Analysis Package. This package contains several computational tools developed for analysis of data from Cosmic Microwave Background (CMB) experiments. [CMB1] [CMB2] [CMB3] The goal of these experiments is to extract the wealth of cosmology information embedded in the CMB that is related to the universe at an age of about 400,000 years after the Big Bang. Such experiments typically involve scanning a significant amount of the sky for long periods at very high resolution. The reduction of the resulting datasets, first to a pixelified sky map and then to an angular power spectrum, is extremely computationally intensive.
MADBench is a version of MADCAP designed specifically for benchmarking. It preserves both the computational and I/O challenges of the original code while reducing approximately 10,000 lines of code to fewer than 1,000. Additionally, it requires no input data files. The primary computational challenge is an out-of-core solution to a dense linear algebra problem on distributed matrices. Only one I/O function is tested in MADBench, and each process reads/writes to its own single disk file.
The MADBench benchmark has been further developed into a "MADBench2" benchmark [MADBench2] that can be run in a "full" mode or an "I/O" mode, the latter of which replaces all computation with dummy work and just measures read and write times. This can be useful for I/O and/or filesystem benchmarking and tuning. However, only "full" mode is used here.
MADCAP recasts the extraction of a CMB power spectrum from a sky map into a problem of dense linear algebra and exploits ScaLAPACK libraries for efficient solution. The computation estimates the angular power spectrum using a Newton-Raphson iteration to locate the peak of the spectral likelihood function. This involves calculating the first and second derivatives of the likelihood function with respect to the power spectrum coefficients. Since the data typically lack sufficient sky coverage and/or resolution to obtain individual multipole coefficients, the code bins them instead.
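The Newton-Raphson idea can be illustrated in one dimension. The following is only a sketch under stated assumptions: the toy log-likelihood, its derivatives, and the function name `newton_raphson_peak` are hypothetical and are not taken from MADCAP, which works with many bin coefficients at once and therefore with derivative vectors and matrices rather than scalars.

```python
def newton_raphson_peak(grad, hess, x0, tol=1e-10, max_iter=50):
    """Locate a stationary point of a 1-D function by Newton-Raphson.

    grad and hess are callables returning the first and second
    derivatives of the function whose peak is sought.
    """
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)   # Newton step: f'(x) / f''(x)
        x -= step
        if abs(step) < tol:        # stop once the update is negligible
            break
    return x


# Hypothetical toy log-likelihood: L(x) = -(x - 3)^2 - 0.1 * (x - 3)^4,
# which has its single peak at x = 3.
def toy_grad(x):
    return -2.0 * (x - 3.0) - 0.4 * (x - 3.0) ** 3


def toy_hess(x):
    return -2.0 - 1.2 * (x - 3.0) ** 2


peak = newton_raphson_peak(toy_grad, toy_hess, x0=2.0)
print(f"peak located at x = {peak:.6f}")
```

In the multi-dimensional case the division by the second derivative becomes the solution of a linear system, which is why the bin-bin second-derivative matrix has to be formed and inverted.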
Although the pixel-pixel correlation matrices are huge, the MADCAP/MADBench implementation has no more than three matrices in memory at any one time. The operational steps and algorithmic complexity for a map with NPIX pixels and NMPL multipoles in NBIN bins are [MADBench]:
- Calculate signal correlation derivative matrices, O(NMPL X NPIX²).
- Form and then invert the data correlation matrix, O(NPIX³).
- For each bin, form the product of the result of steps 1 and 2, O(NBIN X NPIX³).
- For each bin, calculate the first derivative of the likelihood function with respect to bin coefficients, O(NBIN X NPIX²).
- For each bin-bin pair, calculate the second derivative, O(NBIN² X NPIX²).
- Invert the bin correlation matrix and calculate the spectral correction, O(NBIN³).
Steps 1-3 are computationally dominant because NBIN << NMPL << NPIX.
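To make the dominance of steps 1-3 concrete, the leading-order operation counts listed above can be evaluated for sample sizes. This is a rough sketch only: constant factors are ignored, the function name `step_costs` is hypothetical, NPIX and NBIN are taken from the medium test case, and NMPL = 1000 is an assumed value chosen purely for illustration.

```python
def step_costs(npix, nmpl, nbin):
    """Return the leading-order operation count of each of the six steps."""
    return {
        1: nmpl * npix**2,      # signal correlation derivative matrices
        2: npix**3,             # form and invert the data correlation matrix
        3: nbin * npix**3,      # per-bin matrix products
        4: nbin * npix**2,      # first derivatives of the likelihood
        5: nbin**2 * npix**2,   # second derivatives, one per bin-bin pair
        6: nbin**3,             # invert the bin correlation matrix
    }


# NMPL = 1000 is an assumption made for this example only.
costs = step_costs(npix=18000, nmpl=1000, nbin=24)
total = sum(costs.values())
for step, cost in costs.items():
    print(f"step {step}: {cost:.3e} operations ({100.0 * cost / total:.4f}% of total)")

dominant_share = (costs[1] + costs[2] + costs[3]) / total
print(f"steps 1-3 share of total: {dominant_share:.4f}")
```

With these sizes the per-bin matrix products of step 3 account for nearly all of the work.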
MADBench consists of a single source file with about 600 lines of C code and a single header file containing macro definitions to adjust C to Fortran linkage conventions for library calls. It implements only the dominant computational steps (1-3 above) plus a simplified version of steps 4-6 to confirm run correctness. See the Parallelization Section, below, for more details.
The following BLAS routines are used:
There are two versions of the dSdC routine included in the distribution. One (dSdCv) is vectorizable and may be used whenever it is faster than the default. If you use this one, check that the appropriate pragmas are activated.
64-bit floating-point precision is used.
Relationship to NERSC Workload
Astrophysics. The code used a large amount of time in FY04 and is one of the highest-scaling codes in the NERSC workload.
All pixel-pixel correlation matrices are ScaLAPACK block-cyclic distributed, so each processor owns a unique subset of each matrix and writes/reads this same subset to its own disk file. Although this adds simplicity when using distributed filesystems, it can result in interconnect or network contention. Therefore, the code can reduce the number of processors performing concurrent read and write I/O to user-specified fractions of the total (via input variables RMOD and WMOD, respectively). This is implemented with a token-passing scheme using MPI barriers. Optimizing these parameters, as well as the ScaLAPACK block size (BSIZE), requires experimentation on each architecture.
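The ownership rule behind a block-cyclic distribution can be sketched in one dimension. The sketch assumes 0-based indices and that the first block lives on process 0; ScaLAPACK applies the same rule independently to rows and columns over a two-dimensional process grid. The function names and the small sizes used here are hypothetical and are not MADBench settings.

```python
def block_cyclic_owner(global_index, block_size, nprocs):
    """Return (owner, local_index) for a 0-based global index in one dimension."""
    block = global_index // block_size      # which block the index falls in
    owner = block % nprocs                  # blocks are dealt out cyclically
    local_index = (block // nprocs) * block_size + global_index % block_size
    return owner, local_index


def owned_indices(rank, n, block_size, nprocs):
    """List the global indices in range(n) that belong to the given rank."""
    return [i for i in range(n)
            if block_cyclic_owner(i, block_size, nprocs)[0] == rank]


# Small illustrative values, not MADBench defaults.
n, block_size, nprocs = 10, 2, 3
for rank in range(nprocs):
    print(f"rank {rank} owns global indices {owned_indices(rank, n, block_size, nprocs)}")
```

Because every process owns a fixed, disjoint set of entries, each one can write and later read back exactly its own piece without coordinating file offsets with the others.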
The six MADCAP computational steps outlined above become the following 3 steps in MADBench:
- dSdC For each bin calculate the processor-local dSdC derivative matrix and write it to disk. This step involves neither interprocessor communication nor reading.
- invD For each bin read in the processor-local dSdC derivative matrix and weight-accumulate it to build the processor-local correlation matrix D; then, invert this by Cholesky decomposition and triangular solution using the ScaLAPACK pdpotrf and pdpotri routines. This step involves no writing.
- W For each bin read in the processor-local dSdC derivative matrix and perform the dense matrix-matrix multiplication of it with the inverted matrix from step 2 using the ScaLAPACK pdgemm routine. This step involves no writing.
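The factor-then-invert pattern of the invD step, followed by the product formed in the W step, can be shown serially on a tiny matrix. This is a sketch only: it uses plain Python lists instead of distributed matrices, all function names and matrix values are hypothetical, and the inverse is built from column-by-column triangular solves, which is not necessarily how pdpotri proceeds internally.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L L^T (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def invert_from_cholesky(low):
    """Return (L L^T)^-1 by solving L y = e_k, then L^T x = y, for each column k."""
    n = len(low)
    inv = [[0.0] * n for _ in range(n)]
    for k in range(n):
        y = [0.0] * n
        for i in range(n):                       # forward substitution
            rhs = (1.0 if i == k else 0.0) - sum(low[i][j] * y[j] for j in range(i))
            y[i] = rhs / low[i][i]
        x = [0.0] * n
        for i in reversed(range(n)):             # back substitution
            rhs = y[i] - sum(low[j][i] * x[j] for j in range(i + 1, n))
            x[i] = rhs / low[i][i]
        for i in range(n):
            inv[i][k] = x[i]
    return inv


def matmul(a, b):
    """Dense matrix-matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


d = [[4.0, 2.0], [2.0, 3.0]]        # hypothetical correlation matrix D
dsdc = [[1.0, 0.5], [0.5, 2.0]]     # hypothetical derivative matrix for one bin

d_inv = invert_from_cholesky(cholesky(d))   # serial analogue of step 2
w = matmul(dsdc, d_inv)                     # serial analogue of step 3
print("D^-1 =", d_inv)
print("W    =", w)
```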
The matrix-matrix multiplications in step 3 are entirely independent of one another, so two implementations are possible. The first would proceed as above, but the second would introduce a level of gang-parallelism, with NGANG of the dSdC matrices being remapped to a different subset of processors using the ScaLAPACK pdgemr2d function and all of the processor "gangs" simultaneously calling pdgemm.
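One way to picture the gang layout is to split the ranks into NGANG equal groups and hand out bins among them. The mapping below is an assumption made for illustration (contiguous ranks per gang, bins dealt round-robin); the actual assignment used by MADBench is not described here, and `gang_layout` is a hypothetical helper.

```python
def gang_layout(nprocs, ngang, nbin):
    """Split ranks into equal gangs and deal bins to the gangs round-robin.

    Assumed mapping, for illustration only: ranks are grouped contiguously
    and bin b is handled by gang b % ngang.
    """
    if nprocs % ngang != 0:
        raise ValueError("nprocs must be divisible by ngang")
    gang_size = nprocs // ngang
    gang_of_rank = [rank // gang_size for rank in range(nprocs)]
    bins_of_gang = [[b for b in range(nbin) if b % ngang == g]
                    for g in range(ngang)]
    return gang_size, gang_of_rank, bins_of_gang


# Sizes taken from the xlarge case: 1024 processes, 16 gangs, 16 bins.
gang_size, gang_of_rank, bins_of_gang = gang_layout(nprocs=1024, ngang=16, nbin=16)
print(f"{gang_size} processes per gang")
print(f"rank 100 belongs to gang {gang_of_rank[100]}")
print(f"gang 5 multiplies bins {bins_of_gang[5]}")
```

Under this assumed mapping, the xlarge case would give each gang of 64 processes exactly one matrix product to perform.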
Only weak scalability is of importance.
For NERSC-related procurements please visit the procurement site.
Otherwise, to obtain the code, contact Julian Borrill.
The code is contained in a single C file with one header file. A single makefile is provided with build options defined for NEC, CRAY X1, SGI Altix, and IBM Pentium-III; all but one of these should be commented out before use.
Several libraries are used: ScaLAPACK, LAPACK, BLACS; on some systems ACML, ESSL.
The file madbench.h may be edited to adjust C to Fortran linkage conventions, in particular, trailing underscores.
To build, just type "make".
| File | Description |
|---|---|
| madbench.h | #defines to help with different calling conventions between Fortran and C |
| makefile | makefile with different machine definitions |
| benchmark/small.ref | sample output for small problem |
| benchmark/medium.ref | sample output for medium problem |
| benchmark/large.ref | sample output for large problem |
| benchmark/verify | shell script to verify correct answers |
| benchmark/small.ll | LoadLeveler script to batch submit small problem |
| benchmark/small.pbs | PBS script to batch submit small problem |
The program takes 6 input arguments. There is no other input. The six input arguments are:
| Argument | Description |
|---|---|
| NO_PIX | Number of pixels |
| NO_BIN | Number of multipole bins |
| NO_GANG | Number of independent work gangs to divide the processors into |
| BLOCKSIZE | Block size used in ScaLAPACK operations |
| R_MOD | Number of processes concurrently reading data |
| W_MOD | Number of processes concurrently writing data |
Invoke the application by typing, for example,
mpirun -n 64 madbench.x 18000 24 1 128 1 1 > medium.out
You may alter BLOCKSIZE, R_MOD, and W_MOD to get the fastest execution time on your platform.
The other three parameters vary according to the test. See the Required Runs section.
Elapsed (wall clock) time is measured with the MPI_Wtime function. The run time of interest is printed on the line labelled "NERSC_TIME".
Memory required is minimal (16 arrays x 12.5 Mb data per array per CPU). Disk storage can be considerable; note that the large test takes around 100GB of disk storage and the xlarge around 200GB.
It is important to size MADBench storage so that the system's block buffer cache does not cache all the I/O. The file storage must be greater than the total memory available on all nodes on which the code is run and should preferably be several times greater.
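A rough storage estimate can help when sizing a run against the available memory. The sketch below assumes that disk usage is dominated by one NPIX x NPIX matrix of 64-bit values per bin, which is inferred from the description of the dSdC step rather than stated as a formula anywhere in this README; the function names and the `factor` parameter are hypothetical, and 1 GB is taken as 10^9 bytes.

```python
BYTES_PER_DOUBLE = 8  # 64-bit floating-point values


def dsdc_storage_bytes(npix, nbin):
    """Estimate total disk usage: one NPIX x NPIX matrix of doubles per bin."""
    return nbin * npix * npix * BYTES_PER_DOUBLE


def storage_exceeds_memory(npix, nbin, total_memory_bytes, factor=1.0):
    """Check whether the estimated storage exceeds factor times total memory."""
    return dsdc_storage_bytes(npix, nbin) > factor * total_memory_bytes


# (NO_PIX, NO_BIN) for the four required run sizes.
cases = {
    "small": (5000, 16),
    "medium": (18000, 24),
    "large": (32000, 16),
    "xlarge": (40000, 16),
}
for name, (npix, nbin) in cases.items():
    print(f"{name}: about {dsdc_storage_bytes(npix, nbin) / 1e9:.1f} GB")
```

Under this assumption the large and xlarge estimates come out near the figures of around 100GB and 200GB quoted above. Passing a `factor` greater than 1 to `storage_exceeds_memory` reflects the advice that storage should be several times the total memory.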
| Case | MPI processes | NO_PIX | NO_BIN | NO_GANG |
|---|---|---|---|---|
| small | 4 | 5000 | 16 | 1 |
| medium | 64 | 18000 | 24 | 1 |
| large | 256 | 32000 | 16 | 1 |
| xlarge | 1024 | 40000 | 16 | 16 |
There is a subdirectory "benchmark" in which input data files and sample batch submission scripts are located. Four sample input files for four different run sizes are provided: "small", "medium", "large", and "xl" (extra-large). The small case should be used for testing. Benchmark timings are required for the medium, large, and xl cases. Each case must be executed with a fixed concurrency.
Sample output files (named <size>.ref) from the NERSC IBM SP are provided. You should check to see if the value printed for dC agrees with that of the reference calculations. A script called "verify" in the "benchmark" subdirectory is provided to do this. To run it type
The script has the correct values hard-coded for the small, medium, large, and extra-large cases, but it does not print "OK" or "Failed." You must verify that the calculated value differs by less than 1.e-5 from the "correct" value.
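Since the script leaves the pass/fail decision to the reader, the comparison itself is easy to express. The sketch below assumes the tolerance is an absolute difference, as the wording above suggests; the function name `dc_matches` and the numeric values are hypothetical and are not taken from any reference output.

```python
def dc_matches(calculated, reference, tolerance=1.0e-5):
    """Return True when the calculated dC is within tolerance of the reference."""
    return abs(calculated - reference) < tolerance


# Hypothetical values for illustration; substitute the dC printed by the
# run and the corresponding value from the <size>.ref file.
calculated_dc = 1.2345678
reference_dc = 1.2345700
print("OK" if dc_matches(calculated_dc, reference_dc) else "Failed")
```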
- This is MADBench Version N5L February 2005
[CMB1] Cosmic Microwave Background Analysis Tools http://crd.lbl.gov/~borrill/cmb/combat
[CMB2] Detailed measurement of the cosmic microwave background radiation (CMB) http://www.lbl.gov/CS/Archive/headlines4-26-00.html
[CMB3] CMB at NERSC http://crd.lbl.gov/~borrill/cmb/nersc/
[MADBench2] Integrated Performance Monitoring of a Cosmology Application on Leading HEC Platforms, http://crd.lbl.gov/~oliker/papers/ICPP_2005.pdf
[MADBench] Reconfigurable Hybrid Interconnection for Static and Dynamic Scientific Applications, http://repositories.cdlib.org/lbnl/LBNL-58001