The NERSC MILC README File
MILC: MIMD Lattice Computation
The benchmark code MILC is part of a set of codes written by the MIMD Lattice Computation (MILC) collaboration to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. "Strong interactions" are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus. [MILC]
The MILC collaboration has produced application codes to study several different QCD research areas; only one of these, ks_dynamical (simulations with conventional dynamical Kogut-Susskind quarks), is used here.
Lattice QCD discretizes space-time and evaluates field variables on the sites and links of a regular hypercubic lattice in four-dimensional space-time. Each link between nearest neighbors in this lattice is associated with a 3 x 3 complex SU(3) matrix for a given field. [HOLMGREN] The version of MILC used here uses lattices ranging in size from 8^4 to 128^4.
The MILC code has been optimized to achieve high efficiency on cache-based superscalar processors. Both ANSI standard C and assembler-based codes for several architectures are provided.
Lines of C code:
|Directory||# Files||Total Lines|
QCD involves integrating an equation of motion for hundreds or thousands of time steps, and each step of the integration requires inverting a large, sparse matrix. The sparse linear system is solved with a conjugate gradient (CG) method, but because the system is nearly singular, many CG iterations are required for convergence. Within a processor, the four-dimensional nature of the problem requires gathers from widely separated locations in memory. The matrix in the linear system contains 3 x 3 complex "link" matrices, one per 4-D lattice link, but only the entries connecting odd sites to even sites are non-zero. The CG inversion therefore requires repeated multiplications of a 3 x 3 complex matrix by a 3-element complex vector, each of which reduces to three dot products of pairs of three-dimensional complex vectors. The code separates the real and imaginary parts, producing six dot products of pairs of six-dimensional real vectors. Each such dot product consists of one multiply and five multiply-add operations. [GOTTLIEB]
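The arithmetic described above can be sketched in C as follows. This is a minimal illustration, not MILC's code: the types cplx, su3_mat, and su3_vec and the functions dot6 and mat_vec_mult are hypothetical simplified stand-ins for the definitions in su3.h, and double precision is assumed. Each output component uses two real dot products of length six (one for the real part, one for the imaginary part), so a full matrix-vector product uses six of them.

```c
#include <assert.h>

/* Hypothetical simplified types; MILC's su3.h defines its own. */
typedef struct { double real, imag; } cplx;
typedef struct { cplx e[3][3]; } su3_mat;
typedef struct { cplx c[3]; } su3_vec;

/* Dot product of two real 6-vectors: one multiply followed by five multiply-adds. */
static double dot6(const double a[6], const double b[6]) {
    double sum = a[0] * b[0];
    for (int k = 1; k < 6; k++)
        sum += a[k] * b[k];
    return sum;
}

/* y = m * x with real and imaginary parts separated.
   Each row needs two real dot products, so the whole product needs six. */
static void mat_vec_mult(const su3_mat *m, const su3_vec *x, su3_vec *y) {
    assert((const void *)x != (const void *)y); /* output must not alias input */
    for (int i = 0; i < 3; i++) {
        double xv[6], re_row[6], im_row[6];
        for (int j = 0; j < 3; j++) {
            double mr = m->e[i][j].real, mi = m->e[i][j].imag;
            xv[2 * j] = x->c[j].real;
            xv[2 * j + 1] = x->c[j].imag;
            re_row[2 * j] = mr;      /* Re(m x) = mr*xr - mi*xi */
            re_row[2 * j + 1] = -mi;
            im_row[2 * j] = mi;      /* Im(m x) = mi*xr + mr*xi */
            im_row[2 * j + 1] = mr;
        }
        y->c[i].real = dot6(re_row, xv);
        y->c[i].imag = dot6(im_row, xv);
    }
}
```

For example, with m set to the identity matrix, mat_vec_mult(&m, &x, &y) copies x into y.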
Relationship to NERSC Workload
MILC is widely used in the physics community and has a large allocation of resources on NERSC systems. It supports research that addresses fundamental questions in high energy and nuclear physics.
The primary parallel programming model for MILC is a 4-D domain decomposition, with each MPI process assigned an equal number of sublattices of contiguous sites. In four dimensions each site has eight nearest neighbors, one forward and one backward in each direction.
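The site bookkeeping behind such a decomposition can be sketched as below. It is a hypothetical illustration assuming a simple lexicographic site ordering, periodic boundaries, and a process grid P that divides each lattice dimension evenly; MILC's own layout routines are not reproduced here, and the names site_index, neighbor, and owner_rank are made up for this sketch.

```c
#include <assert.h>

enum { NDIM = 4 };

/* Lexicographic index of site x on a lattice of extent L, with x[0] varying fastest. */
static long site_index(const int x[NDIM], const int L[NDIM]) {
    long idx = 0;
    for (int d = NDIM - 1; d >= 0; d--)
        idx = idx * L[d] + x[d];
    return idx;
}

/* Coordinates of the neighbor one step forward (step = +1) or backward (step = -1)
   in direction dir, with periodic wrap-around. */
static void neighbor(const int x[NDIM], const int L[NDIM], int dir, int step, int out[NDIM]) {
    assert(dir >= 0 && dir < NDIM && (step == 1 || step == -1));
    for (int d = 0; d < NDIM; d++)
        out[d] = x[d];
    out[dir] = (x[dir] + step + L[dir]) % L[dir];
}

/* Rank owning site x when the lattice is cut into equal blocks on process grid P. */
static int owner_rank(const int x[NDIM], const int L[NDIM], const int P[NDIM]) {
    int rank = 0;
    for (int d = NDIM - 1; d >= 0; d--) {
        assert(L[d] % P[d] == 0); /* equal-sized sublattices */
        int block = L[d] / P[d];
        rank = rank * P[d] + x[d] / block;
    }
    return rank;
}
```

With L = {8, 8, 16, 16} and P = {1, 1, 2, 2}, for example, each of four processes owns an 8 x 8 x 8 x 8 block, and calling neighbor for each of the four directions with step +1 and -1 yields the eight nearest neighbors of a site.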
MILC is normally used in a weak-scaling mode, and the four input files supplied with this distribution implement this.
The entire code as used in the NERSC-5 procurement, with all data files and instructions, is available here (gzip tar file).
If your compiler is not ANSI compliant, try using the GNU C compiler, gcc, instead. Note that if the library code is compiled with gcc, the application directory code must also be compiled with gcc, and vice versa. This is because gcc understands prototypes and some other C compilers don't, and they therefore pass float arguments differently. We recommend gcc. The code can also be compiled with a C++ compiler, although it uses no exclusively C++ constructs.
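The float-passing problem comes from C's default argument promotions: when a function is called without a prototype in scope, a float argument is converted to double before the call, while a callee compiled with a prototype may expect a float. Unprototyped declarations are awkward to demonstrate portably, so this sketch shows the same promotion rule through a variadic function, where it always applies; first_vararg is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdarg.h>

/* Arguments matching "..." receive the default argument promotions,
   exactly as arguments to an unprototyped function do: a float arrives as a double. */
static double first_vararg(int count, ...) {
    va_list ap;
    assert(count >= 1);
    va_start(ap, count);
    double v = va_arg(ap, double); /* va_arg(ap, float) would be undefined behavior */
    va_end(ap);
    return v;
}
```

Calling first_vararg(1, 1.5f) returns 1.5, because the float was passed as a double.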
|generic||Make_template||Included by others; do not modify|
|generic_ks||Make_template||Included by others; do not modify|
|generic_ks||Make_sample||Included by others; do not modify|
|include||Make_template||Included by others; do not modify|
|ks_imp_dyn2||Make_template||Included by others; do not modify|
|ks_imp_dyn2||Make_sample||For generating sample output from trusted code; do not modify|
|ks_imp_dyn2||Make_test||For testing code by comparing test output with sample output; do not modify|
|ks_imp_dyn2||Make_time||Do not use/modify|
|ks_imp_dyn2||Make_vanilla||Do not use/modify|
|Any of these can be modified and used.|
Note regarding Makefiles. In the libraries directory you will find eight makefiles of the form Make_XXXX. Apart from Make_vanilla and Make_template, these are intended for specific architectures for which some assembly-coded routines exist. Make_vanilla builds the pure C versions and should work on all architectures with a suitable choice of the Makefile variables CFLAGS and CC. Make_template defines rules and macros common to all architectures and is intended to be included by the other Makefiles. It should not and cannot be used by itself!
The following simple steps are required to build the code.
- Determine whether you should add the preprocessor definition "-DNATIVEDOUBLE" to the makefiles in both the libraries and ks_imp_dyn2 subdirectories.
- cd to the libraries subdirectory and edit a makefile (Make_*) similar to the target platform, changing the compiler and flags as appropriate. Type "make -f Make_<target> su3.a complex.a", which produces two library archives. The library complex.a contains routines for operations on complex numbers; see `complex.h' for a summary. The library su3.a contains routines for operations on SU(3) matrices, 3-element complex vectors, and Wilson vectors (12-element complex vectors); see `su3.h' for a summary. None of the library routines involve communication, so a sequential compiler (i.e., one not invoked through MPI "wrapper" scripts) can be used. The build of these libraries is entirely contained within the libraries subdirectory.
- cd to the ../ks_imp_dyn2 subdirectory and edit a makefile (Make_*) similar to the target platform. A suitable communication object file, usually "com_mpi.o", must be selected for the MACHINE_DEP macro. Type "make -f Make_<target> su3_rmd". This build uses code in three different directories: ks_imp_dyn2, generic_ks, and generic. Perhaps the most important of these files is "com_mpi.c" in the generic subdirectory, which contains all the MPI interfaces.
- Copy the executable, "su3_rmd" to the "benchmark" directory.
The build does not automatically detect the operating system.
Assembler code for some routines and some CPUs is provided. In each case a ".c" file of the same name contains a C routine which should work identically.
|P3 P4 SSE NASM||x86 + SSE with Netwide Assembler|
There are too many files to list them all here. "Main" is in ks_imp_dyn2/control.c. MPI_Init is in generic/com_mpi.c.
Benchmark code MILC must be run using 64-bit floating-point precision for all non-integer data variables.
The concurrency simply equals the number of MPI tasks. Computational nodes employed in the benchmark must be fully packed; that is, the number of processes or threads executing must equal the number of physical processors on the node.
Invoke the application by typing something like:
mpirun -np <#tasks> su3_rmd < size.in
where size is small, medium, large, or xl. In other words, the input file must be redirected to standard input. The exact execution line depends on the system.
Timing of the NERSC benchmark code MILC is done via the dclock and dclock_cpu function calls made in the ks_imp_dyn2/control.c source file and defined in the generic/com_mpi.c source file. These routines return times obtained from gettimeofday and clock, respectively. For the medium and large cases, extract the elapsed run time from the line labelled "NERSC_TIME." For the benchmarks of interest to NERSC, only one timing harness is used; i.e., no subportions of the code are instrumented.
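A minimal sketch of such a pair of timers is shown below, assuming a POSIX system. The functions sketch_dclock and sketch_dclock_cpu are hypothetical stand-ins written for this illustration; they are not copied from generic/com_mpi.c, whose dclock and dclock_cpu may differ in detail. The first returns wall-clock seconds from gettimeofday, the second returns processor time from clock divided by CLOCKS_PER_SEC.

```c
#include <assert.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical stand-in for dclock(): wall-clock seconds from gettimeofday. */
static double sketch_dclock(void) {
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);
    assert(rc == 0);
    (void)rc; /* avoid an unused-variable warning when NDEBUG is defined */
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/* Hypothetical stand-in for dclock_cpu(): processor time from clock(). */
static double sketch_dclock_cpu(void) {
    return (double)clock() / (double)CLOCKS_PER_SEC;
}
```

The elapsed wall-clock time for a region is the difference between two sketch_dclock() calls placed around it.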
Approximate Memory Required (Per MPI Process) By The Sample Problems:
|Small||To be determined|
|Medium||To be determined|
|Large||To be determined|
|XL||To be determined|
The minimum memory configuration required to run the problems in each configuration must be reported (OS + buffers + code + data + ...).
There is a subdirectory "benchmark" in which input data files and sample batch submission scripts are located. Four sample input files for four different run sizes are provided: "small," "medium," "large," and "xl" (extra-large). The small case should be used for testing. Benchmark timings are required for the medium, large, and xl cases. Each case must be executed at the fixed concurrency given in the table below.
|Problem Size||Concurrency||Lattice Size|
|small||4||8 x 8 x 16 x 16|
|medium||64||32 x 32 x 32 x 32|
|large||256||64 x 64 x 64 x 64|
|xl||2048||128 x 128 x 128 x 128|
Nothing in the four input files provided should be changed by the vendor. The intent of these four input decks is not to gauge scalability but to obtain timing data for the four distinct concurrencies. Because of the numerics, the number of CG iterations per time step grows as the problem size grows.
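Dividing the total number of lattice sites by the concurrency gives the number of sites each MPI task holds for each case in the problem-size table. The helper below (sites_per_task is a hypothetical name) performs that arithmetic and assumes the sites divide evenly among the tasks.

```c
#include <assert.h>

/* Number of lattice sites held by each MPI task, assuming an even split. */
static long long sites_per_task(const int dims[4], int ntasks) {
    long long sites = 1;
    for (int d = 0; d < 4; d++)
        sites *= dims[d];
    assert(ntasks > 0 && sites % ntasks == 0);
    return sites / ntasks;
}
```

For the table's entries it gives 4096 sites per task for small, 16384 for medium, 65536 for large, and 131072 for xl.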
A sample script "checkout" in the "benchmark" subdirectory is provided to verify the correctness of the results by comparing the fermion action on the last "PBP:" line of the output file. The script has "correct" values hard-coded in it for the small, medium, and large cases and prints either "OK" or "Failed." The calculated value must differ from the "correct" value by less than 1.e-5 in order to pass. Run the script with: checkout <output_file>
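The pass/fail rule used by checkout can be sketched in C as follows. This is not the checkout script itself: the function names are hypothetical, the sketch only locates the last line beginning with "PBP:" and applies the 1.e-5 tolerance, and it assumes the caller has already extracted the fermion action from that line, since the field layout of the "PBP:" line is not described here.

```c
#include <assert.h>
#include <string.h>

/* Return the start of the last line of text that begins with "PBP:", or NULL if none does. */
static const char *last_pbp_line(const char *text) {
    assert(text != NULL);
    const char *found = NULL;
    const char *line = text;
    while (line != NULL && *line != '\0') {
        if (strncmp(line, "PBP:", 4) == 0)
            found = line;
        const char *nl = strchr(line, '\n');
        line = (nl != NULL) ? nl + 1 : NULL;
    }
    return found;
}

/* Pass/fail rule: the calculated value must differ from the reference by less than 1.e-5. */
static int within_tolerance(double calculated, double correct) {
    double diff = calculated - correct;
    if (diff < 0.0)
        diff = -diff;
    return diff < 1.0e-5;
}
```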
- This is MIMD Version 6
[MILC] MIMD Lattice Computation (MILC) Collaboration http://www.physics.indiana.edu/~sg/milc.html
[HOLMGREN] Performance of MILC Lattice QCD Code on Commodity Clusters http://lqcd.fnal.gov/badHonnef.pdf