The NERSC PARATEC README File
- Code Description
- Obtaining the Code
- Building the Code
- Files in this distribution
- Running the Code
- Timing Issues
- Storage Issues
- Required Runs
- Verifying Results
- Modification Record
- Record of Formal Questions and Answers
PARATEC: Parallel Total Energy Code
The benchmark code PARATEC (PARAllel Total Energy Code) performs ab-initio quantum-mechanical total energy calculations using pseudopotentials and a plane wave basis set. Total energy minimization of electrons is done with a self-consistent field (SCF) method. Force calculations are also done to relax the atoms into equilibrium positions. PARATEC uses an all-band (unconstrained) conjugate gradient (CG) approach to solve the Kohn-Sham equations of Density Functional Theory (DFT) and obtain the ground-state electron wavefunctions. In solving the Kohn-Sham equations using a plane wave basis, part of the calculation is carried out in Fourier space, where the wavefunction for each electron is represented by a sphere of points, and the remainder is in real space. Specialized parallel three-dimensional FFTs are used to transform the wavefunctions between real and Fourier space and a key optimization in PARATEC is to transform only the non-zero grid elements. [PAR1]
The code can use optimized libraries for both basic linear algebra and Fast Fourier Transforms, but due to its global communication requirements, architectures with a poor balance between bisection bandwidth and computational rate will suffer performance degradation at higher concurrencies on PARATEC. [Oliker] Nevertheless, due to favorable scaling, high computational intensity and other optimizations, the code generally achieves a high percentage of peak performance on both superscalar and vector systems. [PAR2]
In the benchmark problems supplied here 11 conjugate gradient iterations are performed; however, a real run would typically do between 20 and 60.
PARATEC consists of about 50,000 lines of Fortran90 code. Preprocessing via m4 is used to include machine-specific routines such as the FFT calls. The version supplied uses MPI although it can also be built for a single-processor run and for SHMEM, if available.
The code typically spends about 30% of its time in BLAS3 routines and 30% in the one-dimensional FFTs on which the three-dimensional FFTs are built. The remainder of the time is spent in various other Fortran90 routines. Vendor libraries such as IBM's ESSL can be used for both the linear algebra and Fourier transform routines. However, PARATEC includes code that does 3-D FFTs via three sets of hand-written one-dimensional FFTs. Many FFTs are done at the same time to avoid latency issues and only non-zero elements are communicated/calculated; thus, these routines can be faster than vendor-supplied routines. Additional libraries used are ScaLAPACK and BLACS.
A list of files that are preprocessed and may require modification (although unlikely) is here.
NOTE Concerning Use of FFTW on 64-bit address platforms
When using the FFTW library on machines that have 64-bit addresses (e.g., AMD Opteron) you must change the Fortran90 declaration for two variables in file fft_macros.m4h in the subdirectory src/macros/fft. The statement

INTEGER FFTW_PLAN_BWD, FFTW_PLAN_FWD

must be changed to

INTEGER*8 FFTW_PLAN_BWD, FFTW_PLAN_FWD

Otherwise, the program will compile but fail with a segmentation violation in the FFTW call.
Relationship to NERSC Workload
A recent survey of NERSC ERCAP requests for materials science applications showed that Density Functional Theory (DFT) codes similar to PARATEC accounted for nearly 80% of all HPC cycles delivered to the Materials Science community. Supported by DOE BES, PARATEC is an excellent proxy for the application requirements of the entire Materials Science community. PARATEC simulations can also be used to predict nuclear magnetic resonance shifts. The overall goal is to simulate the synthesis and predict the properties of multi-component nanosystems.
Each electron in a plane wave simulation is represented by a grid of points from which the wavefunction is constructed. Parallel decomposition of such a problem can be over n(g), the number of grid points per electron (typically O(100,000) per electron), n(i), the number of electrons (typically O(800) per system simulated), or n(k), a number of sampling points (O(1-10)).
PARATEC uses MPI and parallelizes over grid points, thereby achieving a fine-grain level of parallelism. In Fourier space each electron's wavefunction grid forms a sphere. The figure below depicts a visualization of the parallel data layout on three processors. Each processor holds several columns, which are lines along the z-axis of the FFT grid. Load balancing is important because much of the compute-intensive part of the calculation is carried out in Fourier space. To get good load balancing, the columns are first sorted in descending length order and then assigned, one at a time, to the processor containing the fewest points.
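The column-assignment heuristic described above can be sketched in a few lines. This is an illustrative reimplementation, not code taken from the PARATEC source; `assign_columns` and its data layout are invented for the example.

```python
import heapq

def assign_columns(column_lengths, nproc):
    """Greedy load balancing: visit columns in descending length order
    and always give the next column to the processor that currently
    holds the fewest grid points."""
    # min-heap of (points_held, processor_rank)
    heap = [(0, p) for p in range(nproc)]
    heapq.heapify(heap)
    assignment = {}  # column index -> processor rank
    for col in sorted(range(len(column_lengths)),
                      key=lambda c: column_lengths[c], reverse=True):
        points, proc = heapq.heappop(heap)
        assignment[col] = proc
        heapq.heappush(heap, (points + column_lengths[col], proc))
    return assignment

# Example: columns of length 9, 7, 5, 3 on two processors end up
# perfectly balanced, 12 points each.
assignment = assign_columns([9, 7, 5, 3], 2)
```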
The real-space data layout of the wavefunctions is on a standard Cartesian grid, where each processor holds a contiguous part of the space arranged in columns, shown in Figure 4b. Custom three-dimensional FFTs transform between these two data layouts. Data are arranged in columns as the three-dimensional FFT is performed, by taking one-dimensional FFTs along the Z, Y, and X directions with parallel data transposes between each set of one-dimensional FFTs. These transposes require global interprocessor communication and represent the most important impediment to high scalability.
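The decomposition of a three-dimensional FFT into three sets of one-dimensional FFTs can be verified in a few lines. The serial NumPy sketch below omits the parallel transposes, which in PARATEC move the data between the column layouts described above; it only checks the mathematical equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = (rng.standard_normal((8, 8, 8))
        + 1j * rng.standard_normal((8, 8, 8)))

# One-dimensional FFTs along Z, then Y, then X; in the parallel code a
# data transpose (all-to-all communication) sits between each set.
step = np.fft.fft(grid, axis=2)   # Z direction
step = np.fft.fft(step, axis=1)   # Y direction
step = np.fft.fft(step, axis=0)   # X direction

# The result matches a direct three-dimensional transform.
assert np.allclose(step, np.fft.fftn(grid))
```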
The FFT portion of the code scales approximately as n²log(n) and the dense linear algebra portion scales approximately as n³; therefore, the overall computation-to-communication ratio scales as n, where n is the number of atoms in the simulation.
For NERSC-related procurements please visit the procurement site.
To obtain the code visit the Paratec web site and see the information for intended users of PARATEC there. Registration is required.
You can download the NERSC-5 Paratec benchmark input data files here (gzip tar file, 100KB).
First, ./configure needs to be run in the main PARATEC directory. If you run it with no options you will be presented with a list of machine environments for which settings are already defined and you will be asked to select one. After that you will be asked two additional questions with obvious choices, i.e., whether to use MPI (as opposed to a single-processor build) and whether to use SCALAPACK (as opposed to Fortran90 routines).
The result of running the configure script is a file called conf.mk in the current working directory.
To add a new machine to the list of configure choices go into the directory ./config and generate a file called sysvar.machine_name. You can also create a "blurb.machine_name" file that contains a comment printed when ./configure is run. You also need to edit the configure script itself, adding a new literal string in the "select i" at line 26 and a new case (lines 27-45) where the values of both machine and platform are set to the "machine_name" you used for your sysvar.machine_name file.
Note that each time you modify sysvar.machine_name you must rerun ./configure.
The machines for which config files are provided are listed below.
|IBM RS6000||rs6k||Single-node IBM RS6000|
|IBM SP2||sp2||Glenn Seaborg at NERSC|
|IBM SP3 (NERSC)||sp3||Glenn Seaborg at NERSC|
|IBM SP3 (NPACI)||sp3||Blue Horizon in San Diego|
|Cray T3E||t3e||Cray T3E at NERSC|
|SGI o200||sgio200||SGI Origin 200|
|SGI o2000||sgio2000||SGI Origin 2000 at NCSA|
|SGI PowerChallenge||sgipc||SGI PowerChallenge at NCSA|
|Linux x86||I386||Linux on x86|
|DEC Alpha||alpha||DEC Alpha|
|Hitachi SR2201||sr2201||Hitachi SR2201 at Cambridge|
|SGI o2000||hodgekin||SGI Origin 2000 using new libraries|
After configure is finished, type "make paratec" to build the executable. For each machine, a working directory in the source tree is created in the subdirectory paratec/src/para/machine_name. If the build is successful the executable (paratec.mpi) will be placed in the paratec/bin/machine_name directory.
"Main" and MPI_Init are in src/para/main.f90p. A list of all files with a brief description of each may be found in the file doc/doc_purpose.
Macro files have ".m4h" suffixes; Fortran90 files that don't need preprocessing have ".f90" suffixes; Fortran files that do need preprocessing have ".f90p" suffixes and are preprocessed into ".p.f90" files.
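The suffix convention above amounts to a simple mapping from source file to the name the compiler actually sees. The helper below is purely illustrative and is not part of the PARATEC build system.

```python
def preprocess_target(filename):
    """Illustrative sketch of the suffix convention: ".f90p" sources
    are run through m4 to produce ".p.f90" files, while plain ".f90"
    sources are compiled as-is."""
    if filename.endswith(".f90p"):
        return filename[:-len(".f90p")] + ".p.f90"
    return filename

# Example: the main program source main.f90p is preprocessed
# into main.p.f90 before compilation.
target = preprocess_target("main.f90p")
```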
A description of the directory structure is given in the following table.
|para||main executable, paratec; subdirectory $machine contains object files|
|tools||the sources for several tools: kbascbin (pseudopotential from ascii to binary) and kbbinasc (pseudopotential from binary to ascii); subdirectory $machine contains object files|
|microbench||a synthetic benchmark to get a performance estimate for the platform-specific FFT and matrix-matrix multiplies; subdirectory $machine contains object files|
|shared||code shared between microbench and para (fft_opcount.f90 and some system functions); subdirectory $machine contains object files|
|macros||m4 macro files that cover part of the platform specifics|
|perllib||perl scripts for modifying input files and retrieving data from the output files|
A description of some important files follows.
|Makefile.mk||An extension of the Makefile|
|conf.mk||Created as output from running ./configure|
|configure||script that creates the conf.mk file with machine and option dependent preprocessor variable definitions|
|vars.mk||contains some preprocessor variables needed by the Makefile|
Some additional information for navigating around the code is as follows. The file "constants.f90" contains all the basic Fortran90 data type declarations, e.g., all real variables in the code use the parameter dp = kind(1.0d0). The file "structures.f90" contains the data declarations for most of the important structures and modules in the code.
The concurrency simply equals the number of MPI tasks. Computational nodes employed in the benchmark must be fully packed, that is, the number of processes or threads executing must be equal to the number of physical processors on the node.
Invoke the application by typing, for example,

mpirun -np #tasks paratec.mpi
Paratec expects two files, "input" and "Si_POT.DAT", in the directory in which it is executing. Copy the file "input.<size>" to "input" for the <size> required.
The important output file is "OUT". The last line contains the time for the run.
The code is heavily instrumented for timing; the timer is called "gimmetime" and it is defined in one of the system-specific source files src/shared/ze_<machine_name>.f90. Note: the timing harness of interest is the one that produces the output string "NERSC_TIME"; it times the main loop in file pwmain.f90p. The intention is to measure elapsed (wallclock) time.
Memory Required By The Sample Problems:
|Problem size||small||medium||large|
|Memory||0.256 GB (from LoadLeveler)||1.25 GB (from LoadLeveler)||2.0 GB (from LoadLeveler)|
The directory "benchmark" contains input for 3 problem sizes, "input.<size>", where <size> is "small", "medium", or "large". There are also corresponding sample output files, "OUT.<size>". The small case is only used for porting and debugging. Each problem size must be executed with a fixed concurrency as specified below. The intent of these decks is not to gauge scalability but to obtain timing data for the three distinct concurrencies.
All runs simulate silicon in the diamond structure.
A typical calculation might require between 20 and 60 CG iterations to converge the charge density.
There is a subdirectory "benchmark" in which input data files, reference output files, and sample batch submission scripts are located. Note that PARATEC must be executed with "fully-packed" nodes, i.e., the number of processes or threads employed on each node should equal the number of physical processors available on the node.
As many as seven different output files may be produced from the run, only one of which, OUT, is important for benchmarking purposes. A verification script, "checkout", is provided with the distribution to determine correctness of the run by comparing "OUT" with the reference "OUT.<size>." The "OUT" files for the medium and large cases should be provided to NERSC to verify the results.
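The supplied "checkout" script performs the actual verification. As a rough, hypothetical illustration of tolerance-based comparison of numeric output, the sketch below extracts every floating-point number from two text blobs and compares them within a relative tolerance; the line format and field selection are invented, not PARATEC's.

```python
import re

def close_enough(out_text, ref_text, rtol=1e-6):
    """Compare every floating-point number found in two output texts
    within a relative tolerance. Purely illustrative; the real
    checkout script knows which OUT fields matter."""
    num = re.compile(r"-?\d+\.\d+(?:[EeDd][+-]?\d+)?")
    # Fortran output may use D exponents; normalize them for float().
    a = [float(x.replace("D", "E").replace("d", "e"))
         for x in num.findall(out_text)]
    b = [float(x.replace("D", "E").replace("d", "e"))
         for x in num.findall(ref_text)]
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= rtol * max(1.0, abs(y))
               for x, y in zip(a, b))
```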
Additionally, the configuration file ("sysvar.machine_name") used and a complete log of the build process should also be returned to NERSC for verification.
- This is PARATEC Release 5.1.13b1
[PAR1] Paratec web page http://www1.nersc.gov/projects/paratec/
[PAR2] "Scaling First Principles Materials Science Codes to Thousands of Processors." crd.lbl.gov/DOEresources/SC04/Canning_Nanoscience_SC04.pdf