
The NERSC PARATEC README File

 Table of contents

  Code Description
  Obtaining the Code
  Building the Code
  Build-Related Files in this distribution
  Running the Code
  Timing Issues
  Storage Issues
  Required Runs
  Verifying Results
  Modification Record
  Record of Formal Questions and Answers
  Bibliography

 Code Description

PARATEC: Parallel Total Energy Code

General Description

The benchmark code PARATEC (PARAllel Total Energy Code) performs ab initio quantum-mechanical total energy calculations using pseudopotentials and a plane wave basis set. Total energy minimization of the electrons is done with a self-consistent field (SCF) method, and force calculations are also done to relax the atoms into their equilibrium positions. PARATEC uses an all-band (unconstrained) conjugate gradient (CG) approach to solve the Kohn-Sham equations of Density Functional Theory (DFT) and obtain the ground-state electron wavefunctions. In solving the Kohn-Sham equations using a plane wave basis, part of the calculation is carried out in Fourier space, where the wavefunction of each electron is represented by a sphere of points, and the remainder in real space. Specialized parallel three-dimensional FFTs are used to transform the wavefunctions between real and Fourier space, and a key optimization in PARATEC is to transform only the non-zero grid elements. [PAR1]

The code can use optimized libraries for both basic linear algebra and Fast Fourier Transforms, but because of its global communication requirements, architectures with a poor balance between bisection bandwidth and computational rate will suffer performance degradation at higher concurrencies. [Oliker] Nevertheless, owing to favorable scaling, high computational intensity, and other optimizations, the code generally achieves a high percentage of peak performance on both superscalar and vector systems. [PAR2]

In the benchmark problems supplied here 11 conjugate gradient iterations are performed; however, a real run would typically do between 20 and 60.

Coding

PARATEC consists of about 50,000 lines of Fortran90 code. Preprocessing via m4 is used to include machine-specific routines such as the FFT calls. The version supplied uses MPI although it can also be built for a single-processor run and for SHMEM, if available.

The code typically spends about 30% of its time in BLAS3 routines and 30% in the one-dimensional FFTs on which the three-dimensional FFTs are built. The remainder of the time is spent in various other Fortran90 routines. Vendor libraries such as IBM's ESSL can be used for both the linear algebra and Fourier transform routines. However, PARATEC includes code that does 3-D FFTs via three sets of hand-written one-dimensional FFTs. Many FFTs are done at the same time to avoid latency issues and only non-zero elements are communicated/calculated; thus, these routines can be faster than vendor-supplied routines. Additional libraries used are ScaLAPACK and BLACS.

The files that are preprocessed (and that may, although it is unlikely, require modification) can be identified by their suffixes; see "Build-Related Files in this distribution" below.


NOTE Concerning Use of FFTW on 64-bit address platforms

When using the FFTW library on machines that have 64-bit addresses (e.g., AMD Opteron) you must change the Fortran90 declaration of two variables in the file fft_macros.m4h in the subdirectory src/macros/fft. The statement

 INTEGER FFTW_PLAN_BWD, FFTW_PLAN_FWD

must be changed to

 INTEGER*8 FFTW_PLAN_BWD, FFTW_PLAN_FWD

because these variables hold FFTW plans, which are C pointers and therefore 8 bytes wide on such platforms. Otherwise, the program will compile but fail with a segmentation violation in the FFTW call.

Authorship

See http://www1.nersc.gov/projects/paratec/DOC/

Relationship to NERSC Workload

A recent survey of NERSC ERCAP requests for materials science applications showed that Density Functional Theory (DFT) codes similar to PARATEC accounted for nearly 80% of all HPC cycles delivered to the materials science community. Supported by DOE BES, PARATEC is therefore an excellent proxy for the application requirements of that community. PARATEC simulations can also be used to predict nuclear magnetic resonance shifts; the overall goal is to simulate the synthesis and predict the properties of multi-component nanosystems.

Parallelization

Each electron in a plane wave simulation is represented by a grid of points from which the wavefunction is constructed. Parallel decomposition of such a problem can be over n(g), the number of grid points per electron (typically O(100,000) per electron), over n(i), the number of electrons (typically O(800) per simulated system), or over n(k), the number of k-points used to sample the Brillouin zone (typically 1-10).

PARATEC uses MPI and parallelizes over grid points, thereby achieving a fine-grained level of parallelism. In Fourier space each electron's wavefunction grid forms a sphere. The figure below depicts the parallel data layout on three processors: each processor holds several columns, which are lines along the z-axis of the FFT grid. Load balancing is important because much of the compute-intensive part of the calculation is carried out in Fourier space. To obtain good load balancing, the columns are sorted in descending order of length and then assigned, one at a time, to the processor currently containing the fewest points.
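
As a rough illustration of this load-balancing heuristic (a sketch only, not taken from the PARATEC source; the function name and column data are hypothetical), the assignment can be written as:

  # Greedy column assignment: longest columns first, each column goes to the
  # processor that currently holds the fewest grid points.
  def assign_columns(column_lengths, nprocs):
      loads = [0] * nprocs              # grid points held by each processor
      owner = {}                        # column index -> processor rank
      for col in sorted(range(len(column_lengths)),
                        key=lambda c: column_lengths[c], reverse=True):
          p = loads.index(min(loads))   # processor with the fewest points so far
          owner[col] = p
          loads[p] += column_lengths[col]
      return owner, loads

  # Example: ten columns of varying length distributed over three processors
  owner, loads = assign_columns([12, 9, 9, 7, 6, 5, 4, 3, 2, 1], 3)
  print(loads)                          # per-processor totals come out nearly equal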

The real-space data layout of the wavefunctions is a standard Cartesian grid, where each processor holds a contiguous part of the space arranged in columns (also shown in the figure below). Custom three-dimensional FFTs transform between these two data layouts. The data remain arranged in columns while the three-dimensional FFT is performed by taking one-dimensional FFTs along the Z, Y, and X directions, with parallel data transposes between each set of one-dimensional FFTs. These transposes require global interprocessor communication and are the most important impediment to high scalability.

Figure 1: Parallel data layout of the wavefunctions on three processors, in Fourier space (spheres of points held as columns along the z-axis) and in real space (a standard Cartesian grid).
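
As a serial illustration of this transform (written here in Python/NumPy purely as a sketch; PARATEC itself uses hand-written Fortran90 routines on distributed data, where the transposes become interprocessor exchanges), a 3-D FFT built from batched 1-D FFTs with transposes in between looks like this:

  import numpy as np

  def fft3d_by_1d(a):
      # a: complex array of shape (nx, ny, nz) in real space
      a = np.fft.fft(a, axis=2)     # batched 1-D FFTs along z
      a = a.transpose(0, 2, 1)      # transpose so y becomes the last axis
      a = np.fft.fft(a, axis=2)     # batched 1-D FFTs along y
      a = a.transpose(2, 1, 0)      # transpose so x becomes the last axis
      a = np.fft.fft(a, axis=2)     # batched 1-D FFTs along x
      return a.transpose(2, 0, 1)   # restore (kx, ky, kz) axis order

  # The composed transform agrees with a direct 3-D FFT
  a = np.random.rand(8, 6, 4) + 1j * np.random.rand(8, 6, 4)
  assert np.allclose(fft3d_by_1d(a), np.fft.fftn(a))

Batching the 1-D FFTs over whole planes of columns in this way is also what lets the parallel code amortize message latency, as noted in the Coding section above.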

The FFT portion of the code scales approximately as n²log(n) and the dense linear algebra portion scales approximately as n³, where n is the number of atoms in the simulation. Because the computation is dominated by the n³ linear algebra while most of the communication occurs in the n²log(n) FFT transposes, the overall computation-to-communication ratio scales approximately as n.


 Obtaining the Code

For NERSC-related procurements please visit the procurement site.

To obtain the code visit the Paratec web site and see the information for intended users of PARATEC there. Registration is required.

You can download the NERSC-5 PARATEC benchmark input data files here (gzip tar file, 100 KB).



 Building the Code

First, ./configure needs to be run in the main PARATEC directory. If you run it with no options, you will be presented with a list of machine environments for which settings are already defined and asked to select one. You will then be asked two additional questions with obvious choices: whether to use MPI (as opposed to a single-processor build) and whether to use ScaLAPACK (as opposed to Fortran90 routines).

The result of running the configure script is a file called conf.mk in the current working directory.

To add a new machine to the list of configure choices, go into the directory ./config and create a file called sysvar.machine_name. You can also create a "blurb.machine_name" file containing a comment that is printed when ./configure is run. You also need to edit the configure script itself, adding a new literal string to the "select i" statement at line 26 and a new case (lines 27-45) in which the values of both machine and platform are set to the "machine_name" you used for your sysvar.machine_name file.

Note that each time you modify sysvar.machine_name you must rerun ./configure.
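
For example (a hedged sketch only: "mymachine" is a hypothetical name, and copying the sp3 settings is just one possible starting point that must then be edited to match your compilers and libraries):

  cd config
  cp sysvar.sp3 sysvar.mymachine
  cd ..
  (edit ./configure to add "mymachine" to the "select i" list and as a new case, then rerun ./configure)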

The machines for which config files are provided are listed below.

Table 1.2: Machines currently supported

  Machine               machine_name   Description
  Sun                   sun            Sun
  IBM RS6000            rs6k           Single-node IBM RS6000
  IBM SP2               sp2            Glen Seaborg at NERSC
  IBM SP3 (NERSC)       sp3            Glen Seaborg at NERSC
  IBM SP3 (NPACI)       sp3            Blue Horizon in San Diego
  Cray T3E              t3e            Cray T3E at NERSC
  SGI o200              sgio200        SGI Origin 200
  SGI o2000             sgio2000       SGI Origin 2000 at NCSA
  SGI PowerChallenge    sgipc          SGI PowerChallenge at NCSA
  HP/Convex Exemplar    cvx
  Linux x86             I386           Linux on x86
  DEC Alpha             alpha          DEC Alpha
  Hitachi SR2201        sr2201         Hitachi SR2201 at Cambridge
  SGI o2000             hodgekin       SGI Origin 2000 using new libraries

After configure has finished, type "make paratec" to build the executable. For each machine, a working directory is created in the source tree in the subdirectory paratec/src/para/machine_name. If the build is successful, the executable (paratec.mpi) is placed in the paratec/bin/machine_name directory.


 Build-Related Files in this distribution

"Main" and MPI_Init are in src/para/main.f90p. A list of all files with a brief description of each may be found in the file doc/doc_purpose.

Macro files have ".m4h" suffixes; Fortran90 files that do not need preprocessing have ".f90" suffixes; Fortran90 files that do need preprocessing have ".f90p" suffixes and are preprocessed into ".p.f90" files.

A description of the directory structure is given in the following table.

  Directory    Description
  para         main executable, paratec
               (the subdirectory $machine contains object files and files produced by preprocessing)
  tools        sources for several tools, such as
               - kbascbin (pseudopotential from ASCII to binary)
               - kbbinasc (pseudopotential from binary to ASCII)
               (the subdirectory $machine contains object files)
  microbench   a synthetic benchmark that gives a performance estimate for the platform-specific FFT and matrix-matrix multiplies
               (the subdirectory $machine contains object files)
  shared       code shared between microbench and para: fft_opcount.f90 and some system functions
               (the subdirectory $machine contains object files)
  macros       m4 macro files that cover part of the platform specifics
  perllib      perl scripts for modifying input files and retrieving data from the output files

A description of some important files follows.

  File         Description
  Makefile.mk  an extension of the Makefile
  conf.mk      created as output from running ./configure
  configure    script that creates the conf.mk file with machine- and option-dependent preprocessor variable definitions
  vars.mk      contains some preprocessor variables needed by the Makefile

Some additional information for navigating the code is as follows. The file "constants.f90" contains all the basic Fortran90 data type declarations; e.g., all real variables in the code use the parameter dp = kind(1.0d0). The file "structures.f90" contains the data declarations for most of the important structures and modules in the code.


 Running the Code

The concurrency simply equals the number of MPI tasks. Computational nodes employed in the benchmark must be fully packed; that is, the number of processes or threads executing must equal the number of physical processors on the node.

Invoke the application by typing, for example,

 mpirun -np #tasks paratec.mpi
or
 poe paratec.mpi

PARATEC expects two files, "input" and "Si_POT.DAT", in the directory in which it is executing. Copy the file "input.<size>" to "input" for the <size> required.
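
For example, a medium-size run (which uses 64 MPI tasks; see Required Runs below), with "Si_POT.DAT" already present in the working directory, might be launched as

  cp input.medium input
  mpirun -np 64 paratec.mpi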

The important output file is "OUT." The last line contains the time for the run.


 Timing Issues

The code is heavily instrumented for timing. The timer routine is called "gimmetime" and is defined in one of the system-specific source files src/shared/ze_<machine_name>.f90. Note: the timing harness of interest is the one that produces the output string "NERSC_TIME"; it times the main loop in the file pwmain.f90p. The intention is to measure elapsed (wall-clock) time.


 Storage Issues

Memory required by the sample problems (as reported by LoadLeveler):

            small      medium    large
  Memory    0.256 GB   1.25 GB   2.0 GB



 Required Runs

The directory "benchmark" contains input for 3 problem sizes, "input.", where is "small", "medium" and "large". There are also corresponding sample output files, "OUT.". The small case is only used for porting and debugging. Each problem size must be executed with a fixed concurrency as specified below. The intent of these decks is not to gauge scalability but to obtain timing data for the three distinct concurrencies.

All runs simulate silicon in the diamond structure.

               small   medium   large
  #atoms       16      250      686
  Concurrency  4       64       256

A typical calculation might require between 20 and 60 CG iterations to converge the charge density.

The "benchmark" subdirectory also contains sample batch submission scripts along with the input data files and reference output files. Note that PARATEC must be executed with "fully packed" nodes, i.e., the number of processes or threads employed on each node should equal the number of physical processors available on the node.


 Verifying Results

As many as seven different output files may be produced by the run, only one of which, OUT, is important for benchmarking purposes. A verification script, "checkout", is provided with the distribution to determine the correctness of the run by comparing "OUT" with the reference "OUT.<size>". The "OUT" files for the medium and large cases should be provided to NERSC to verify the results.

 

The configuration file ("sysvar.machine_name") used and a complete log of the build process should also be returned to NERSC for verification.


 Modification Record

This is PARATEC Release 5.1.13b1

 Record of Formal Questions and Answers


 Bibliography

   [PAR1] PARATEC web page, http://www1.nersc.gov/projects/paratec/

   [Oliker] "Leading Computational Methods on Scalar and Vector HEC Platforms," Proceedings of SC|05, November 12-18, 2005, Seattle, Washington, USA.

   [PAR2] "Scaling First Principles Materials Science Codes to Thousands of Processors," crd.lbl.gov/DOEresources/SC04/Canning_Nanoscience_SC04.pdf
