PARATEC
Code Description
PARATEC: Parallel Total Energy Code
General Description
The benchmark code PARATEC (PARAllel Total Energy Code) performs ab initio quantum-mechanical total energy calculations using pseudopotentials and a plane-wave basis set. Total energy minimization of the electrons is done with a self-consistent field (SCF) method. Force calculations are also done to relax the atoms into their equilibrium positions. PARATEC uses an all-band (unconstrained) conjugate gradient (CG) approach to solve the Kohn-Sham equations of Density Functional Theory (DFT) and obtain the ground-state electron wavefunctions. In solving the Kohn-Sham equations using a plane-wave basis, part of the calculation is carried out in Fourier space, where the wavefunction for each electron is represented by a sphere of points, and the remainder is carried out in real space. Specialized parallel three-dimensional FFTs are used to transform the wavefunctions between real and Fourier space, and a key optimization in PARATEC is to transform only the non-zero grid elements. [PAR1]
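To make the "sphere of points" concrete: a plane-wave basis keeps every reciprocal-lattice vector G whose kinetic energy |G|²/2 falls below a chosen cutoff. The following minimal Fortran sketch counts those vectors for a simple cubic cell; the routine name, arguments, and the cubic-cell assumption are illustrative and do not correspond to PARATEC source.

! Illustrative sketch only (not PARATEC source): count the reciprocal-
! lattice vectors inside the kinetic-energy cutoff sphere for a simple
! cubic cell of lattice constant alat. These vectors form the "sphere
! of points" that represents each wavefunction in Fourier space.
subroutine count_sphere(alat, ecut, nmax, npw)
  implicit none
  real(kind=8), intent(in) :: alat, ecut  ! lattice constant (bohr), cutoff (hartree)
  integer, intent(in)      :: nmax        ! search range for integer G indices
  integer, intent(out)     :: npw         ! number of plane waves in the sphere
  integer :: g1, g2, g3
  real(kind=8) :: tpiba, gsq
  tpiba = 2.0d0*acos(-1.0d0)/alat         ! unit of |G| is 2*pi/a
  npw = 0
  do g3 = -nmax, nmax
    do g2 = -nmax, nmax
      do g1 = -nmax, nmax
        gsq = tpiba**2 * dble(g1*g1 + g2*g2 + g3*g3)
        if (0.5d0*gsq < ecut) npw = npw + 1  ! keep G with |G|^2/2 < E_cut
      end do
    end do
  end do
end subroutine count_sphere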
The code can use optimized libraries for both basic linear algebra and Fast Fourier Transforms. However, because of its global communication requirements, architectures with a poor balance between bisection bandwidth and computational rate will suffer performance degradation at higher concurrencies on PARATEC. [Oliker] Nevertheless, due to favorable scaling, high computational intensity, and other optimizations, the code generally achieves a high percentage of peak performance on both superscalar and vector systems. [PAR2]
In the benchmark problems supplied here, 11 conjugate gradient iterations are performed; a real run would typically do between 20 and 60.
Coding
PARATEC consists of about 50,000 lines of Fortran 90 code. Preprocessing via m4 is used to include machine-specific routines such as the FFT calls. The version supplied uses MPI, although it can also be built for a single-processor run and for SHMEM, if available.
The code typically spends about 30% of its time in BLAS3 routines and 30% in the one-dimensional FFTs on which the three-dimensional FFTs are built. The remainder of the time is spent in various other Fortran 90 routines. Vendor libraries such as IBM's ESSL can be used for both the linear algebra and Fourier transform routines. However, PARATEC includes code that does 3D FFTs via three sets of hand-written one-dimensional FFTs. Many FFTs are done at the same time to avoid latency issues, and only non-zero elements are communicated/calculated; thus, these routines can be faster than vendor-supplied routines. Additional libraries used are ScaLAPACK and BLACS.
A list of files that are preprocessed and may require modification (although unlikely) is here.
NOTE Concerning Use of FFTW on 64-bit Address Platforms
When using the FFTW library on machines that have 64-bit addresses (e.g., AMD Opteron), you must change the Fortran 90 declaration of two variables in the file fft_macros.m4h in the subdirectory src/macros/fft. The statement
INTEGER FFTW_PLAN_BWD, FFTW_PLAN_FWD
must be changed to
INTEGER*8 FFTW_PLAN_BWD, FFTW_PLAN_FWD
Otherwise, the program will compile but fail with a segmentation violation in the FFTW call.
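The reason is that an FFTW plan is stored as a C pointer, which occupies 8 bytes on a 64-bit platform and therefore needs an INTEGER*8 on the Fortran side. A minimal sketch of how such a plan variable is used, assuming the FFTW 2.x Fortran wrappers (the constant values and driver code here are illustrative, not taken from PARATEC):

! Sketch assuming the FFTW 2.x Fortran interface (link against -lfftw);
! the FFTW_* constant values below are the FFTW 2 defaults, normally
! obtained from the fftw_f77.i include file.
program fftw_plan_demo
  implicit none
  integer, parameter :: n = 128
  integer, parameter :: FFTW_FORWARD = -1, FFTW_ESTIMATE = 0
  integer*8 :: plan                 ! holds a C pointer: 8 bytes on 64-bit machines
  complex(kind=8) :: in(n), out(n)
  in = (1.0d0, 0.0d0)
  call fftw_f77_create_plan(plan, n, FFTW_FORWARD, FFTW_ESTIMATE)
  call fftw_f77_one(plan, in, out)  ! one 1-D complex transform of length n
  call fftw_f77_destroy_plan(plan)
end program fftw_plan_demo

With the plan declared as a 4-byte INTEGER, the upper half of the pointer is lost, which is why the failure appears only at run time.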
Authorship
See http://www1.nersc.gov/projects/paratec/DOC/
Relationship to NERSC Workload
A recent survey of NERSC ERCAP requests for materials science applications showed that Density Functional Theory (DFT) codes similar to PARATEC accounted for nearly 80% of all HPC cycles delivered to the materials science community. Supported by DOE BES, PARATEC is an excellent proxy for the application requirements of the entire materials science community. PARATEC simulations can also be used to predict nuclear magnetic resonance shifts. The overall goal is to simulate the synthesis and predict the properties of multi-component nanosystems.
Parallelization
Each electron in a plane-wave simulation is represented by a grid of points from which the wavefunction is constructed. Parallel decomposition of such a problem can be over n(g), the number of grid points per electron (typically O(100,000)), n(i), the number of electrons (typically O(800) per simulated system), or n(k), the number of sampling points (typically O(1-10)).
PARATEC uses MPI and parallelizes over grid points, thereby achieving a fine-grained level of parallelism. In Fourier space each electron's wavefunction grid forms a sphere. The figure below depicts the parallel data layout on three processors. Each processor holds several columns, which are lines along the z-axis of the FFT grid. Load balancing is important because much of the compute-intensive part of the calculation is carried out in Fourier space. To get good load balancing, the columns are first sorted in descending order of length and then assigned, one at a time, to the processor currently holding the fewest points.
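A minimal sketch of that greedy assignment follows; the routine and variable names are illustrative, not PARATEC's, and the column lengths are assumed to be presorted.

! Sketch only (names are not PARATEC's): assign FFT-grid columns to
! processors. Columns are assumed presorted by descending length;
! each goes to the processor currently holding the fewest points.
subroutine balance_columns(ncol, len, nproc, owner)
  implicit none
  integer, intent(in)  :: ncol, nproc
  integer, intent(in)  :: len(ncol)      ! column lengths, sorted descending
  integer, intent(out) :: owner(ncol)    ! processor assigned to each column
  integer :: load(nproc)                 ! points held so far by each processor
  integer :: ic, ip, ipmin
  load = 0
  do ic = 1, ncol
    ipmin = 1                            ! find the least-loaded processor
    do ip = 2, nproc
      if (load(ip) < load(ipmin)) ipmin = ip
    end do
    owner(ic) = ipmin
    load(ipmin) = load(ipmin) + len(ic)
  end do
end subroutine balance_columns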
The real-space data layout of the wavefunctions is a standard Cartesian grid, where each processor holds a contiguous part of the space arranged in columns, as shown in Figure 4b. Custom three-dimensional FFTs transform between these two data layouts. Data remain arranged in columns as the three-dimensional FFT is performed, by taking one-dimensional FFTs along the Z, Y, and X directions with parallel data transposes between each set of one-dimensional FFTs. These transposes require global interprocessor communication and represent the most important impediment to high scalability.
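Ignoring the parallel transposes and the non-zero-column optimization, the overall structure is that of a standard 3-D FFT decomposed into 1-D transforms. The following serial Fortran sketch shows the three sweeps; fft1d here is a naive O(n²) DFT stand-in, not a PARATEC routine.

! Serial illustration (placeholder names, not PARATEC source): a 3-D FFT
! of an n1 x n2 x n3 array built from three sweeps of 1-D transforms.
! In the parallel code each sweep works on locally held columns, and a
! global data transpose separates consecutive sweeps.
subroutine fft3d_sketch(a, n1, n2, n3)
  implicit none
  integer, intent(in) :: n1, n2, n3
  complex(kind=8), intent(inout) :: a(n1, n2, n3)
  integer :: i, j, k
  do k = 1, n3                 ! sweep 1: transform along the first axis
    do j = 1, n2
      call fft1d(a(:, j, k), n1)
    end do
  end do
  do k = 1, n3                 ! sweep 2: transform along the second axis
    do i = 1, n1
      call fft1d(a(i, :, k), n2)
    end do
  end do
  do j = 1, n2                 ! sweep 3: transform along the third axis
    do i = 1, n1
      call fft1d(a(i, j, :), n3)
    end do
  end do
end subroutine fft3d_sketch

subroutine fft1d(x, n)         ! naive O(n^2) DFT stand-in for a real 1-D FFT
  implicit none
  integer, intent(in) :: n
  complex(kind=8), intent(inout) :: x(n)
  complex(kind=8) :: y(n)
  integer :: p, q
  real(kind=8), parameter :: twopi = 6.283185307179586d0
  do p = 1, n
    y(p) = (0.0d0, 0.0d0)
    do q = 1, n
      y(p) = y(p) + x(q)*exp(cmplx(0.0d0, -twopi*dble((p-1)*(q-1))/dble(n), kind=8))
    end do
  end do
  x = y
end subroutine fft1d

In PARATEC only the columns that intersect the non-zero sphere are transformed and communicated, which shrinks both the FFT work and the transpose volume.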
The FFT portion of the code scales approximately as n² log(n) and the dense linear algebra portion scales approximately as n³, where n is the number of atoms in the simulation. Because the communication-heavy FFTs grow more slowly than the computation-heavy linear algebra, the overall computation-to-communication ratio improves roughly linearly with n.
Obtaining Version 6 of the Code
To use the latest version of PARATEC and find build instructions, please see the paratec software page. Note that the instructions below deal predominantly with the older version 5.1 of the code.
You can download the NERSC-6 PARATEC benchmark input data files here (tar file).
Running the Code
The concurrency simply equals the number of MPI tasks. Computational nodes employed in the benchmark must be fully packed; that is, the number of processes or threads executing must equal the number of physical processors on the node.
Invoke the application by typing, for example,
mpirun -np #tasks paratec.mpi
or
poe paratec.mpi
Paratec expects two files, "input" and "Si_POT.DAT", in the directory in which it is executing. Copy the file "input.<size>" to "input" for the <size> required (e.g., copy "input.medium" to "input" for the medium problem).
The important output file is "OUT." The last line contains the time for the run.
Timing Issues
The code is heavily instrumented for timing; the timer is called "gimmetime" and it is defined in one of the system-specific source files src/shared/ze_<machine_name>.f90. Note: the timing harness of interest is the one that produces the output string "NERSC_TIME"; it times the main loop in the file pwmain.f90p. The intention is to measure elapsed (wall-clock) time.
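Schematically, the harness wraps the main loop along these lines (a hypothetical reconstruction; only the gimmetime name and the NERSC_TIME string come from the code, and the actual statements in pwmain.f90p may differ):

! Hypothetical sketch of the timing harness, not actual PARATEC source.
real(kind=8) :: t_start, t_elapsed
real(kind=8), external :: gimmetime   ! system-specific wall-clock timer
t_start = gimmetime()
! ... main conjugate gradient loop ...
t_elapsed = gimmetime() - t_start
write(*,*) 'NERSC_TIME', t_elapsed    ! the time to report for the benchmark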
Storage Issues
Memory Required By The Sample Problems:

              small       medium      large
Memory        0.256 GB    1.25 GB     2.0 GB

(All memory figures are as reported by LoadLeveler.)
Required Runs
The directory "benchmark" contains input for 3 problem sizes, "input.<size>", where <size> is "small", "medium", or "large". There are also corresponding sample output files, "OUT.<size>". The small case is used only for porting and debugging. Each problem size must be executed with the fixed concurrency specified below. The intent of these decks is not to gauge scalability but to obtain timing data at the three distinct concurrencies.
All runs simulate silicon in the diamond structure.

              small    medium    large
#atoms           16       250      686
Concurrency       4        64      256
A typical calculation might require between 20 and 60 CG iterations to converge the charge density.
There is a subdirectory "benchmark" in which input data files, reference output files, and sample batch submission scripts are located. Note that PARATEC must be executed with "fully packed" nodes, i.e., the number of processes or threads employed on each node should equal the number of physical processors available on the node.
Verifying Results
As many as seven different output files may be produced by the run, only one of which, "OUT", is important for benchmarking purposes. A verification script, "checkout", is provided with the distribution to determine the correctness of the run by comparing "OUT" with the reference "OUT.<size>". The "OUT" files for the medium and large cases should be provided to NERSC to verify the results.
Additionally, the configuration file ("sysvars.machine_name") used and a complete log of the build process should also be returned to NERSC for verification.
Modification Record
This is PARATEC Release 5.1.13b1
Record of Formal Questions and Answers
No entries as yet.
Bibliography
[PAR1] PARATEC web page, http://www1.nersc.gov/projects/paratec/
[Oliker] "Leading Computational Methods on Scalar and Vector HEC Platforms," Proceedings of SC05, November 12-18, 2005, Seattle, Washington, USA.
[PAR2] "Scaling First Principles Materials Science Codes to Thousands of Processors," CanningNanoscienceSC04.pdf