The NERSC PMEMD README file
PMEMD: Particle Mesh Ewald Molecular Dynamics
The benchmark code PMEMD (Particle Mesh Ewald Molecular Dynamics) performs various molecular simulations, including Molecular Dynamics (MD), NMR refinement, and minimization. The code is one of about 50 programs that comprise the Amber suite [AMB] and is an extensively modified version of Sander, the main molecular dynamics driver in Amber. The modifications in PMEMD are aimed primarily at improving parallel scalability. PMEMD does not support all of the options found in Sander, but it has a significant performance advantage for the most commonly used simulation options.
At a high level an MD simulation involves three components:
- Determining the energy of a system and the forces (gradients of energy) on atoms;
- Moving the atoms according to those forces; and
- Adjusting temperature and pressure to reflect the new state.
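The three components above form an integration loop. The following is an illustrative velocity-Verlet sketch for a toy one-dimensional harmonic system, not PMEMD's actual integrator; the function and parameter names are invented for the example.

```python
# Illustrative velocity-Verlet loop for a toy 1-D system (hypothetical
# names; PMEMD's actual integrator and force routines are more involved).
def velocity_verlet(x, v, force, mass, dt, steps):
    """Advance positions/velocities: kick-drift-kick, recomputing forces
    (negative energy gradients) at each new set of positions."""
    f = force(x)
    for _ in range(steps):
        v = [vi + 0.5 * dt * fi / mass for vi, fi in zip(v, f)]  # half kick
        x = [xi + dt * vi for xi, vi in zip(x, v)]               # drift
        f = force(x)                                             # new forces
        v = [vi + 0.5 * dt * fi / mass for vi, fi in zip(v, f)]  # half kick
    return x, v

# Harmonic "bond": F = -k*x. The energy 0.5*m*v^2 + 0.5*k*x^2 should be
# conserved by the integrator to within O(dt^2).
k, m, dt = 1.0, 1.0, 0.01
x, v = velocity_verlet([1.0], [0.0], lambda xs: [-k * xi for xi in xs], m, dt, 1000)
```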
Atoms interact with other atoms through pairwise interactions of three basic types: chemical bonds (a very fast portion of the computation), electrostatic interactions, and van der Waals interactions.
Evaluated directly, the long-range electrostatic interactions would require a double sum of Coulombic terms over all interacting (non-bonded) pairs. The Ewald method separates this Coulombic sum of pairwise interactions into a direct sum in Cartesian space and a reciprocal sum in Fourier space. The Particle-Mesh Ewald (PME) method, inspired by the Hockney particle/mesh method, scales as N log(N). It achieves this by splitting the electrostatic energy into local interactions that are computed explicitly and long-range interactions that are approximated by a discrete convolution on a grid to which the charge has been interpolated, using a three-dimensional FFT to perform the convolution efficiently. PMEMD uses a cubic spline interpolation method. [DAR]
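The Ewald splitting itself can be illustrated with the standard complementary-error-function decomposition of the Coulomb kernel. This is a generic sketch; `beta` is the usual Ewald screening parameter, not a PMEMD input.

```python
import math

def ewald_split(r, beta):
    """Split the Coulomb kernel 1/r into a short-range part, summed
    directly in Cartesian space, and a smooth long-range part, summed in
    reciprocal (Fourier) space:
        1/r = erfc(beta*r)/r + erf(beta*r)/r
    beta is the usual Ewald screening parameter (illustrative value below).
    """
    direct = math.erfc(beta * r) / r      # decays rapidly with r
    reciprocal = math.erf(beta * r) / r   # smooth; FFT-friendly in PME
    return direct, reciprocal

d, rec = ewald_split(r=2.0, beta=1.0)  # the two parts always sum to 1/r
```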
PMEMD consists of about 30,000 lines of Fortran90 (as measured by SLOCCount) and about 50 lines of C. Every file is preprocessed via cpp. The C code contains basically three functions: a wrapper around erfc for various flavors of Fortran/C and single or double precision, a routine to find byte sizes, and a timer routine.
The version supplied uses MPI although it can also be built for a single-processor run and for SHMEM, if available.
Vendor-supplied FFT libraries could potentially be used although the code as supplied has hooks only for SGI libraries or the Fortran90 FFTs included. The SGI versions are selected with the preprocessor flag "-DSGIFFT".
One BLAS routine, DDOT, can be used in the file runmin_cit.f90.
There is a custom Fortran90 version of subroutine short_ene written specifically for the Intel Pentium IV SIMD system. This routine is selected with the preprocessor flag "-DINTEL_P4_VECT_OPT".
The preprocessor flag "-DSLOW_NONBLOCKING_MPI" defined in several files controls the selection of code that may lead to better scalability on some systems by reducing the amount of non-blocking communications.
Relationship to NERSC Workload
PMEMD is very similar to the MD Engine in AMBER 8.0 used in both chemistry and biosciences. The NERSC workload that uses it is funded by BER.
Sander is a parallel program, using the MPI programming interface to communicate among processors. It uses a replicated data structure, in which each processor owns certain atoms, but where all processors know the coordinates of all atoms. At each step, processors compute a portion of the potential energy and corresponding gradients. A binary tree global communication then sums the force vector, so that each processor gets the full force vector components for its owned atoms. The processors then perform a molecular dynamics update step for the owned atoms, and use a second binary tree to communicate the updated positions to all processors, in preparation for the next molecular dynamics step. [CAS]
Because all processors know the positions of all atoms, this model provides a convenient programming environment in which the division of force-field tasks among the processors can be made in a variety of ways. The main problem is that the communication required at each step is roughly constant with the number of processors, which inhibits parallel scaling. In practice, this communication overhead means that typical explicit solvent molecular dynamics simulations do not scale well beyond about eight processors on a typical cluster with gigabit ethernet, or beyond 16-32 processors on machines with more efficient (and more expensive) interconnection hardware. Implicit solvent simulations, which have far fewer forces and coordinates to communicate, scale significantly better. For these relatively small numbers of processors, inequities in load balancing and serial portions of the code are not limiting factors, although more work would have to be done for larger processor counts.
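The binary-tree force summation can be sketched as follows. This is a toy serial model of the reduction pattern, not the actual MPI code; function and variable names are invented.

```python
def tree_sum(per_task_forces):
    """Serial model of a binary-tree reduction: partial force arrays from
    P tasks are pairwise combined in about log2(P) stages, after which
    element 0 holds the fully summed force vector. (Toy sketch; the real
    code does this with MPI point-to-point messages.)"""
    bufs = [list(f) for f in per_task_forces]
    stride = 1
    while stride < len(bufs):
        for i in range(0, len(bufs) - stride, 2 * stride):
            # task i absorbs the partial forces held by task i+stride
            bufs[i] = [a + b for a, b in zip(bufs[i], bufs[i + stride])]
        stride *= 2
    return bufs[0]

# Four tasks, each holding partial forces for the same three atoms
partial = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 0.5, 1.0], [1.5, 0.5, 0.0]]
total = tree_sum(partial)
```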
To improve performance, PMEMD communicates to each processor only the coordinate information necessary for computing the pieces of the potential energy assigned to it. Other optimizations include use of highly asynchronous communications to achieve a high degree of parallelism. Periodic load balancing steps redistribute the spatially decomposed grid amongst MPI processes. [PER] [CRO] [KAM]
Some PMEMD performance numbers may be found at http://amber.scripps.edu/amber8.bench2.html.
For NERSC-related procurements please visit the procurement site.
Due to licensing restrictions PMEMD cannot be downloaded from NERSC. To obtain the code visit the Amber site at Scripps.
You can download the NERSC-5 PMEMD benchmark input data files here (gzip tar file).
The following simple steps are required to build the code.
- The main directory for building the code is src.pmemd. Change to this directory and first run make clean.
- Look in the subdirectory src.pmemd/Machines for a file of the form Machine.<machine> where <machine> matches your machine, or, if necessary, create a new one based on an existing file. This is the file used in step 3.
Note: If you want to do a parallel run, make sure that you choose or create a Machine.<machine> file that contains the -DMPI preprocessor option.
- Change to the src.pmemd directory and create a soft link named MACHINE to the file you chose or created in step 2.
- Then type make install
Example, assuming a file already has been created in the directory ./src.pmemd/Machines called Machine.jacquard_mpi:
cd src.pmemd
rm MACHINE
ln -s Machines/Machine.jacquard_mpi MACHINE
make install
The executable, called pmemd, will be in the exe subdirectory.
The main program is in src.pmemd/pmemd/pmemd.f90 and MPI_Init is in subroutine parallel_environment_startup in source file src.pmemd/pmemd/alltasks_setup.f90.
PMEMD has been written so as to achieve compatibility with several versions of Amber. You will find two versions of many files/subroutines included, one version named with "cit" and the other without. CIT stands for "Coordinate Index Table", a data structure that is key to the PMEMD performance improvements. Using the "cit" files is the default mode of PMEMD.
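The actual CIT layout is internal to PMEMD; the following is only a generic cell-list sketch of the underlying idea, namely bucketing atoms by spatial cell so that neighbor searches touch only nearby cells instead of all N atoms.

```python
def build_cell_index(coords, box, cell_size):
    """Bucket atoms into spatial cells (a generic cell-list sketch, not
    PMEMD's actual CIT layout). Neighbor searches then only examine a
    cell and its adjacent cells. Coordinates are assumed to lie in
    [0, box) on each axis of a cubic box."""
    ncell = max(1, int(box // cell_size))
    table = {}
    for atom, (x, y, z) in enumerate(coords):
        key = (int(x / box * ncell) % ncell,
               int(y / box * ncell) % ncell,
               int(z / box * ncell) % ncell)
        table.setdefault(key, []).append(atom)
    return table, ncell

# Two nearby atoms land in the same cell; a distant one lands elsewhere.
coords = [(0.5, 0.5, 0.5), (0.6, 0.4, 0.5), (9.5, 9.5, 9.5)]
table, ncell = build_cell_index(coords, box=10.0, cell_size=2.0)
```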
A description of the directory structure and some important files is given in the following table.
| Directory or File | Description |
|---|---|
| ./ | Top-level directory with README |
| ./src.pmemd | Directory with top-level Makefile |
| ./src.pmemd/pmemd | Directory: all source files and Makefile |
| ./src.pmemd/Machines | Directory: contains several Machine.<X> files with machine-specific environment variable definitions |
| ./src.pmemd/MACHINE | File: soft link to ./src.pmemd/Machines/Machine.<X> |
| ./src.pmemd/Machines/<X> | Directories for various machines containing system-dependent source code (sys_proto.f90 and erfcfun.c) |
| ./exe | Directory: executable is put here |
The MPI code must be run on at least 2 processors.
The concurrency simply equals the number of MPI tasks. Computational nodes employed in the benchmark must be fully packed, that is, the number of processes or threads executing must equal the number of physical processors on the node.
The way that you invoke the application differs for the different problem sizes supplied; see the Required Runs section below.
The code measures time in three different ways. It obtains CPU time through a subroutine called "second" and elapsed time through two wrappers. The first wrapper is a subroutine called "wall"; both "second" and "wall" call system-dependent routines and are located in the sys_proto.f90 file in the separate system directories ./src.pmemd/Machines/<X>. The second elapsed-time wrapper is subroutine get_wall_time, which is coded in the C file ./src.pmemd/pmemd/pmemd_clib.c and calls gettimeofday.
The code is set up to produce about 22 different internal timings. It also calculates averages, minimums, and maximums amongst all tasks. Note: the timing harness of interest is the one that is currently labeled "ELAPSED TIME." The intention is to measure elapsed (wallclock) time.
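The elapsed-time measurement can be approximated as follows. This is a Python stand-in for the C gettimeofday wrapper, with invented names; it only illustrates the seconds-plus-microseconds convention.

```python
import time

def get_wall_time():
    """Python stand-in for PMEMD's get_wall_time C wrapper: like
    gettimeofday, return wall-clock time as (seconds, microseconds)."""
    t = time.time()
    sec = int(t)
    return sec, int((t - sec) * 1e6)

s0, u0 = get_wall_time()
work = sum(i * i for i in range(100000))  # stand-in for a timed section
s1, u1 = get_wall_time()
elapsed = (s1 - s0) + (u1 - u0) * 1e-6    # elapsed (wallclock) seconds
```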
The code calculates and prints the amount of storage required for real and integer data. The numbers printed represent the cumulative sum for data allocated throughout the code. All integers are declared as type "INTEGER," i.e., no Fortran90 KIND is specified, and all reals are declared as "DOUBLE PRECISION."
Memory Required By The Sample Problems:
The minimum memory configuration required to run the problems in each configuration must be reported (OS + buffers + code + data + ...).
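As a back-of-the-envelope illustration of how such a cumulative allocation report translates into bytes (the per-atom array counts below are invented for the example; 8-byte DOUBLE PRECISION and 4-byte default INTEGER are typical compiler defaults, not guaranteed by the Fortran standard):

```python
def reported_memory_mib(num_reals, num_ints, real_bytes=8, int_bytes=4):
    """Convert cumulative real/integer allocation counts, as PMEMD
    reports them, into mebibytes. The 8-byte/4-byte sizes are typical
    defaults and may change with compiler flags."""
    return (num_reals * real_bytes + num_ints * int_bytes) / 2**20

# Invented counts for a 90,906-atom system: four 3*N real arrays
# (e.g. coordinates, velocities, forces, scratch) and ten N-length
# integer arrays -- purely illustrative numbers.
mb = reported_memory_mib(num_reals=4 * 3 * 90906, num_ints=10 * 90906)
```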
There are two molecular systems included in this benchmark, a smaller one and a larger one. Both may be run at a wide range of concurrencies; the program itself does not constrain concurrency outside of certain end cases (e.g., more than one atom per processor is needed). However, the smaller problem, with 23,558 atoms, is intended to be run on 64 or fewer processors, and the larger one, which simulates a 90,906-atom blood coagulation protein, is intended to be run on 64 or more processors.
The code requires three command line arguments; see the table and examples below.
| Argument | Description | Notes |
|---|---|---|
| -o <ofile> | Output file name | Change to whatever you want |
| -O | Overwrite output files | Don't change it |
| -c <ifile> | Input file name | For small runs <ifile> = inpcrd.equil; for large runs <ifile> = inpcrd |
Two examples are below. Note that some machines require putting the command line part of the mpirun command in quotes.
cd run_small
mpirun -np 16 "../exe/pmemd.mpi -O -c inpcrd.equil -o run_small_out_jac16"
cd ../run_256
mpirun -np 256 "../exe/pmemd.mpi -O -c inpcrd -o out_jac256"
Two additional input files (prtop and mdin) are required for some of the runs. These should not be changed. The file mdin uses the Fortran90 namelist syntax.
There are three subdirectories in which input data files, reference output files, and sample batch submission scripts are located.
The top level PMEMD directory contains a verify script that you should run on the output from a run. Syntax is
./verify <output_file>
This script checks that the end of the output shows the program completed, and it computes the average energy over the course of the simulation. The script is specific to the chemical system used for the 64- and 256-way runs. IT WILL NOT WORK WITH THE SMALL TEST SYSTEM.
Successful verification looks like this:
./verify run_64/my_res_64
PMEMD VERFICATION SUCCESSFUL
./verify run_256/my_res_256
PMEMD VERFICATION SUCCESSFUL
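The check the verify script performs can be approximated like this. This is a toy sketch; the real script's parsing rules, completion test, and tolerances are its own, and the "Etot" pattern is only illustrative.

```python
import re

def average_etot(mdout_text):
    """Toy analogue of the verify check: find Amber-style "Etot =" records
    in the output and average them. (The real script's parsing rules and
    completion test are its own; this pattern is only illustrative.)"""
    vals = [float(s) for s in re.findall(r"Etot\s*=\s*(-?\d+\.\d+)", mdout_text)]
    if not vals:
        raise ValueError("no energy records found; the run may not have completed")
    return sum(vals) / len(vals)

# Two fabricated energy records averaging to -100.0
sample = "Etot = -100.5\n ...\n Etot = -99.5\n"
avg = average_etot(sample)
```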
[AMB] Amber Home Page http://amber.scripps.edu/
[PER] AMBER Performance on HPCx http://www.hpcx.ac.uk/research/hpc/
[CAS] "The Amber biomolecular simulation programs," David A. Case, Thomas E. Cheatham III, Tom Darden, Holger Gohlke, Ray Luo, Kenneth M. Merz Jr., Alexey Onufriev, Carlos Simmerling, Bing Wang, and Robert J. Woods, Journal of Computational Chemistry, Volume 26, Issue 16, 2005, pp. 1668-1688.
[DAR] "Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems," T. Darden, D. York, and L. Pedersen, The Journal of Chemical Physics, Volume 98, Issue 12, June 15, 1993, pp. 10089-10092.
[CRO] "Adventures in Improving the Scaling and Accuracy of a Parallel Molecular Dynamics Program," Michael Crowley, Tom Darden, Thomas Cheatham III, and David Deerfield II, The Journal of Supercomputing, Volume 11, Number 3, November 1997, pp. 255-278.
[KAM] "Understanding Ultra-Scale Application Communication Requirements," from Kamil, Shalf, Oliker, and Skinner, Proceedings of IISWC 2005. http://www-vis.lbl.gov/Publications/2005/iiswc05.pdf