The NERSC GTC README File
- Code Description
- Obtaining the Code
- Building the Code
- Files in this distribution
- Running the Code
- Timing Issues
- Storage Issues
- Required Runs
- Verifying Results
- Modification Record
- Record of Formal Questions and Answers
GTC: 3D Gyrokinetic Toroidal Code
GTC is a 3-dimensional code used to study microturbulence in magnetically confined toroidal fusion plasmas via the Particle-In-Cell (PIC) method. It solves the gyro-averaged Vlasov equation in real space using global gyrokinetic techniques and an electrostatic approximation. The Vlasov equation describes the evolution of a system of particles under the effects of self-consistent electromagnetic fields. The unknown is the distribution function f(t,x,v) of particles (electrons, ions, ...) in phase space, a function of time t, position x, and velocity v. This model assumes a collisionless plasma; i.e., the particles interact with one another only through the self-consistent field, and thus the algorithm scales as N instead of N^2, where N is the number of particles. [GTC]
A "classical" PIC algorithm uses discrete "particles" to sample a charge distribution function. The particles interact via a grid, on which the potential is calculated from charges "deposited" at the grid points. Each iteration involves four steps:
- "SCATTER," or deposit, charges on the grid. A particle typically contributes to many grid points and many particles contribute to any one grid point.
- Solve the Poisson Equation;
- "GATHER" forces on each particle from the resultant potential function;
- and "PUSH" (move) the particles to new locations on the grid.
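For illustration only (not part of the GTC distribution), the four steps can be sketched as a minimal 1-D periodic electrostatic PIC cycle in Python. The linear (cloud-in-cell) weighting and the FFT-based Poisson solve are common textbook choices, not GTC's gyrokinetic formulation:

```python
import numpy as np

def pic_step(x, v, q_over_m, ng, L, dt):
    """One idealized 1-D electrostatic PIC cycle on a periodic grid."""
    dx = L / ng
    # SCATTER: deposit charge with linear (cloud-in-cell) weighting.
    g = x / dx
    i = np.floor(g).astype(int) % ng
    w = g - np.floor(g)                  # fraction toward the right-hand point
    rho = np.zeros(ng)
    np.add.at(rho, i, 1.0 - w)
    np.add.at(rho, (i + 1) % ng, w)
    rho -= rho.mean()                    # neutralizing background
    # SOLVE the Poisson equation in Fourier space: phi_k = rho_k / k^2.
    k = 2.0 * np.pi * np.fft.fftfreq(ng, d=dx)
    rho_k = np.fft.fft(rho)
    phi_k = np.zeros_like(rho_k)
    phi_k[1:] = rho_k[1:] / k[1:]**2
    phi = np.real(np.fft.ifft(phi_k))
    # GATHER: E = -dphi/dx, interpolated back to each particle.
    E_grid = -(np.roll(phi, -1) - np.roll(phi, 1)) / (2.0 * dx)
    E_part = (1.0 - w) * E_grid[i] + w * E_grid[(i + 1) % ng]
    # PUSH: advance velocity and position, keeping particles in [0, L).
    v = v + q_over_m * E_part * dt
    x = (x + v * dt) % L
    return x, v
```

A uniform particle load deposits a uniform charge density, so the field and hence the velocity update vanish, which is a convenient sanity check.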
In GTC, point-charge particles are replaced by charged rings using a four-point gyro-averaging scheme.
An important approximation in GTC comes from the fact that fast particle motion along the magnetic field lines leads to a quasi-two-dimensional structure in the electrostatic potential. Thus, the Poisson Equation needs only to be solved on a 2-D poloidal plane. GTC uses a logically non-rectangular field-aligned grid, in part, to keep the number of particles per cell nearly constant. The mesh effectively twists along with the magnetic field in the toroidal direction.
A PIC simulation is characterized by the number of particles used and the size of the mesh. GTC has been used in runs with over 1 billion particles and over 100 million grid points.
The GTC code has been optimized to achieve high efficiency on both cache-based superscalar and vector nodes. Both vector and superscalar versions of the Poisson Equation solver are provided.
In GTC, the main bottleneck is usually the charge deposition, or scatter operation, as is true for most particle codes. The classic scatter algorithm consists of a loop over the particles, finding the nearest grid points surrounding each particle position. A fraction of the particle's charge is assigned to each of these grid points according to its distance from the particle's position, and the charge fractions are accumulated in a grid array. The scatter algorithm in GTC is more complex because the quickly gyrating particles are described by charged rings tracked by their guiding centers. This results in a larger number of operations than in "classical" PIC, since several points are picked on each ring and each of them has its own set of neighboring grid points.
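The ring-based deposit can be sketched as follows; this is an illustrative Python analogue, not GTC's Fortran, with a 2-D periodic grid, bilinear weighting, and four equally spaced ring points standing in for the four-point gyro average:

```python
import numpy as np

def gyro_scatter(xc, yc, rho_gyro, charge, grid, dx):
    """Deposit one particle's charge via a 4-point gyro-averaged ring.

    (xc, yc) is the guiding center and rho_gyro the gyroradius; `grid` is
    a 2-D periodic charge array modified in place. Each of the 4 ring
    points carries a quarter of the charge and is deposited bilinearly to
    its own 4 neighboring grid points (16 grid updates per particle).
    """
    nx, ny = grid.shape
    for angle in (0.0, 0.5 * np.pi, np.pi, 1.5 * np.pi):
        px = (xc + rho_gyro * np.cos(angle)) / dx
        py = (yc + rho_gyro * np.sin(angle)) / dx
        i, j = int(np.floor(px)), int(np.floor(py))
        fx, fy = px - i, py - j
        for di, wx in ((0, 1.0 - fx), (1, fx)):
            for dj, wy in ((0, 1.0 - fy), (1, fy)):
                grid[(i + di) % nx, (j + dj) % ny] += 0.25 * charge * wx * wy
```

The bilinear weights at each ring point sum to one, so the total charge deposited equals the particle's charge regardless of where the ring falls.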
This version of GTC is written entirely in free-format Fortran 90/95, about 4300 lines of code in 17 source files, including routines for which multiple versions are provided.
Relationship to NERSC Workload
GTC is used for fusion energy research via the DOE SciDAC program [FUS] and for the international fusion collaboration [ITER]. Support for it comes from the DOE Office of Fusion Energy Sciences. It is used for studying neoclassical and turbulent transport in tokamaks and stellarators as well as for investigating hot-particle physics, toroidal Alfven modes, and neoclassical tearing modes [LEE].
The primary parallel programming model for GTC is a 1-D domain decomposition with each MPI process assigned a toroidal section and MPI_SENDRECV used to shift particles moving between domains. A variety of both grid-based and particle-based loops can also be multitasked via OpenMP directives included in the code. MPI library calls and OpenMP directives are not isolated in any one routine or source file. Builds can be MPI-only or combined OpenMP/MPI.
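As a serial illustration of the shift logic (the actual code exchanges these particle lists between neighboring ranks with MPI_SENDRECV; the function and variable names here are hypothetical):

```python
import numpy as np

def shift_particles(domains):
    """Serial sketch of a particle shift between 1-D toroidal domains.

    `domains` is a list of arrays of toroidal angles, one per "rank".
    Each rank owns an equal angular section; particles whose angle has
    left the local section are handed to the rank that owns it.
    """
    two_pi = 2.0 * np.pi
    n = len(domains)
    width = two_pi / n
    out = [[] for _ in range(n)]
    for rank, zeta in enumerate(domains):
        zeta = np.asarray(zeta) % two_pi
        owner = np.floor(zeta / width).astype(int) % n  # destination rank
        for r in range(n):
            out[r].extend(zeta[owner == r])
    return [np.sort(np.array(p)) for p in out]
```

In the MPI version each rank only communicates with its two toroidal neighbors per shift step, so a particle crossing several domains takes several steps to arrive.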
In GTC the entire calculation is done in real space, including the Poisson solve, thereby improving parallel scalability. In a typical simulation only about 10-15% of the time is spent in interprocessor communication. However, due to the gather/scatter nature of the charge deposition in PIC codes, GTC is very sensitive to memory latency [Ethier].
The GTC distribution includes highly vectorizable routines for the charge scatter, Poisson solve, particle push, and particle shift between domains. The vectorizable scatter routines can make use of gather/scatter hardware and have been vectorized using an algorithm that duplicates charge "bins" at the expense of additional memory in order to eliminate loop-carried dependences. "Atomic" memory operations could also mitigate the effect of the dependences. The duplication is done to the extent of the hardware vector length. The same loop-carried dependences and method for eliminating them apply to OpenMP parallelization of these loops.
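The bin-duplication idea can be sketched in Python (an illustrative analogue of the Fortran loops; the function name and `ncopies` parameter are hypothetical):

```python
import numpy as np

def scatter_with_copies(cells, weights, ng, ncopies=4):
    """Eliminate the scatter loop-carried dependence by replicating bins.

    A direct `grid[cells[p]] += weights[p]` loop cannot be vectorized (or
    OpenMP-parallelized) safely when two particles hit the same cell. The
    workaround sketched here gives each vector lane (or thread) its own
    copy of the charge array and sums the copies afterward, trading memory
    for independence; `ncopies` plays the role of the hardware vector length.
    """
    copies = np.zeros((ncopies, ng))
    for lane in range(ncopies):            # each lane accumulates independently
        sel = slice(lane, None, ncopies)   # this lane's strided particle subset
        np.add.at(copies[lane], cells[sel], weights[sel])
    return copies.sum(axis=0)              # final reduction over the copies
```

The result is identical to the sequential accumulation; only the order of the floating-point additions changes.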
For NERSC-related procurements please visit the procurement site.
Due to licensing restrictions GTC cannot be downloaded from NERSC. To obtain the code visit the PPPL site.
You can obtain the NERSC-5 GTC benchmark input data files here (gzip tar file).
GNUmake (gmake) is required. Files with the .F90 suffix are passed through the preprocessor by the compiler; all source files are either .F90 or .f90, and there are no header files.
All source files are in a single directory and there is a single Makefile with definitions for several different systems included (IRIX64, AIX, Linux, SUPER-UX, and UNICOS). The makefile runs the uname -s command to detect the operating system automatically.
Note that some compilers may not accept the way function types are declared in the file function.f90. An alternate version of this file is provided; rename function_alt.f90 to function.f90 to solve the problem.
Two basic versions of the code can be built: MPI-only and hybrid OpenMP/MPI.
The basic command to build the code is gmake.
Options included in the Makefile are:
- With or without OpenMP support; default is y.
- 8-byte floating point precision
- With or without debug option (-g). The default is n.
- With or without the AIX Engineering Scientific Subroutine Library, used for the FFT routines. The default is n.
- 64- or 32-bit compilation mode on AIX. The default is 64BITS.
- Use the PGI compiler (pgf90) on Linux. The default is to use the Lahey-Fujitsu compiler lf95.
- Use the Intel compiler (ifort ... -lmpi) on the Altix.
The vendor is expected to provide the appropriate system calls in timers.f to return local processor CPU time and global elapsed (wall clock) time.
The MPI ranks are identified by the variable "mype", with mype=0 as the master.
|README.html||README in HTML format|
|README||README in ASCII text|
|main.F90||main with MPI_Init|
|module.F90||global array declarations||double precision and NEC SX-|
|setup.F90||MPI_comm_size, MPI_comm_rank||AIX, OpenMP, Debug|
|setup_vec.F90||same as setup with add'l code to minimize bank conflicts on vector systems||AIX, OpenMP, SX, Debug|
|shifte.F90||move electrons across domains||OpenMP|
|shifti.F90||move ions across domains||OpenMP|
|shifti_vec.F90||vectorizable version with SX-6-specific performance monitoring calls||SX-|
|chargee.F90||interpolate electron charge density to grid first locally and then globally using MPI_SENDRECV and MPI_ALLREDUCE.||OpenMP|
|chargei.F90||interpolate ion charge density to grid first locally and then globally using MPI_SENDRECV and MPI_ALLREDUCE.||OpenMP|
|chargei_vec.F90||vectorizable version of chargei||SX and OpenMP; SX compiler directives|
|tracking.F90||tag each particle with a unique number that is carried along with it as it moves from one processor/domain to another; write track data to HDF files when doing a snapshot||HDF|
|smooth.F90||charge smoothing, MPI_SENDRECV of flux; MPI_Gather for transpose of 2D matrix;||ESSL|
|diagnosis.F90||global sums, write output files||NERSC|
|pushi_vec.F90||vectorizable version of ion particle push||SX performance trace routines|
|pushi.f90||ion particle push, MPI_ALLREDUCE for various diagnostics|
|pushe.f90||electron particle push, MPI_ALLREDUCE for various diagnostics|
|snapshot.f90||write particle data for re-run, MPI_REDUCE for ?|
|field.f90||finite differencing for electric field, MPI_SENDRECV|
|function.f90||various Fortran90 function definitions|
|function_alt.f90||version for some compilers that don't like the function declarations in function.f90|
|load.f90||initialize particle position and velocity; not timed??|
|poisson.f90||iterative poisson solve first locally and then globally via MPI_ALLGATHER|
|restart.f90||restart dump write/read|
|set_random_values.f90||random number generators with Fortran90 and Fortran77 interfaces|
|fft_cray.f90, fft_essl.f90, fft_gl.f90, fft_nag.f90, fft_sx6.f90||interfaces to various system-specific FFT libraries|
Benchmark code GTC must be run using 64-bit floating-point precision for all non-integer data variables.
There is a subdirectory "run" in which input data files and sample batch submission scripts are located. The code does not accept any command-line arguments; however, an input data file named "gtc.input" is required.
All timings made by this code are via the timer() subroutine in main.F90. This routine returns both the CPU and wall (elapsed) time. The source should be modified by the vendor to provide the former for the particular hardware with acceptable accuracy and resolution; the latter is returned via MPI_WTIME() and should remain unchanged.
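As a rough illustration of the timer's contract (a Python analogue, not the Fortran routine the vendor is to modify), timer() returns a (CPU, wall) pair, and only the CPU half is platform-specific:

```python
import time

def timer():
    """Analogue of GTC's timer(): returns (cpu_seconds, wall_seconds).

    In the benchmark the wall-clock half comes from MPI_WTIME() and must
    remain unchanged; only the CPU-time call is vendor-specific. Here the
    standard-library equivalents stand in for both.
    """
    return time.process_time(), time.perf_counter()
```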
The time to be reported is the line "TOTAL WALL CLOCK TIME(SEC)."
The code is heavily instrumented, with eight different portions of the code timed separately. The initialization phase is timed, but for runs with large numbers of timesteps it comprises only a small portion of the total time. All MPI processes call the timers, but only the master process prints the times, and there are no barriers before the timing calls, so the reported times are those seen by the master rather than a synchronized collective measurement.
Here is a timing profile for the three sample problems, obtained from the machine Jacquard at NERSC.
Times in seconds:
Almost all of the storage required comes from the allocatable arrays for particle, field, and diagnostic data declared in module.F90 and allocated in setup.F90.
Memory Required (per MPI process) By The Sample Problems:
Three problems are to be run and their outputs reported; see the table below. The three problems use the same number of particles and spatial grid points; they differ only in concurrency (16, 64, and 256 domains, respectively) and in the number of timesteps. All input in these files is namelist-based and is explained at the end of each file; however, nothing in the three input files provided should be changed for the runs.
|Problem Size||Concurrency||Lattice Size|
The small case is used for testing. The medium and large cases are to be run with fixed concurrency. With pure MPI, the concurrency simply equals the number of MPI tasks. With hybrid MPI/OpenMP, the concurrency is the number of MPI tasks times the number of threads per MPI task. Computational nodes employed in the benchmark must be fully packed; that is, the number of processes or threads executing on each node must equal the number of physical processors on that node.
The input file must be named "gtc.input" and be in the current working directory. A problem description reflecting the input parameters is printed; it also indicates whether OpenMP threads were used in the run.
Different numbers of MPI tasks and/or different compilers on different platforms can lead to slightly different numerical truncation with GTC, so the final results in the xxx.history.out, xxx.sheareb.out, and xxx.snap015.out files often do not match bit-for-bit.
There is an exponential growth period of the turbulent thermal flux calculated by GTC that occurs between roughly time step 1000 and time step 2000. The flux is printed in the 4th column of stderr or stdout.
A sample script "analyze_gtc" in the "run" subdirectory is provided to verify the correctness of the results. This script extracts the 5 columns of the result file between two time steps specified on the command line into a file "run_data.dat" and then calculates the flux growth rate. It can also visualize the results using gnuplot.
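The growth-rate calculation can be sketched as a log-linear fit over the chosen window (a hypothetical Python reimplementation, not the analyze_gtc script itself):

```python
import numpy as np

def growth_rate(steps, flux):
    """Estimate the exponential growth rate gamma of the thermal flux.

    During the growth phase flux(t) ~ A * exp(gamma * t), so fitting
    log(flux) linearly against the time step over the selected window
    and taking the slope yields gamma.
    """
    slope, _ = np.polyfit(steps, np.log(flux), 1)
    return slope
```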
On seaborg, the growth rates between time steps 1200 and 1800 are 0.0102292 and 0.00987657 for the medium and large benchmark runs, respectively. Your result should be very close to 0.01. An example in which gnuplot is not present is shown below.
This is Version 2 (03/28/2005)
[ZIN] Z. Lin, G. Rewoldt, S. Ethier, et al., "Particle-in-cell simulations of electron transport from plasma turbulence: recent progress in gyrokinetic particle simulations of turbulent plasmas," Journal of Physics: Conference Series 16 (2005) 16-24.
[GTC] Gyrokinetic Particle Simulations http://w3.pppl.gov/theory/proj_gksim.html
[FUS] SciDAC Fusion Research http://www.scidac.gov/FES/FES_GPSC.html
[ITER] ITER Fusion Research http://www.iter.org
[LEE] full-torus simulation of turbulent transport http://www.nersc.gov/news/annual_reports/annrep01/sh_FES_05.htm