Powering Scientific Discovery Since 1974

Code Description

GTC: 3D Gyrokinetic Toroidal Code

General Description

GTC is a three-dimensional code used to study microturbulence in magnetically confined toroidal fusion plasmas via the Particle-In-Cell (PIC) method. It solves the gyro-averaged Vlasov equation in real space using global gyrokinetic techniques and an electrostatic approximation. The Vlasov equation describes the evolution of a system of particles under the effects of self-consistent electromagnetic fields. The unknown is the distribution function f(t,x,v) of particles (electrons, ions, ...) in phase space, a function of time t, position x, and velocity v. This model assumes a collisionless plasma; i.e., the particles interact with one another only through the self-consistent field, and thus the algorithm scales as N rather than N^2, where N is the number of particles. [GTC]

A "classical" PIC algorithm involves using the notion of discrete "particles" to sample a charge distribution function. The particles interact via a grid, on which the potential is calculated from charges "deposited" at the grid points. Four steps are involved iteratively:

1. "SCATTER," or deposit, charges on the grid. A particle typically contributes to many grid points, and many particles contribute to any one grid point.
2. Solve the Poisson equation for the electrostatic potential.
3. "GATHER" the forces on each particle from the resultant potential.
4. "PUSH" (move) the particles to new locations.
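A minimal, self-contained sketch of one cycle of these four steps, for a 1-D periodic electrostatic plasma, may help fix ideas (a Python illustration with linear weighting and a spectral field solve; none of the routine or variable names come from GTC itself):

```python
import numpy as np

def pic_step(x, v, q_over_m, ng, L, dt):
    """One iteration of the four classical PIC steps on a 1-D periodic grid."""
    dx = L / ng
    g = x / dx
    i = np.floor(g).astype(int) % ng     # left-hand grid point of each particle
    w = g - np.floor(g)                  # fractional distance past that point

    # 1. SCATTER: deposit each particle's (unit) charge on its two nearest
    #    grid points with linear weights; many particles may hit one point.
    rho = np.zeros(ng)
    np.add.at(rho, i, 1.0 - w)
    np.add.at(rho, (i + 1) % ng, w)
    rho = rho / dx
    rho -= rho.mean()                    # neutralizing background charge

    # 2. FIELD SOLVE: Poisson equation phi'' = -rho, solved spectrally.
    k = 2.0 * np.pi * np.fft.fftfreq(ng, d=dx)
    rho_k = np.fft.fft(rho)
    phi_k = np.zeros(ng, dtype=complex)
    phi_k[1:] = rho_k[1:] / k[1:] ** 2
    phi = np.real(np.fft.ifft(phi_k))
    E = -np.gradient(phi, dx)            # E = -dphi/dx

    # 3. GATHER: interpolate the field back to each particle with the
    #    same linear weights used in the scatter.
    E_p = (1.0 - w) * E[i] + w * E[(i + 1) % ng]

    # 4. PUSH: advance velocities and positions; wrap periodically.
    v = v + q_over_m * E_p * dt
    x = (x + v * dt) % L
    return x, v
```

In GTC the field solve is gyrokinetic and the grid is field-aligned, but the scatter/solve/gather/push structure is the same.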

In GTC, point-charge particles are replaced by charged rings, using a four-point gyro-averaging scheme.

An important approximation in GTC comes from the fact that fast particle motion along the magnetic field lines leads to a quasi-two-dimensional structure in the electrostatic potential. Thus, the Poisson Equation needs only to be solved on a 2-D poloidal plane. GTC uses a logically non-rectangular field-aligned grid, in part, to keep the number of particles per cell nearly constant. The mesh effectively twists along with the magnetic field in the toroidal direction.

A PIC simulation is characterized by the number of particles used and the size of the mesh. GTC has been used in runs with over 1 billion particles and over 100 million grid points.

Coding

The GTC code has been optimized to achieve high efficiency on both cache-based superscalar and vector nodes. Both vector and superscalar versions of the Poisson equation solver are provided.

In GTC, the main bottleneck is usually the charge deposition, or scatter operation, as is true for most particle codes. The classic scatter algorithm consists of a loop over the particles, finding the nearest grid points surrounding each particle's position. A fraction of the particle's charge is assigned to each of these grid points according to its distance from the particle's position, and the charge fractions are accumulated in a grid array. The scatter algorithm in GTC is more complex because the quickly gyrating particles are represented by charged rings tracked by their guiding centers. This results in a larger number of operations than in "classical" PIC, since several points are picked on each ring and each of them has its own neighboring grid points.
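The ring-averaged scatter can be sketched as follows, assuming a uniform 2-D grid with periodic boundaries for clarity (the actual GTC mesh is field-aligned and non-rectangular, and all names here are illustrative):

```python
import numpy as np

def gyro_scatter(xg, yg, rho_gyro, grid, dx):
    """Deposit one particle's charge by four-point gyro-averaging:
    a quarter charge at each of four points on the gyro-ring of radius
    rho_gyro around the guiding center (xg, yg), each quarter spread to
    its four neighboring grid points with bilinear weights."""
    ny, nx = grid.shape
    for k in range(4):                       # four points on the ring
        ang = 0.5 * np.pi * k
        px = xg + rho_gyro * np.cos(ang)
        py = yg + rho_gyro * np.sin(ang)
        i = int(np.floor(px / dx))           # lower-left grid point
        j = int(np.floor(py / dx))
        fx = px / dx - i                     # fractional offsets
        fy = py / dx - j
        for di, wx in ((0, 1.0 - fx), (1, fx)):
            for dj, wy in ((0, 1.0 - fy), (1, fy)):
                grid[(j + dj) % ny, (i + di) % nx] += 0.25 * wx * wy
```

Note the cost: 16 grid updates per particle here, versus 4 for a point particle in 2-D.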

This version of GTC is written entirely in free-format Fortran90/95, about 4300 lines of code (17 source files), including routines for which multiple versions are provided.

Authorship

Written by Zhihong Lin, Stephane Ethier, and Jerome Lewandowski of the Princeton Plasma Physics Laboratory.

GTC is used for fusion energy research via the DOE SciDAC program [FUS] and for the international fusion collaboration [ITER]. Support for it comes from the DOE Office of Fusion Energy Sciences. It is used for studying neoclassical and turbulent transport in tokamaks and stellarators, as well as for investigating hot-particle physics, toroidal Alfvén modes, and neoclassical tearing modes [LEE].

Parallelization

The primary parallel programming model for GTC is a 1-D domain decomposition with each MPI process assigned a toroidal section and MPI_SENDRECV used to shift particles moving between domains. A variety of both grid-based and particle-based loops can also be multitasked via OpenMP directives included in the code. MPI library calls and OpenMP directives are not isolated in any one routine or source file. Builds can be MPI-only or combined OpenMP/MPI.

In GTC the entire calculation is done in real space, including the Poisson solve, thereby improving parallel scalability. In a typical simulation only about 10-15% of the time is spent in interprocessor communication. However, due to the gather/scatter nature of the charge deposition in PIC codes, GTC is very sensitive to memory latency [Ethier].

The GTC distribution includes highly vectorizable routines for the charge scatter, Poisson solve, particle push, and particle shift between domains. The vectorizable scatter routines can make use of gather/scatter hardware and have been vectorized using an algorithm that duplicates charge "bins" at the expense of additional memory in order to eliminate loop-carried dependences. "Atomic" memory operations could also mitigate the effect of the dependences. The duplication is done to the extent of the hardware vector length. The same loop-carried dependences and method for eliminating them apply to OpenMP parallelization of these loops.
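The duplication scheme can be illustrated with a short sketch: each vector lane (or OpenMP thread) accumulates into its own private copy of the charge array, so no two lanes ever update the same memory location in the same step, and a final reduction sums the copies (a Python illustration; in GTC this is done in Fortran with the number of copies equal to the hardware vector length):

```python
import numpy as np

def scatter_private_copies(idx, w, ng, ncopies):
    """Conflict-free charge deposition via duplicated charge bins.

    idx[p] is the grid bin particle p deposits into, w[p] its weight.
    ncopies plays the role of the hardware vector length (or thread count).
    """
    rho = np.zeros((ncopies, ng))
    for p in range(len(idx)):
        lane = p % ncopies            # each lane owns a private copy, so
        rho[lane, idx[p]] += w[p]     # concurrent lanes never collide
    return rho.sum(axis=0)            # reduction over the private copies
```

The technique trades ncopies times the grid memory for removal of the loop-carried dependence; the same idea underlies the OpenMP parallelization of these loops.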

Obtaining the Code

For NERSC-related procurements please visit the procurement site.

Due to licensing restrictions GTC cannot be downloaded from NERSC. To obtain the code visit the PPPL site.

You can obtain the NERSC-5 GTC benchmark input data files here (gzip tar file).

Building the Code

GNUmake (gmake) is required. Several files are preprocessed inline by the compiler. All files are either .F90 or .f90; there are no header files.

All source files are in a single directory and there is a single Makefile with definitions for several different systems included (IRIX64, AIX, Linux, SUPER-UX, and UNICOS). The makefile runs the uname -s command to detect the operating system automatically.

Note well that some compilers may not like the way function types are declared in the file function.f90. An alternate version of this file is provided. Rename function_alt.f90 to function.f90 to solve the problem.

Two basic versions of the code can be built:

Version              Executable Name
MPI-only             gtcmpi
Combined MPI/OpenMP  gtc

The basic command to build the code is

gmake <opt1=y|n> <opt2=y|n> ...

Options included in the Makefile are:

OPENMP=y|n
With or without OpenMP support; default is y.
DOUBLE_PRECISION=y
8-byte floating point precision
DEBUG=y|n
With or without debug option (-g). The default is n.
ESSL=y|n
With or without the AIX Engineering Scientific Subroutine Library, used for the FFT routines. The default is n.
64BITS=y|n
64- or 32-bit compilation mode on AIX. The default is y (64-bit).
PGI=y
Use the PGI compiler (pgf90) on Linux. The default is to use the Lahey-Fujitsu compiler lf95.
ALTIX=y
Compiles with Intel compilers on the Altix using ifort ... -lmpi

The vendor is expected to provide the appropriate system calls in timers.f to return local processor CPU time and global elapsed (wall clock) time.

The MPI ranks are identified by the variable "mype" and use mype=0 as the master.

Build-Related Files in This Distribution

main.F90
Main program; calls MPI_Init.
module.F90
Precision, parameter, and global array declarations. Preprocessing: double precision and NEC SX.
setup.F90
MPI_comm_size, MPI_comm_rank. Preprocessing: AIX, OpenMP, Debug.
setup_vec.F90
Same as setup, with additional code to minimize bank conflicts on vector systems. Preprocessing: AIX, OpenMP, SX, Debug.
shifte.F90
Move electrons across domains. Preprocessing: OpenMP.
shifti.F90
Move ions across domains. Preprocessing: OpenMP.
shifti_vec.F90
Vectorizable version with SX-6-specific performance monitoring calls. Preprocessing: SX.
chargee.F90
Interpolate electron charge density to the grid, first locally and then globally, using MPI_SENDRECV and MPI_ALLREDUCE. Preprocessing: OpenMP.
chargei.F90
Interpolate ion charge density to the grid, first locally and then globally, using MPI_SENDRECV and MPI_ALLREDUCE. Preprocessing: OpenMP.
chargei_vec.F90
Vectorizable version. Preprocessing: SX and OpenMP; SX compiler directives.
tracking.F90
Tag each particle with a unique number that is carried with it as it moves from one processor/domain to another; write track data to HDF files when doing a snapshot. Preprocessing: HDF.
smooth.F90
Charge smoothing; MPI_SENDRECV of flux; MPI_Gather for transpose of 2-D matrix. Preprocessing: ESSL.
diagnosis.F90
Global sums; write output files. Preprocessing: NERSC.
pushi_vec.F90
Vectorizable version of the ion particle push. Preprocessing: SX performance trace routines.
pushi.f90
Ion particle push; MPI_ALLREDUCE for various diagnostics.
pushe.f90
Electron particle push; MPI_ALLREDUCE for various diagnostics.
snapshot.f90
Write particle data for re-run; MPI_REDUCE for ?
field.f90
Finite differencing for the electric field; MPI_SENDRECV.
function.f90
Various Fortran 90 function definitions.
function_alt.f90
Alternate version for compilers that do not accept the function declarations in function.f90.
load.f90
Initialize particle position and velocity; not timed(?).
poisson.f90
Iterative Poisson solve, first locally and then globally via MPI_ALLGATHER.
poisson_vec.f90
Vectorizable version.
set_random_values.f90
Random number generators with Fortran 90 and Fortran 77 interfaces.
fft_cray.f90, fft_essl.f90, fft_gl.f90, fft_nag.f90, fft_sx6.f90
Interfaces to various system-specific FFT libraries.
Makefile
The make file.

Running the Code

Benchmark code GTC must be run using 64-bit floating-point precision for all non-integer data variables.

There is a subdirectory "run" in which input data files and sample batch submission scripts are located. The code does not accept any command-line arguments; however, an input data file named "gtc.input" is required.

Timing Issues

All timings made by this code are via the timer() subroutine in main.F90. This routine returns both the CPU and wall (elapsed) time. The source should be modified by the vendor to provide the former for the particular hardware with acceptable accuracy and resolution; the latter is returned via MPI_WTIME() and should remain unchanged.

The time to be reported is the line "TOTAL WALL CLOCK TIME(SEC)."

The code is heavily instrumented, with eight different portions of the code timed separately. The initialization phase is timed, but for runs with large numbers of timesteps it comprises only a small portion of the total time. All MPI processes call the timers, but only the master process prints the times; since there are no barriers before the timing calls, the reported times are not a synchronized collective measurement across all processors.

Here is a timing profile for the three sample problems, obtained from the machine Jacquard at NERSC.

             Small   Medium    Large
PEs              4       64      256
Particles   323590   323590   323590
Grid points  32449    32449    32449
# domains        4       64      256
Timesteps      100     2000     2000

Times in seconds:
pusher       148.4   3,191.   3,251.
shift         5.35    481.5   1,337.
charge       156.5   3,483.   3,713.
poisson       19.1    765.0    760.8
smooth        10.6    218.6    403.3
field          3.6     86.1     87.2
total        347.3   8,229.   9,559.
%Pusher        43%      39%      34%
%Shifter        2%       6%      14%
%Poisson        5%       9%       8%
%Smoother       3%       3%       4%
%Charge        45%      42%      39%
Storage Issues

Almost all of the storage requirement comes from the allocatable arrays for particle, field, and diagnostic data declared in module.F90 and allocated in setup.F90.

Memory Required (per MPI process) By The Sample Problems:

Small   0.26 GB
Medium  0.26 GB
Large   0.26 GB
Required Runs

Three problems are to be run and their outputs reported; see the table below. The three problems use the same number of particles and spatial grid points; they differ only in concurrency (16, 64, and 256 domains, respectively) and in the number of timesteps. All input in these files is namelist-based and is explained at the end of the files; nothing in the three input files provided should be changed for the runs.

Problem Size  Concurrency
small                  16
medium                 64
large                 256

The small case is used for testing. The medium and large cases are to be run with fixed concurrency. With pure MPI, the concurrency simply equals the number of MPI tasks. With hybrid MPI/OpenMP, the concurrency is the number of MPI tasks times the number of threads per MPI task. Computational nodes employed in the benchmark must be fully packed; that is, the number of processes or threads executing on each node must equal the number of physical processors on that node.

The input file must be named "gtc.input" and be in the current working directory. A problem description is printed which reflects the input parameters. It also tells if OpenMP threads were used in the run.

Verifying Results

Different numbers of MPI tasks and/or different compilers on different platforms can lead to slightly different numerical truncation with GTC, so the final results in the xxx.history.out, xxx.sheareb.out, and xxx.snap015.out files are often not bit-for-bit identical.

The thermal flux calculated by GTC exhibits a period of exponential turbulence growth between roughly time steps 1000 and 2000. The flux is printed in the fourth column of stderr or stdout.

A sample script, "analyze_gtc", in the "run" subdirectory is provided to verify the correctness of the results. This script extracts the five columns of the result file between two time steps specified on the command line into a file "run_data.dat" and then calculates the flux growth rate. It can also visualize the results using gnuplot.

On Seaborg, the growth rates between time steps 1200 and 1800 are 0.0102292 and 0.00987657 for the medium and large benchmark runs, respectively. Your result should be very close to 0.01. An example run on a system without gnuplot is shown below.

% analyze_gtc medium.err 1200 2000
data file = medium.err
t1 = 1000
t2 = 2000
growth rate = 0.00994677
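What the script computes amounts to a least-squares fit of log(flux) against time step over the chosen window; a rough Python equivalent of that step (the output file format and column positions are assumptions, not documented here):

```python
import numpy as np

def growth_rate(steps, flux):
    """Least-squares slope of log(flux) vs. time step, i.e. the
    exponential growth rate over the fitting window."""
    slope, _ = np.polyfit(np.asarray(steps, dtype=float),
                          np.log(np.asarray(flux, dtype=float)), 1)
    return slope
```

For a cleanly exponential flux(t) = A*exp(gamma*t) this recovers gamma; applied to the flux column of the GTC output between the two chosen time steps, it reproduces the growth rate the script reports.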

Modification Record

This is Version 2 (03/28/2005)

Bibliography

[ZIN] Z. Lin, G. Rewoldt, S. Ethier, et al., "Particle-in-cell simulations of electron transport from plasma turbulence: recent progress in gyrokinetic particle simulations of turbulent plasmas," Journal of Physics: Conference Series 16 (2005) 16-24.

[GTC] Gyrokinetic Particle Simulations, http://w3.pppl.gov/theory/proj_gksim.html

[FUS] SciDAC Fusion Research, http://www.scidac.gov/FES/FES_GPSC.html

[ITER] ITER Fusion Research, http://www.iter.org

[LEE] Full-torus simulation of turbulent transport, http://www.nersc.gov/news/annual_reports/annrep01/sh_FES_05.htm

[Ethier] S. Ethier, W. Tang, and Z. Lin, "Gyrokinetic particle-in-cell simulations of plasma microturbulence," Journal of Physics: Conference Series 16 (2005) 1-15.