NERSC Powering Scientific Discovery Since 1974

Stephane Ethier

FES Requirements Worksheet

1.1. Project Information - Global Gyrokinetic PIC Simulations of Plasma Microturbulence

Document Prepared By

Stephane Ethier

Project Title

Global Gyrokinetic PIC Simulations of Plasma Microturbulence

Principal Investigator

Weixing Wang

Participating Organizations


Funding Agencies


2. Project Summary & Scientific Objectives for the Next 5 Years

Please give a brief description of your project - highlighting its computational aspect - and outline its scientific objectives for the next 3-5 years. Please list one or two specific goals you hope to reach in 5 years.

We use global gyrokinetic particle-in-cell simulations to study all aspects of plasma micro-turbulence in the core of tokamak fusion devices. Our highly scalable GTS code takes the parameters of real experiments as input and carries out self-consistent simulations of particle, energy, and momentum transport due to micro-turbulence. One of our objectives is to continue our ongoing study of momentum transport under different conditions and for several existing tokamaks. This study is particularly relevant to ITER, so we plan to carry out predictive simulations of ITER to determine its capacity to generate intrinsic rotation. This will include a study of the impact of kinetic electrons, mainly through trapped-electron modes.
If ready, we will also start the study of finite-beta effects in current tokamaks, mainly in NSTX where these effects are believed to be very important.

3. Current HPC Usage and Methods

3a. Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)

Our production code, GTS, uses the highly scalable particle-in-cell (PIC) method, in which simulation particles are moved along the characteristics in phase space. This reduces the complex gyro-averaged Vlasov equation, a 5-dimensional partial differential equation, to a simple system of ordinary differential equations. Straight-field-line magnetic coordinates in toroidal geometry are employed since they are the natural coordinates for describing the complex tokamak magnetic equilibrium field, and they lead to very accurate time-stepping, even when a relatively low-order method, such as the second-order Runge-Kutta algorithm, is employed. In the PIC method, a grid replaces the direct binary interaction between particles: the charge of the particles is accumulated on the grid at every time step, the electro-magnetic field is solved on the grid, and the field is then gathered back to the particles’ positions. The associated grid is built according to the profiles determined from the experimental data of the tokamak shots under investigation. This ensures a uniform coverage of the phase space in terms of resolution. The field on the grid is solved using the PETSc parallel solver library developed by the Mathematics and Computer Science Division at Argonne National Laboratory. With the combination of inner and outer iterations, and of the fast multi-grid solver in PETSc, the gyrokinetic Poisson equation in integral form, which contains multi-temporal and multi-spatial scale dynamics, is solved accurately in real space. Fully-kinetic electron physics is included using very few approximations while achieving reduced noise. A fully conserving (energy and momentum) Fokker-Planck collision operator is implemented in the code using a Monte Carlo algorithm.
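As a rough illustration of the deposit, solve, gather, and push cycle described above, the following Python sketch implements one step of a toy one-dimensional, periodic, electrostatic PIC model with a second-order Runge-Kutta (midpoint) push. It is not GTS: the function names, the linear particle weighting, and the simple integration used as a field solve are all assumptions made for illustration, whereas GTS works in toroidal magnetic coordinates and solves the gyrokinetic Poisson equation with PETSc.

```python
def _locate(x, n_cells, length):
    """Return (left node index, fractional offset) of x on a periodic grid."""
    s = (x % length) / (length / n_cells)
    i = int(s)
    return i % n_cells, s - i


def deposit_charge(positions, weights, n_cells, length):
    """Scatter step: accumulate particle charge on the grid (linear weighting)."""
    dx = length / n_cells
    rho = [0.0] * n_cells
    for x, w in zip(positions, weights):
        i, frac = _locate(x, n_cells, length)
        rho[i] += w * (1.0 - frac) / dx
        rho[(i + 1) % n_cells] += w * frac / dx
    return rho


def solve_field(rho, length):
    """Toy field solve (assumption): integrate dE/dx = rho - mean(rho)."""
    n = len(rho)
    dx = length / n
    mean_rho = sum(rho) / n  # uniform neutralizing background
    field = [0.0] * n
    for i in range(1, n):
        field[i] = field[i - 1] + dx * (rho[i - 1] - mean_rho)
    mean_field = sum(field) / n
    return [e - mean_field for e in field]


def gather_field(field, x, length):
    """Gather step: interpolate the grid field to a particle position."""
    n = len(field)
    i, frac = _locate(x, n, length)
    return field[i] * (1.0 - frac) + field[(i + 1) % n] * frac


def push_rk2(positions, velocities, field, length, dt, charge_to_mass=1.0):
    """Advance particles with the midpoint (second-order Runge-Kutta) rule."""
    new_x, new_v = [], []
    for x, v in zip(positions, velocities):
        a1 = charge_to_mass * gather_field(field, x, length)
        x_mid = x + 0.5 * dt * v
        v_mid = v + 0.5 * dt * a1
        a2 = charge_to_mass * gather_field(field, x_mid, length)
        new_x.append((x + dt * v_mid) % length)
        new_v.append(v + dt * a2)
    return new_x, new_v


def pic_step(positions, velocities, weights, n_cells, length, dt):
    """One PIC cycle: deposit charge, solve the field, gather and push."""
    rho = deposit_charge(positions, weights, n_cells, length)
    field = solve_field(rho, length)
    return push_rk2(positions, velocities, field, length, dt)
```

In this sketch the field is held fixed during both Runge-Kutta stages of a step, which is a simplification. Linear weighting is chosen because it conserves the total deposited charge exactly.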
GTS has 3 levels of parallelism: (1) a one-dimensional domain decomposition in the toroidal direction (the long way around the torus), which divides both grid and particles; (2) a particle distribution within each domain, which further divides the particles between processors; and (3) a loop-level multi-threading method, which can be used to further divide the computational work within a multi-core node. The domain decomposition and particle distribution are implemented with MPI, while the loop-level multi-threading is implemented with OpenMP directives. The latter is very useful for overcoming the bandwidth contention between the multiple cores within a node, which could be an issue on Hopper. Overall, the 3 levels of parallelism make for a highly scalable code that can run on hundreds of thousands of processor cores.
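The three levels can be pictured with a small Python sketch of the bookkeeping involved. This is only a schematic under stated assumptions: the function names, the domain-major rank ordering, and the block-wise split of loop iterations are hypothetical and are not taken from the GTS source, where the first two levels are MPI and the third is OpenMP.

```python
import math


def rank_layout(rank, n_toroidal_domains, ranks_per_domain):
    """Map an MPI-style rank to (toroidal domain, member index within domain).

    Assumes a domain-major ordering, which is an illustrative choice.
    """
    if not 0 <= rank < n_toroidal_domains * ranks_per_domain:
        raise ValueError("rank out of range")
    return divmod(rank, ranks_per_domain)


def toroidal_domain_of(zeta, n_toroidal_domains):
    """Level 1: the toroidal domain that owns a particle at toroidal angle zeta."""
    two_pi = 2.0 * math.pi
    index = int((zeta % two_pi) / two_pi * n_toroidal_domains)
    return index % n_toroidal_domains


def thread_chunk(n_items, n_threads, thread_id):
    """Level 3: contiguous block of loop iterations handled by one thread."""
    base, extra = divmod(n_items, n_threads)
    start = thread_id * base + min(thread_id, extra)
    stop = start + base + (1 if thread_id < extra else 0)
    return range(start, stop)


def total_cores(n_toroidal_domains, ranks_per_domain, threads_per_rank):
    """Core count implied by the three levels taken together."""
    return n_toroidal_domains * ranks_per_domain * threads_per_rank
```

For example, with hypothetical values of 64 toroidal domains, 256 ranks per domain, and 6 threads per rank, `total_cores(64, 256, 6)` gives 98,304 cores. Level 2 corresponds to the member index returned by `rank_layout`: every rank in a domain holds the same grid section but only its own share of that domain's particles.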
The size of the simulations is determined primarily by the number of particles, although the number of grid points can also be important when simulating large devices such as ITER.

3b. Please list known limitations, obstacles, and/or bottlenecks that currently limit your ability to perform simulations you would like to run. Is there anything specific to NERSC?

The PETSc library has given us some problems from time to time, mainly in the way it interacts with other libraries on the system. This has to do mainly with the installation procedure. The I/O at scale has also been causing problems for GTS due to its complex and sometimes obscure tuning procedure. We are in the process of implementing ADIOS in our code, which should solve some of these I/O issues. Our main limitation at NERSC remains the size of our allocation. We are hoping that the extra resources brought about by Hopper will alleviate some of those limitations. In 5 years we expect the PETSc solver to be the main limitation for the code unless a hybrid version is developed and tuned for multi-core.

3c. Please fill out the following table to the best of your ability. This table provides baseline data to help extrapolate to requirements for future years. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions.

Facilities Used or Using

 NERSC  OLCF  ALCF  NSF Centers  Other:  

Architectures Used

 Cray XT  IBM Power  BlueGene  Linux Cluster  Other:  

Total Computational Hours Used per Year

 30,000,000 Core-Hours

NERSC Hours Used in 2009

 5,800,000 Core-Hours

Number of Cores Used in Typical Production Run

8,192 to 98,304

Wallclock Hours of Single Typical Production Run


Total Memory Used per Run

 16,000 to 100,000 GB

Minimum Memory Required per Core

 1 GB

Total Data Read & Written per Run

2.5 TB

Size of Checkpoint File(s)

 1 – 8 GB

Amount of Data Moved In/Out of NERSC

5 GB per run

On-Line File Storage Required (For I/O from a Running Job)

4 TB and 10,000 Files

Off-Line Archival Storage Required

 25 TB and  Files

Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.


4. HPC Requirements in 5 Years

4a. We are formulating the requirements for NERSC that will enable you to meet the goals you outlined in Section 2 above. Please fill out the following table to the best of your ability. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions at the workshop.

Computational Hours Required per Year


Anticipated Number of Cores to be Used in a Typical Production Run

100,000 – 500,000

Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above


Anticipated Total Memory Used per Run

 160,000 – 500,000 GB

Anticipated Minimum Memory Required per Core


Anticipated total data read & written per run

32 – 160 TB

Anticipated size of checkpoint file(s)

 1 GB

Anticipated Amount of Data Moved In/Out of NERSC

 10 GB per run

Anticipated On-Line File Storage Required (For I/O from a Running Job)

50 TB and  Files

Anticipated Off-Line Archival Storage Required

 500 TB and  Files

4b. What changes to codes, mathematical methods and/or algorithms do you anticipate will be needed to achieve this project's scientific objectives over the next 5 years.

The implementation of a fully electro-magnetic model in the code will put more emphasis on the multi-grid solver. We expect the time spent in the solver to increase fivefold compared to the current code running a simulation with the same number of grid points. The time spent in the charge deposition phase and in the gather-push phase will each double, since we will now deposit the electric current as well as the charge, and gather the magnetic force as well as the electrostatic force.
We also plan to run a full-f version of GTS, where the particles will describe the complete particle distribution function. The number of particles will need to increase by about 100 times, making those simulations very expensive. 
Impurity species will also be added to the code, further increasing the computational cost. 
We intend to implement a new dimension of domain decomposition in GTS in order to decrease the memory requirements per MPI process and improve strong scaling. 
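The scaling factors quoted above (solver time up fivefold, deposition and gather-push time doubled, roughly 100 times more particles for full-f) can be combined into a back-of-the-envelope cost estimate. The Python sketch below is only that: the function name is hypothetical, the baseline time fractions passed in are placeholders that would have to come from profiling, and it assumes that particle work scales linearly with particle count and that all other costs stay fixed.

```python
def relative_cost(solver_fraction, particle_fraction,
                  solver_factor=5.0, particle_factor=2.0,
                  particle_count_factor=1.0):
    """Estimate run time relative to the current code (current code = 1.0).

    solver_fraction       -- share of current run time spent in the field solver
    particle_fraction     -- share spent in charge deposition and gather-push
    solver_factor         -- solver slowdown (5x for the electro-magnetic model)
    particle_factor       -- per-particle slowdown (2x for the electro-magnetic model)
    particle_count_factor -- increase in particle count (about 100x for full-f)
    """
    if solver_fraction < 0 or particle_fraction < 0:
        raise ValueError("fractions must be non-negative")
    if solver_fraction + particle_fraction > 1.0:
        raise ValueError("fractions must sum to at most 1")
    other_fraction = 1.0 - solver_fraction - particle_fraction
    return (solver_fraction * solver_factor
            + particle_fraction * particle_factor * particle_count_factor
            + other_fraction)
```

With purely hypothetical baseline fractions of 10% solver and 80% particle work, `relative_cost(0.1, 0.8)` comes to roughly 2.2 for the electro-magnetic model alone, and `relative_cost(0.1, 0.8, particle_count_factor=100.0)` to roughly 160 when full-f is added. These figures illustrate the arithmetic only and are not measurements of GTS.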

4c. Please list any known or anticipated architectural requirements (e.g., 2 GB memory/core, interconnect latency < 1 μs).

A low interconnect latency and fast gather-scatter to memory are the most important hardware features for our 3D PIC codes.

4d. Please list any new software, services, or infrastructure support you will need over the next 5 years.


4e. It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores, perhaps GPUs or other accelerators. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.

We will continue to push our current mixed-mode MPI-OpenMP model as far as we can. We are also working with the Future Technologies Group at LBL on tuning the gather-scatter algorithm on multi-core processors and GPUs using different approaches, such as mixed precision and particle sorting.
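To illustrate the particle-sorting idea mentioned above, the Python sketch below reorders particles by grid cell with a stable counting sort, so that particles touching the same grid points sit next to each other in memory during the gather and scatter phases. This is a generic technique shown under assumptions: the function name and the flat integer cell index are hypothetical, and the sketch does not represent the approach actually being tuned for GTS.

```python
def sort_particles_by_cell(cell_ids, n_cells):
    """Return particle indices ordered by cell (stable counting sort, O(N + cells))."""
    # counts[c + 1] first holds the number of particles in cell c.
    counts = [0] * (n_cells + 1)
    for c in cell_ids:
        counts[c + 1] += 1
    # Prefix sum: counts[c] becomes the first output slot for cell c.
    for c in range(n_cells):
        counts[c + 1] += counts[c]
    next_slot = counts[:n_cells]  # slicing makes a working copy
    order = [0] * len(cell_ids)
    for particle, c in enumerate(cell_ids):
        order[next_slot[c]] = particle
        next_slot[c] += 1
    return order


def apply_order(values, order):
    """Reorder one particle array (positions, velocities, weights, ...)."""
    return [values[i] for i in order]
```

The same permutation would be applied to every per-particle array so that they stay aligned. A counting sort is used here because the keys are small integers, which makes the cost linear in the number of particles.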

New Science With New Resources

To help us get a better understanding of the quantitative requirements we've asked for above, please tell us: What significant scientific progress could you achieve over the next 5 years with access to 50X the HPC resources you currently have access to at NERSC? What would be the benefits to your research field if you were given access to these kinds of resources?

Please explain what aspects of "expanded HPC resources" are important for your project (e.g., more CPU hours, more memory, more storage, more throughput for small jobs, ability to handle very large jobs).