Stephane Ethier

FES Requirements Worksheet

1. Project Information - Global Gyrokinetic PIC Simulations of Plasma Microturbulence

Document Prepared By	Stephane Ethier
Project Title	Global Gyrokinetic PIC Simulations of Plasma Microturbulence
Principal Investigator	Weixing Wang
Participating Organizations	PPPL
Funding Agencies	DOE SC DOE NSA NSF NOAA NIH Other:

2. Project Summary & Scientific Objectives for the Next 5 Years

Please give a brief description of your project - highlighting its computational aspect - and outline its scientific objectives for the next 3-5 years. Please list one or two specific goals you hope to reach in 5 years.

We use global, gyrokinetic particle-in-cell simulations to study all aspects of plasma micro-turbulence in the core of tokamak fusion devices. Our highly scalable GTS code takes as input the parameters of real experiments to carry out self-consistent simulations of particle, energy, and momentum transport due to micro-turbulence. One of our objectives is to continue our ongoing study of momentum transport under different conditions and for several existing tokamaks. This study is particularly relevant to ITER, so we plan to carry out predictive simulations of ITER to determine its capacity to generate intrinsic rotation. This will include a study of the impact of kinetic electrons, mainly through trapped-electron modes.

If the required code capabilities are ready, we will also start the study of finite-beta effects in current tokamaks, mainly in NSTX, where these effects are believed to be very important.

3. Current HPC Usage and Methods

3a. Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)

Our production code, GTS, uses the highly scalable particle-in-cell (PIC) method, in which simulation particles are moved along the characteristics in phase space. This reduces the complex gyro-averaged Vlasov equation, a 5-dimensional partial differential equation, to a simple system of ordinary differential equations. Straight-field-line magnetic coordinates in toroidal geometry are employed because they are the natural coordinates for describing the complex tokamak magnetic equilibrium field, and they lead to very accurate time-stepping even when a relatively low-order method, such as the second-order Runge-Kutta algorithm, is employed.

In the PIC method, a grid replaces the direct binary interaction between particles: the charge of the particles is accumulated on the grid at every time step, the electro-magnetic field is solved on the grid, and the field is then gathered back to the particles’ positions. The associated grid is built according to the profiles determined from the experimental data of the tokamak shots under investigation. This ensures a uniform coverage of the phase space in terms of resolution.

The field on the grid is solved using the PETSc parallel solver library developed by the Mathematics and Computer Science Division at Argonne National Laboratory. With the combination of inner and outer iterations, and of the fast multi-grid solver in PETSc, the integral form of the gyrokinetic Poisson equation, which contains multi-temporal and multi-spatial scale dynamics, is solved accurately in real space. Fully kinetic electron physics is included using very few approximations while achieving reduced noise. A fully conserving (energy and momentum) Fokker-Planck collision operator is implemented in the code using a Monte Carlo algorithm.
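The deposit, solve, gather, and push cycle described above can be illustrated with a toy example. The sketch below is not GTS code: it assumes a 1D periodic electrostatic plasma with a uniform neutralizing background, linear particle weighting, a crude integration of Gauss's law in place of the PETSc solve, and a field held fixed over both Runge-Kutta stages. All function and parameter names are hypothetical.

```python
# Toy 1D periodic electrostatic PIC step (illustrative only; not GTS code).
import math


def deposit(xs, q, ng, dx):
    """Scatter: accumulate particle charge on the grid with linear weights."""
    rho = [0.0] * ng
    for x in xs:
        s = x / dx
        i = int(math.floor(s)) % ng
        w = s - math.floor(s)
        rho[i] += q * (1.0 - w) / dx
        rho[(i + 1) % ng] += q * w / dx
    return rho


def solve_field(rho, dx):
    """Crude stand-in for the field solve: integrate dE/dx = rho - <rho>."""
    ng = len(rho)
    mean = sum(rho) / ng  # assumed uniform neutralizing background
    e = [0.0] * ng
    for i in range(1, ng):
        e[i] = e[i - 1] + dx * (rho[i - 1] - mean)
    avg = sum(e) / ng
    return [v - avg for v in e]  # remove the arbitrary constant offset


def gather(e, x, dx):
    """Gather: interpolate the grid field back to a particle position."""
    ng = len(e)
    s = x / dx
    i = int(math.floor(s)) % ng
    w = s - math.floor(s)
    return e[i] * (1.0 - w) + e[(i + 1) % ng] * w


def push_rk2(xs, vs, e, dx, dt, qm, length):
    """Midpoint (second-order Runge-Kutta) step for dx/dt = v, dv/dt = (q/m) E."""
    new_x, new_v = [], []
    for x, v in zip(xs, vs):
        x_half = (x + 0.5 * dt * v) % length
        v_half = v + 0.5 * dt * qm * gather(e, x, dx)
        new_x.append((x + dt * v_half) % length)
        new_v.append(v + dt * qm * gather(e, x_half, dx))
    return new_x, new_v


def pic_step(xs, vs, q, qm, ng, length, dt):
    """One full cycle: deposit charge, solve the field, gather and push."""
    dx = length / ng
    rho = deposit(xs, q, ng, dx)
    e = solve_field(rho, dx)
    return push_rk2(xs, vs, e, dx, dt, qm, length)
```

In this toy setting, a uniformly spaced set of particles deposits a uniform charge density, so the field vanishes and the velocities are unchanged by a step. The real code differs in geometry, dimensionality, gyro-averaging, and solver.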

GTS has 3 levels of parallelism: a one-dimensional domain decomposition in the toroidal direction (the long way around the torus), which divides both grid and particles; a particle distribution within each domain, which further divides the particles between processors; and loop-level multi-threading, which can be used to further divide the computational work within a multi-core node. The domain decomposition and particle distribution are implemented with MPI, while the loop-level multi-threading is implemented with OpenMP directives. The latter is very useful for overcoming the bandwidth contention between the multiple cores within a node, which could be an issue on Hopper. Overall, the 3 levels of parallelism make for a highly scalable code that can run on hundreds of thousands of processor cores.
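As a rough illustration of how the three levels multiply together, the sketch below maps an MPI rank to a toroidal domain and a particle group within that domain, and counts the total cores once threads are included. The contiguous rank ordering, the function names, and the example factorization (64 domains, 256 ranks per domain, 6 threads per rank) are assumptions made for illustration and are not taken from GTS.

```python
# Hypothetical layout of the three levels of parallelism (not GTS code).

def rank_layout(rank, n_toroidal, n_per_domain):
    """Map an MPI rank to (toroidal domain, particle group within the domain).

    Assumes ranks are numbered contiguously within each toroidal domain.
    """
    if not 0 <= rank < n_toroidal * n_per_domain:
        raise ValueError("rank out of range")
    return divmod(rank, n_per_domain)


def total_cores(n_toroidal, n_per_domain, threads_per_rank):
    """Cores used when every MPI rank runs threads_per_rank OpenMP threads."""
    return n_toroidal * n_per_domain * threads_per_rank


if __name__ == "__main__":
    # Example factorization chosen only so the product is 98,304 cores.
    print(rank_layout(257, 64, 256))
    print(total_cores(64, 256, 6))
```

Because the thread count multiplies the core count without adding MPI processes, the third level scales the run up within a node while leaving the number of domains and particle groups unchanged.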

The size of the simulations is determined primarily by the number of particles, although the number of grid points can also be important when simulating large devices such as ITER.

3b. Please list known limitations, obstacles, and/or bottlenecks that currently limit your ability to perform simulations you would like to run. Is there anything specific to NERSC?

The PETSc library has given us some problems from time to time, mainly in the way it interacts with other libraries on the system; this has to do mainly with the installation procedure. I/O at scale has also caused problems for GTS because of its complex and sometimes obscure tuning procedure. We are in the process of implementing ADIOS in our code, which should solve some of these I/O issues. Our main limitation at NERSC remains the size of our allocation. We are hoping that the extra resources brought about by Hopper will alleviate some of those limitations. In 5 years we expect the PETSc solver to be the main limitation for the code unless a hybrid version is developed and tuned for multi-core.

3c. Please fill out the following table to the best of your ability. This table provides baseline data to help extrapolate to requirements for future years. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions.

Facilities Used or Using	NERSC OLCF ALCF NSF Centers Other:
Architectures Used	Cray XT IBM Power BlueGene Linux Cluster Other:
Total Computational Hours Used per Year	30,000,000 Core-Hours
NERSC Hours Used in 2009	5,800,000 Core-Hours
Number of Cores Used in Typical Production Run	8,192 to 98,304
Wallclock Hours of Single Typical Production Run	72
Total Memory Used per Run	16,000 to 100,000 GB
Minimum Memory Required per Core	1 GB
Total Data Read & Written per Run	2.5 TB
Size of Checkpoint File(s)	1 – 8 GB
Amount of Data Moved In/Out of NERSC	5 GB per run
On-Line File Storage Required (For I/O from a Running Job)	4 TB and 10,000 Files
Off-Line Archival Storage Required	25 TB and Files
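A quick consistency check on the baseline table can be sketched as follows. It assumes that the low end of the core range pairs with the low end of the memory range (and high with high), which the table does not state; the function names are hypothetical.

```python
# Back-of-envelope figures derived from the baseline table (pairing is assumed).
CORES = (8_192, 98_304)              # cores in a typical production run
TOTAL_MEMORY_GB = (16_000, 100_000)  # total memory used per run
WALLCLOCK_HOURS = 72                 # wallclock hours of a typical run


def memory_per_core_gb(total_gb, cores):
    """Average memory per core implied by total memory and core count."""
    return total_gb / cores


def core_hours(cores, hours):
    """Core-hours consumed by one run."""
    return cores * hours


if __name__ == "__main__":
    for cores, mem in zip(CORES, TOTAL_MEMORY_GB):
        print(cores, round(memory_per_core_gb(mem, cores), 2),
              core_hours(cores, WALLCLOCK_HOURS))
```

Under that pairing, the implied average is about 1.95 GB per core at 8,192 cores and about 1.02 GB per core at 98,304 cores, consistent with the 1 GB minimum listed, and a single 72-hour run costs between 589,824 and 7,077,888 core-hours.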

Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.

PETSc, SPRNG, ADIOS, PSPLINE

4. HPC Requirements in 5 Years

4a. We are formulating the requirements for NERSC that will enable you to meet the goals you outlined in Section 2 above. Please fill out the following table to the best of your ability. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions at the workshop.

Computational Hours Required per Year	900,000,000 Core-Hours
Anticipated Number of Cores to be Used in a Typical Production Run	100,000 – 500,000
Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above	72 hours
Anticipated Total Memory Used per Run	160,000 – 500,000 GB
Anticipated Minimum Memory Required per Core	0.5 GB
Anticipated total data read & written per run	32 – 160 TB
Anticipated size of checkpoint file(s)	1 GB
Anticipated Amount of Data Moved In/Out of NERSC	10 GB per run
Anticipated On-Line File Storage Required (For I/O from a Running Job)	50 TB and Files
Anticipated Off-Line Archival Storage Required	500 TB and Files

4b. What changes to codes, mathematical methods and/or algorithms do you anticipate will be needed to achieve this project's scientific objectives over the next 5 years.

The implementation of a fully electro-magnetic model in the code will put more emphasis on the multi-grid solver. We expect the time spent in the solver to increase five-fold compared to the current code running a simulation with the same number of grid points. The time spent in the charge deposition and gather-push phases will each double, since we will now deposit the electric current as well as the charge, and gather the magnetic force as well as the electrostatic force.
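To see how these factors combine into an overall run-time estimate, one can weight them by the fraction of time each phase takes today. The sketch below assumes a simple additive cost model; the time fractions in the example are hypothetical placeholders, not measurements of GTS, and only the factors of 5 and 2 come from the text above.

```python
# Hypothetical additive cost model for the electro-magnetic upgrade.

def em_slowdown(solver_frac, particle_frac,
                solver_factor=5.0, particle_factor=2.0):
    """Estimated run-time multiplier at a fixed number of grid points.

    solver_frac:   fraction of current run time spent in the field solver
    particle_frac: fraction spent in charge deposition plus gather-push
    The remaining fraction is assumed to be unchanged.
    """
    other_frac = 1.0 - solver_frac - particle_frac
    if solver_frac < 0 or particle_frac < 0 or other_frac < -1e-12:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    return (solver_frac * solver_factor
            + particle_frac * particle_factor
            + other_frac)


if __name__ == "__main__":
    # Assumed split for illustration: 10% solver, 80% particle work.
    print(em_slowdown(0.10, 0.80))
```

With the assumed 10%/80% split, the estimate is roughly 2.2 times the current run time; a code that spends more of its time in the solver would see a larger multiplier.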

We also plan to run a full-f version of GTS, where the particles will describe the complete particle distribution function. The number of particles will need to increase by about 100 times, making those simulations very expensive.

Impurity species will also be added to the code, further increasing the computational cost.

We intend to implement a new dimension of domain decomposition in GTS in order to decrease the memory requirements per MPI process and improve strong scaling.

4c. Please list any known or anticipated architectural requirements (e.g., 2 GB memory/core, interconnect latency < 1 μs).

A low interconnect latency and fast gather-scatter to memory are the most important hardware features for our 3D PIC codes.

4d. Please list any new software, services, or infrastructure support you will need over the next 5 years.

4e. It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores, perhaps GPUs or other accelerators. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.

We will continue to push our current mixed-mode MPI-OpenMP model as far as we can. We are also working with the Future Technologies Group at LBL on tuning the gather-scatter algorithm on multi-core processors and GPUs using different approaches, such as mixed precision and particle sorting.
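Particle sorting, one of the approaches mentioned above, can be sketched generically as follows. This is a plain stable counting sort by grid cell index, offered as an illustration of the idea and not as the algorithm being developed for GTS; the function name and the flat cell indexing are assumptions.

```python
# Generic counting sort of particles by grid cell (illustrative sketch).

def sort_by_cell(cells, ncells):
    """Return a permutation that orders particles by their grid cell index.

    cells[p] is the cell index of particle p, with 0 <= cells[p] < ncells.
    The sort is stable: particles in the same cell keep their relative order.
    """
    counts = [0] * (ncells + 1)
    for c in cells:
        counts[c + 1] += 1
    for i in range(ncells):
        counts[i + 1] += counts[i]  # counts[c] is now the first slot of cell c
    order = [0] * len(cells)
    for p, c in enumerate(cells):
        order[counts[c]] = p
        counts[c] += 1
    return order


if __name__ == "__main__":
    cells = [2, 0, 1, 0, 2]
    order = sort_by_cell(cells, 3)
    print(order)
    print([cells[p] for p in order])
```

Once particles that share a cell are stored next to each other, the gather and scatter steps touch neighboring grid memory for consecutive particles, which is the locality benefit that motivates sorting.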

5. New Science With New Resources

To help us get a better understanding of the quantitative requirements we've asked for above, please tell us: What significant scientific progress could you achieve over the next 5 years with access to 50X the HPC resources you currently have access to at NERSC? What would be the benefits to your research field if you were given access to these kinds of resources?

Please explain what aspects of "expanded HPC resources" are important for your project (e.g., more CPU hours, more memory, more storage, more throughput for small jobs, ability to handle very large jobs).