Zhihong Lin

FES Requirements Worksheet

1.1. Project Information - SciDAC GSEP Center and GPS-TTBP Center

Document Prepared By	Zhihong Lin
Project Title	SciDAC GSEP Center and GPS-TTBP Center
Principal Investigator	Zhihong Lin
Participating Organizations	University of California, Irvine
Funding Agencies	DOE SC DOE NSA NSF NOAA NIH Other:

2. Project Summary & Scientific Objectives for the Next 5 Years

Please give a brief description of your project - highlighting its computational aspect - and outline its scientific objectives for the next 3-5 years. Please list one or two specific goals you hope to reach in 5 years.

The SciDAC GSEP project will further extend the first-principles global gyrokinetic simulations to study new physics in the energetic particle turbulence and transport. The ultimate goal is to build the predictive capability for energetic particle turbulence and transport in the ITER burning plasmas, which requires understanding nonlinear physics of energetic particle instability, predicting energetic particle transport given a fixed energetic particle drive, and self-consistent gyrokinetic simulations of a full burst cycle of energetic particle turbulence.

The scientific goal of the SciDAC GPS-TTBP is the basic understanding of turbulent transport in tokamak plasmas and target research needed for predictive modeling of the confinement properties in burning plasmas.

3. Current HPC Usage and Methods

3a. Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)

I. Physics model

In GTC simulations, the phase of the fast gyration (cyclotron motion) of the charged particles around the magnetic field lines is averaged out, reducing the dimensionality of the system from 6D to 5D. This gyrokinetic method removes the fast cyclotron motion, which has a much higher frequency than the characteristic waves of plasma microturbulence. The particle-in-cell method consists of moving particles along the characteristics of the gyrokinetic equation. The electrostatic potential and field are obtained by solving the Poisson equation on a spatial mesh after gathering the charge density on the grids. The electrostatic forces are subsequently scattered back to the particle positions for advancing the particle orbits. The use of spatial grids and the procedure of gyroaveraging reduce the intensity of small-scale fluctuations (particle noise). Particle collisions can be recovered as a “subgrid” phenomenon via Monte Carlo methods. The particle noise is further reduced by using a perturbative simulation method in which only the perturbed distribution function is calculated in the simulation. Numerical properties of the electron dynamics are improved by an electrostatic fluid-kinetic hybrid electron model based on an expansion of the electron response using the electron–ion mass ratio as a small parameter. The electron response is adiabatic in the lowest order, and the nonadiabatic response is taken into account in the higher-order equations.
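The field-solve cycle described above (charge density on the grid, Poisson solve, forces back at the particle positions) can be sketched in a much simpler setting. The following is a minimal 1D periodic, electrostatic illustration using NumPy, not GTC's 5D gyrokinetic algorithm: there is no gyroaveraging, no toroidal geometry, and no electron model. The function names, the linear weighting, and the FFT-based solver are assumptions chosen for brevity. The particle weights `w` play the role of the perturbed distribution function carried by each marker in a perturbative scheme.

```python
import numpy as np

def deposit_charge(x, w, n_grid, length):
    """Accumulate particle weights on a periodic 1D grid (linear weighting)."""
    dx = length / n_grid
    s = (x % length) / dx                  # particle position in grid units
    i = np.floor(s).astype(int) % n_grid   # index of the grid point to the left
    f = s - np.floor(s)                    # fractional distance to that point
    rho = np.zeros(n_grid)
    np.add.at(rho, i, w * (1.0 - f))       # share of the weight for the left point
    np.add.at(rho, (i + 1) % n_grid, w * f)  # remainder for the right point
    return rho / dx                        # convert to a density

def solve_poisson(rho, length):
    """Solve -d2(phi)/dx2 = rho on a periodic grid; the mean of phi is set to zero."""
    n = rho.size
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    rho_k = np.fft.fft(rho)
    phi_k = np.zeros_like(rho_k)
    phi_k[1:] = rho_k[1:] / k[1:] ** 2     # skip k = 0 (uniform background)
    return np.real(np.fft.ifft(phi_k))

def interpolate_field(x, phi, length):
    """Evaluate E = -d(phi)/dx at the particle positions (linear weighting)."""
    n = phi.size
    dx = length / n
    e_grid = -(np.roll(phi, -1) - np.roll(phi, 1)) / (2.0 * dx)
    s = (x % length) / dx
    i = np.floor(s).astype(int) % n
    f = s - np.floor(s)
    return e_grid[i] * (1.0 - f) + e_grid[(i + 1) % n] * f

# Usage: three marker particles with perturbed weights on a 16-point grid.
length = 2.0 * np.pi
x = np.array([0.1, 1.7, 6.2])
w = np.array([1.0, -0.5, 0.25])
rho = deposit_charge(x, w, 16, length)
phi = solve_poisson(rho, length)
e_at_particles = interpolate_field(x, phi, length)
```

Using the same linear weighting for the deposition and the interpolation is a common choice in particle-in-cell codes; whether GTC's weighting matches this sketch is not stated in this worksheet.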

GTC employs magnetic coordinates, which provide the most general coordinate system for any magnetic configuration possessing nested flux surfaces. General geometry with strong shaping has been implemented in GTC using a Poisson solver in real space and a spline fit of the equilibrium data from an MHD code such as EFIT. The property of straight field lines in magnetic coordinates is most suitable for describing instabilities with field-aligned eigenmodes and enables the implementation of an efficient global field-aligned mesh for the quasi-2D structure of the plasma turbulence in toroidal geometry. The global field-aligned mesh provides the highest possible computational efficiency without any simplification of the physics models or simulation geometry. Magnetic coordinates are also desirable for efficiently integrating the particle orbits, which move predominantly along the magnetic field lines. The equations of motion can be derived from a Hamiltonian formulation, which conserves phase-space volume and is best suited to integrating particle orbits over long periods.
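Why phase-space-volume conservation matters for long orbit integrations can be illustrated with a toy problem. The sketch below is not GTC's orbit integrator, which this worksheet does not describe. It compares two first-order schemes on a unit harmonic oscillator, H = (q^2 + p^2)/2: explicit Euler, which does not preserve phase-space volume, and symplectic Euler, which does. All names are hypothetical.

```python
def explicit_euler(q, p, dt, steps):
    """Non-symplectic update: both variables advance from the old state."""
    for _ in range(steps):
        q, p = q + dt * p, p - dt * q
    return q, p

def symplectic_euler(q, p, dt, steps):
    """Volume-preserving update: the new momentum is used to advance q."""
    for _ in range(steps):
        p = p - dt * q
        q = q + dt * p
    return q, p

def energy(q, p):
    """Hamiltonian of the unit harmonic oscillator."""
    return 0.5 * (q * q + p * p)

# Usage: integrate for 1000 steps from (q, p) = (1, 0), where the energy is 0.5.
e_explicit = energy(*explicit_euler(1.0, 0.0, 0.1, 1000))
e_symplectic = energy(*symplectic_euler(1.0, 0.0, 0.1, 1000))
print(e_explicit, e_symplectic)
```

With explicit Euler the energy grows by a factor (1 + dt^2) every step, so the orbit spirals outward without bound. With symplectic Euler the energy only oscillates within a few percent of its initial value, however long the run.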

II. Computational model

GTC employs three levels of parallelism. The original parallel scheme implemented in GTC is a 1D domain decomposition in the symmetric, toroidal direction (the long way around the torus) using the Message Passing Interface (MPI). Each MPI process is in charge of a toroidal domain with both particles and fields. Particles move from one domain to another as they travel around the torus. All communications are one-way traffic to avoid congestion. A second level of parallelism was later implemented [11] to increase the concurrency. Within each toroidal domain, the particles are now divided among several MPI processes, but each process keeps a copy of all the fields on a single toroidal plane. A “particle-domain” communicator links the MPI processes within a toroidal domain of the original 1D domain decomposition, while a “toroidal-domain” communicator links, in a ring-like fashion, all the MPI processes with the same intra-domain rank. To take advantage of the shared-memory capability of multi-core nodes, a third level of parallelism is implemented [11] at the loop level using OpenMP compiler directives. These three levels of parallelism, using mixed-mode MPI-OpenMP, enable GTC to scale to a very large number of processors and use a very large number of particles, which results in very high phase-space resolution and low statistical noise. In weak scaling, GTC's computing power increases almost linearly with the number of cores, up to 100,000 cores on the Cray XT5 supercomputer. GTC is portable and optimized for various scalar and vector supercomputers.
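The two MPI communicators described above can be pictured with plain rank arithmetic. The sketch below uses no MPI library. It assumes that consecutive global ranks share a toroidal domain, an ordering this worksheet does not specify, and every function name is hypothetical. In an MPI code, the two groups would typically be created by splitting the global communicator on these two indices.

```python
def decompose(rank, n_toroidal, n_per_domain):
    """Map a global rank to (toroidal domain index, intra-domain rank)."""
    if not 0 <= rank < n_toroidal * n_per_domain:
        raise ValueError("rank outside the process grid")
    return rank // n_per_domain, rank % n_per_domain

def particle_domain_group(rank, n_toroidal, n_per_domain):
    """Global ranks that share this rank's toroidal domain (and its field copy)."""
    domain, _ = decompose(rank, n_toroidal, n_per_domain)
    return [domain * n_per_domain + r for r in range(n_per_domain)]

def toroidal_ring_neighbors(rank, n_toroidal, n_per_domain):
    """(left, right) global ranks with the same intra-domain rank on the ring."""
    domain, intra = decompose(rank, n_toroidal, n_per_domain)
    left = ((domain - 1) % n_toroidal) * n_per_domain + intra
    right = ((domain + 1) % n_toroidal) * n_per_domain + intra
    return left, right

# Usage: 4 toroidal domains with 3 MPI processes each (12 processes in total).
print(decompose(7, 4, 3))                 # (2, 1)
print(particle_domain_group(7, 4, 3))     # [6, 7, 8]
print(toroidal_ring_neighbors(7, 4, 3))   # (4, 10)
print(toroidal_ring_neighbors(10, 4, 3))  # (7, 1): the ring wraps around
```

Particles that leave a toroidal domain would be passed to a ring neighbor, while charge deposited by the processes of one particle-domain group would be summed within that group. The third, OpenMP level acts on loops inside each process and does not appear in this rank picture.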

When GTC uses hundreds of thousands of cores, having each node create an individual netCDF restart file causes unacceptable delays in the file system due to the large number of simultaneous file creation requests. To remove this bottleneck, the I/O was rewritten to use HDF5 collectives, and an abstraction layer called ADIOS (ADaptable IO System) [Lofstead2007] was developed and implemented in GTC. ADIOS provides a simple API that can automatically select the best technique for each grouping of data, as specified by an entry in an external XML configuration file, without touching the science part of the code.

3b. Please list known limitations, obstacles, and/or bottlenecks that currently limit your ability to perform simulations you would like to run. Is there anything specific to NERSC?

Currently, the PETSc solver is a potential bottleneck for parallelization beyond the petascale.

3c. Please fill out the following table to the best of your ability. This table provides baseline data to help extrapolate to requirements for future years. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions.

Facilities Used or Using	NERSC OLCF ALCF NSF Centers Other:
Architectures Used	Cray XT IBM Power BlueGene Linux Cluster Other:
Total Computational Hours Used per Year	50,000,000 Core-Hours
NERSC Hours Used in 2009	7,950,000 Core-Hours
Number of Cores Used in Typical Production Run	3,000-40,000
Wallclock Hours of Single Typical Production Run	20
Total Memory Used per Run	5,000-50,000 GB
Minimum Memory Required per Core	1 GB
Total Data Read & Written per Run	1,000-10,000 GB
Size of Checkpoint File(s)	100-1,000 GB
Amount of Data Moved In/Out of NERSC	1 GB per month
On-Line File Storage Required (For I/O from a Running Job)	1 TB and 100 Files
Off-Line Archival Storage Required	0.1 TB and 1,000 Files

Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.

4. HPC Requirements in 5 Years

4a. We are formulating the requirements for NERSC that will enable you to meet the goals you outlined in Section 2 above. Please fill out the following table to the best of your ability. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions at the workshop.

Computational Hours Required per Year	1,000,000,000
Anticipated Number of Cores to be Used in a Typical Production Run	5,000,000
Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above	20
Anticipated Total Memory Used per Run	5,000,000 GB
Anticipated Minimum Memory Required per Core	1 GB
Anticipated total data read & written per run	1,000,000 GB
Anticipated size of checkpoint file(s)	100,000 GB
Anticipated Amount of Data Moved In/Out of NERSC	100 GB per month
Anticipated On-Line File Storage Required (For I/O from a Running Job)	100 TB and 10,000 Files
Anticipated Off-Line Archival Storage Required	100 TB and 10,000 Files
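As a consistency check on the figures above, the short script below derives a few quantities from the tables in Sections 3c and 4a. All input values are copied from this worksheet. The derived quantities (runs per year, growth factors) are simple arithmetic and are not stated by the author. The variable names are illustrative only.

```python
# Inputs copied from the Section 3c (current) and Section 4a (5-year) tables.
current = {
    "core_hours_per_year": 50_000_000,
    "max_cores_per_run": 40_000,   # upper end of the 3,000-40,000 range
    "wallclock_hours": 20,
}
future = {
    "core_hours_per_year": 1_000_000_000,
    "cores_per_run": 5_000_000,
    "wallclock_hours": 20,
    "memory_gb": 5_000_000,
    "checkpoint_gb": 100_000,
}

def core_hours_per_run(cores, wallclock_hours):
    """Cost of one production run in core-hours."""
    return cores * wallclock_hours

future_run_cost = core_hours_per_run(future["cores_per_run"], future["wallclock_hours"])
runs_per_year = future["core_hours_per_year"] // future_run_cost
memory_per_core_gb = future["memory_gb"] / future["cores_per_run"]
hours_growth = future["core_hours_per_year"] / current["core_hours_per_year"]
cores_growth = future["cores_per_run"] / current["max_cores_per_run"]
checkpoint_fraction = future["checkpoint_gb"] / future["memory_gb"]

print(f"core-hours per future run: {future_run_cost:,}")       # 100,000,000
print(f"future runs per year:      {runs_per_year}")           # 10
print(f"memory per core (GB):      {memory_per_core_gb}")      # 1.0
print(f"growth in yearly hours:    {hours_growth:.0f}x")       # 20x
print(f"growth in cores per run:   {cores_growth:.0f}x")       # 125x
print(f"checkpoint / total memory: {checkpoint_fraction:.0%}") # 2%
```

The anticipated total memory divided by the anticipated core count reproduces the stated minimum of 1 GB per core. At 20 wallclock hours on 5,000,000 cores, the requested yearly allocation covers about ten full-scale production runs.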

4b. What changes to codes, mathematical methods and/or algorithms do you anticipate will be needed to achieve this project's scientific objectives over the next 5 years.

4c. Please list any known or anticipated architectural requirements (e.g., 2 GB memory/core, interconnect latency < 1 μs).

short latency for accessing on-node memory;
efficient multi-core shared-memory parallelization.

4d. Please list any new software, services, or infrastructure support you will need over the next 5 years.

Support in I/O, GPU etc.

4e. It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores, perhaps GPUs or other accelerators. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.

Multi-level parallelization taking advantage of the memory hierarchy.

New Science With New Resources

To help us get a better understanding of the quantitative requirements we've asked for above, please tell us: What significant scientific progress could you achieve over the next 5 years with access to 50X the HPC resources you currently have access to at NERSC? What would be the benefits to your research field if you were given access to these kinds of resources?

Please explain what aspects of "expanded HPC resources" are important for your project (e.g., more CPU hours, more memory, more storage, more throughput for small jobs, ability to handle very large jobs).

New Science: long-time simulation of transport barrier formation; cross-scale turbulence coupling between energetic particles and thermal plasmas.

Expanded HPC resources: more cores and memory, more CPU hours