NERSC: Powering Scientific Discovery for 50 Years

Jeff Candy

FES Requirements Worksheet

1.1. Project Information - Magnetic Fusion Plasma Microturbulence Project

Document Prepared By

Jeff Candy

Project Title

Magnetic Fusion Plasma Microturbulence Project

Principal Investigator

Bruce Cohen

Participating Organizations

This project spans numerous labs and institutions worldwide.

Funding Agencies

 DOE SC   DOE NNSA   NSF   NOAA   NIH   Other:

2. Project Summary & Scientific Objectives for the Next 5 Years

Please give a brief description of your project - highlighting its computational aspect - and outline its scientific objectives for the next 3-5 years. Please list one or two specific goals you hope to reach in 5 years.

This project provides general support for gyrokinetic simulation of plasma microturbulence, including the Eulerian codes GYRO and GS2 (USA) and GENE (Germany), and the particle-in-cell codes GEM and PG3EQ (USA). This group of codes has produced the most ambitious turbulent-transport benchmarking exercise ever carried out in the community, and has a track record of extensive data sharing and collaboration. On the 5-year horizon, we have two related goals for GYRO modeling: first, carry out optimizations relevant to multi-scale simulations (simulations that simultaneously resolve ion and electron space/time scales) so that GYRO can be *routinely* applied to these cases by users; second, improve GYRO performance and integration into TGYRO so that profile prediction becomes routine practice as well.
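The cost of the multi-scale goal can be illustrated with a rough scale-separation estimate. This is an illustrative back-of-the-envelope sketch only, not a GYRO measurement; it assumes a deuterium mass ratio and naive proportional scaling of grid and timestep.

```python
import math

# Approximate deuterium ion-to-electron mass ratio (assumed value).
mass_ratio = 3670.0

# Perpendicular scale separation between ion and electron gyroradii
# goes like sqrt(m_i / m_e), roughly a factor of 60.
separation = math.sqrt(mass_ratio)

# Naive cost factor for resolving electron scales on top of ion scales:
# a factor of ~separation in each of two perpendicular directions,
# times a proportionally smaller timestep.
cost_factor = separation**2 * separation

print(f"scale separation ~{separation:.0f}, naive cost factor ~{cost_factor:.0e}")
```

This simple estimate is why "routine" multi-scale capability requires the targeted optimizations described above rather than simply larger allocations.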

3. Current HPC Usage and Methods

3a. Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)

The dominant usage of CPU cycles comes from the GYRO code, so for brevity we will narrow the discussion to GYRO. The temporal discretization is described in Chapter 5 of the GYRO Technical Guide (https://fusion.gat.com/THEORY/images/e/ea/Gyro_technical_guide.pdf). Briefly, an explicit RK4 scheme is used when all species are gyrokinetic. If electrons are drift-kinetic, they can be treated implicitly for improved efficiency; in this case a second-order IMEX-RK method is used. The spatial discretization is described in Chapter 4 of the Technical Guide. Briefly, a mixture of finite-difference, finite-element, pseudo-spectral and spectral methods is used. Parallelization is accomplished with pure MPI, although we are gearing up for a complete reworking of the parallelization scheme to target multi-scale wavenumber resolution on multi-core architectures. One feature of GYRO (and of Eulerian codes in general), borne out by a decade of code results, is that the grid resolution can be significantly lower than in competing particle-in-cell codes, due to high-order discretization methods and a clever choice of coordinates. For a typical (not multi-scale) GYRO production run, we need about 180 radial gridpoints (5th-order or higher method), 16 complex toroidal modes (spectral), 10 poloidal arc points per orbit (3rd-order), and 128 velocity gridpoints (spectral). For multi-scale runs, the resolution requirements increase significantly with the maximum resolved wavenumber, and these challenging cases are becoming the norm for GYRO users.
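The explicit time-advance option above can be sketched schematically. The following is a minimal classical RK4 step for a generic right-hand side `f`; it is a textbook illustration, not GYRO's actual gyrokinetic operator.

```python
import numpy as np

def rk4_step(f, t, y, dt):
    """One classical 4th-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + 0.5 * dt, y + 0.5 * dt * k1)
    k3 = f(t + 0.5 * dt, y + 0.5 * dt * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

# Sanity check on a simple decay problem dy/dt = -y, y(0) = 1,
# whose exact solution is y(t) = exp(-t).
y = np.array([1.0])
dt = 0.1
for step in range(10):
    y = rk4_step(lambda t, y: -y, step * dt, y, dt)

print(y[0])  # close to exp(-1) ~ 0.3679
```

The IMEX-RK variant used for drift-kinetic electrons differs in that the stiff (electron) terms are advanced implicitly within each stage, relaxing the timestep restriction they would otherwise impose.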

3b. Please list known limitations, obstacles, and/or bottlenecks that currently limit your ability to perform simulations you would like to run. Is there anything specific to NERSC?

Performance is not optimal on multi-core architectures, but this is not specific to NERSC. 

3c. Please fill out the following table to the best of your ability. This table provides baseline data to help extrapolate to requirements for future years. If you are uncertain about any item, please use your best estimate as a starting point for discussions.

Facilities Used or Using

 NERSC  OLCF  ALCF  NSF Centers  Other:  

Architectures Used

 Cray XT  IBM Power  BlueGene  Linux Cluster  Other:  

Total Computational Hours Used per Year

 30,000,000 Core-Hours

NERSC Hours Used in 2009

1,200,000 Core-Hours

Number of Cores Used in Typical Production Run

512

Wallclock Hours of Single Typical Production Run

12

Total Memory Used per Run

512 GB

Minimum Memory Required per Core

 1 GB

Total Data Read & Written per Run

 GB

Size of Checkpoint File(s)

 4 GB

Amount of Data Moved In/Out of NERSC

 GB per  

On-Line File Storage Required (For I/O from a Running Job)

 TB and  Files

Off-Line Archival Storage Required

 TB and  Files

Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.

 

4. HPC Requirements in 5 Years

4a. We are formulating the requirements for NERSC that will enable you to meet the goals you outlined in Section 2 above. Please fill out the following table to the best of your ability. If you are uncertain about any item, please use your best estimate as a starting point for discussions at the workshop.

Computational Hours Required per Year

5,000,000

Anticipated Number of Cores to be Used in a Typical Production Run

1024

Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above

24

Anticipated Total Memory Used per Run

1024  GB

Anticipated Minimum Memory Required per Core

 2 GB

Anticipated total data read & written per run

 GB

Anticipated size of checkpoint file(s)

 8 GB

Anticipated Amount of Data Moved In/Out of NERSC

 GB per  

Anticipated On-Line File Storage Required (For I/O from a Running Job)

 TB and  Files

Anticipated Off-Line Archival Storage Required

 TB and  Files

4b. What changes to codes, mathematical methods and/or algorithms do you anticipate will be needed to achieve this project's scientific objectives over the next 5 years.

The mathematical methods and overall code base are proven. However, significant re-optimization is needed now that architectures have evolved to multi-core and users are increasingly focusing on multi-scale simulations to more accurately capture the full electron energy transport physics.

4c. Please list any known or anticipated architectural requirements (e.g., 2 GB memory/core, interconnect latency < 1 μs).

4d. Please list any new software, services, or infrastructure support you will need over the next 5 years.

Assistance with code profiling and performance tuning is more critical than ever. 

4e. It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores, perhaps GPUs or other accelerators. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.

We are moving toward factorizing GYRO into discrete components (computational kernels) that can be optimized or otherwise rewritten on a per-architecture basis. We have avoided this in the past because of the difficulty of maintaining multiple sources, but in the future it may be unavoidable.
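One way to express that factorization is a registry of per-architecture kernel implementations selected once at startup. This is a hypothetical sketch of the pattern, not GYRO's actual source layout; the kernel name `field_solve` and the architecture labels are invented for illustration.

```python
# Hypothetical registry: one entry per (kernel, architecture) pair.
KERNELS = {}

def register(name, arch):
    """Decorator that files an implementation under (kernel, architecture)."""
    def wrap(fn):
        KERNELS[(name, arch)] = fn
        return fn
    return wrap

@register("field_solve", "cpu")
def field_solve_cpu(grid):
    return f"field_solve on {len(grid)} points (portable CPU path)"

@register("field_solve", "gpu")
def field_solve_gpu(grid):
    return f"field_solve on {len(grid)} points (accelerator path)"

def get_kernel(name, arch):
    # Fall back to the portable CPU version if no tuned variant exists,
    # so a single maintained reference implementation always works.
    return KERNELS.get((name, arch), KERNELS[(name, "cpu")])

solve = get_kernel("field_solve", "gpu")
print(solve(range(180)))  # dispatches to the accelerator variant
```

The design goal is that only the small set of registered kernels needs per-architecture maintenance, while the bulk of the code base remains a single source.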

5. New Science With New Resources

To help us get a better understanding of the quantitative requirements we've asked for above, please tell us: What significant scientific progress could you achieve over the next 5 years with access to 50X the HPC resources you currently have access to at NERSC? What would be the benefits to your research field if you were given access to these kinds of resources?

Please explain what aspects of "expanded HPC resources" are important for your project (e.g., more CPU hours, more memory, more storage, more throughput for small jobs, ability to handle very large jobs).

The evolution of HPC resources has not been in an optimal direction from our perspective. Centers like the OLCF have imposed a paradigm in which codes are supposed to run on more and more cores in a fixed amount of time (2-12 hours). In reality, however, increasing spatial resolution can bring in shorter timescales, so that even if codes scale well with increased spatial resolution, the total number of timesteps, and thus the total wallclock time, must increase. Thus, rather than very large core counts for the standard 2-12 hours, we need modestly increased core counts for significantly longer times (24-28 hours). Generally speaking, what is good for science is almost always large ensembles of runs (hundreds) that explore parameter space, never a few runs that approach the full machine size. So, as always, we think dedicated access with short wait times to moderate core counts is the most useful service a computing center can provide. We find that compute capacity is more important (perhaps much more important) for the progress of science than ultimate compute capability.
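The timestep argument above can be made concrete with a back-of-the-envelope estimate. The numbers here are illustrative only, not GYRO measurements; the assumption is a CFL-like restriction in which the stable timestep shrinks in proportion to the grid refinement.

```python
def required_wallclock(base_wallclock_h, refine):
    """
    If the stable timestep scales like 1/refine (CFL-like), the step
    count to reach the same physical time grows by `refine`. Even with
    enough extra cores for perfect weak scaling (per-step wallclock held
    fixed), the total wallclock therefore still grows by `refine`.
    """
    return base_wallclock_h * refine

# A 12-hour run at base resolution needs ~24 hours at 2x resolution,
# regardless of how many additional cores are thrown at it.
print(required_wallclock(12, 2))
```

This is why fixed 2-12 hour queue windows work against resolution increases: added cores can hold the per-step cost constant, but not the number of steps.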