NERSC: Powering Scientific Discovery for 50 Years

Jeff Candy

FES Requirements Worksheet

1.1. Project Information - Magnetic Fusion Plasma Microturbulence Project

Document Prepared By

Jeff Candy

Project Title

Magnetic Fusion Plasma Microturbulence Project

Principal Investigator

Bruce Cohen

Participating Organizations

This project spans numerous labs and institutions worldwide.

Funding Agencies

 DOE SC   DOE NNSA   NSF   NOAA   NIH   Other:

2. Project Summary & Scientific Objectives for the Next 5 Years

Please give a brief description of your project - highlighting its computational aspect - and outline its scientific objectives for the next 3-5 years. Please list one or two specific goals you hope to reach in 5 years.

This project provides general support for gyrokinetic simulation of plasma microturbulence, including the Eulerian codes GYRO and GS2 (USA) and GENE (Germany), and the particle-in-cell codes GEM and PG3EQ (USA). This group of codes has produced the most ambitious turbulent-transport benchmarking exercise ever carried out in the community, and has a track record of extensive data sharing and collaboration. On the 5-year horizon, we have two related goals for GYRO modeling: first, carry out optimizations relevant to multi-scale simulations (simulations that simultaneously resolve ion and electron space/time scales) so that GYRO can be *routinely* applied to these cases by users; second, improve GYRO performance and integration into TGYRO so that profile prediction becomes routine practice as well.
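The cost of the multi-scale goal can be illustrated with a rough scale-separation estimate. This is an illustrative back-of-the-envelope sketch only, not a GYRO measurement; it assumes a deuterium mass ratio and naive proportional scaling of grid and timestep.

```python
import math

# Approximate deuterium ion-to-electron mass ratio (assumed value).
mass_ratio = 3670.0

# Perpendicular scale separation between ion and electron gyroradii
# goes like sqrt(m_i / m_e), roughly a factor of 60.
separation = math.sqrt(mass_ratio)

# Naive cost factor for resolving electron scales on top of ion scales:
# a factor of ~separation in each of two perpendicular directions,
# times a proportionally smaller timestep.
cost_factor = separation**2 * separation

print(f"scale separation ~{separation:.0f}, naive cost factor ~{cost_factor:.0e}")
```

This simple estimate is why "routine" multi-scale capability requires the targeted optimizations described above rather than simply larger allocations.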

3. Current HPC Usage and Methods

3a. Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)

The dominant usage of CPU cycles comes from the GYRO code, so for brevity we will narrow the discussion to GYRO. The temporal discretization is described in Chapter 5 of the GYRO Technical Guide (https://fusion.gat.com/THEORY/images/e/ea/Gyro_technical_guide.pdf). Briefly, an explicit RK4 scheme is used when all species are gyrokinetic. If electrons are drift-kinetic, they can be treated implicitly for improved efficiency; in this case a second-order IMEX-RK method is used. The spatial discretization is described in Chapter 4 of the Technical Guide. Briefly, a mixture of finite-difference, finite-element, pseudo-spectral and spectral methods is used. Parallelization is accomplished with pure MPI, although we are gearing up for a complete reworking of the parallelization scheme to target multi-scale wavenumber resolution on multi-core architectures. One feature of GYRO (and of Eulerian codes in general), borne out by a decade of code results, is that the grid resolution can be significantly lower than in competing particle-in-cell codes, due to high-order discretization methods and a clever choice of coordinates. For a typical (not multi-scale) GYRO production run, we need about 180 radial gridpoints (5th-order or higher method), 16 complex toroidal modes (spectral), 10 poloidal arc points per orbit (3rd-order), and 128 velocity gridpoints (spectral). For multi-scale runs, the resolution requirements increase significantly with the maximum resolved wavenumber, and these challenging cases are becoming the norm for GYRO users.
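The explicit time-advance option above can be sketched schematically. The following is a minimal classical RK4 step for a generic right-hand side `f`; it is a textbook illustration, not GYRO's actual gyrokinetic operator.

```python
import numpy as np

def rk4_step(f, t, y, dt):
    """One classical 4th-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + 0.5 * dt, y + 0.5 * dt * k1)
    k3 = f(t + 0.5 * dt, y + 0.5 * dt * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

# Sanity check on a simple decay problem dy/dt = -y, y(0) = 1,
# whose exact solution is y(t) = exp(-t).
y = np.array([1.0])
dt = 0.1
for step in range(10):
    y = rk4_step(lambda t, y: -y, step * dt, y, dt)

print(y[0])  # close to exp(-1) ~ 0.3679
```

The IMEX-RK variant used for drift-kinetic electrons differs in that the stiff (electron) terms are advanced implicitly within each stage, relaxing the timestep restriction they would otherwise impose.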

3b. Please list known limitations, obstacles, and/or bottlenecks that currently limit your ability to perform simulations you would like to run. Is there anything specific to NERSC?

Performance is not optimal on multi-core architectures, but this is not specific to NERSC. 

3c. Please fill out the following table to the best of your ability. This table provides baseline data to help extrapolate to requirements for future years. If you are uncertain about any item, please use your best estimate as a starting point for discussions.

Facilities Used or Using

 NERSC  OLCF  ALCF  NSF Centers  Other:  

Architectures Used

 Cray XT  IBM Power  BlueGene  Linux Cluster  Other:  

Total Computational Hours Used per Year

 30,000,000 Core-Hours

NERSC Hours Used in 2009

1,200,000 Core-Hours

Number of Cores Used in Typical Production Run

512

Wallclock Hours of Single Typical Production Run

12

Total Memory Used per Run

512 GB

Minimum Memory Required per Core

 1 GB

Total Data Read & Written per Run

 GB

Size of Checkpoint File(s)

 4 GB

Amount of Data Moved In/Out of NERSC

 GB per  

On-Line File Storage Required (For I/O from a Running Job)

 TB and  Files

Off-Line Archival Storage Required

 TB and  Files

Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.

 

4. HPC Requirements in 5 Years

4a. We are formulating the requirements for NERSC that will enable you to meet the goals you outlined in Section 2 above. Please fill out the following table to the best of your ability. If you are uncertain about any item, please use your best estimate as a starting point for discussions at the workshop.

Computational Hours Required per Year

5,000,000

Anticipated Number of Cores to be Used in a Typical Production Run

1024

Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above

24

Anticipated Total Memory Used per Run

1024  GB

Anticipated Minimum Memory Required per Core

 2 GB

Anticipated total data read & written per run

 GB

Anticipated size of checkpoint file(s)

 8 GB

Anticipated Amount of Data Moved In/Out of NERSC

 GB per  

Anticipated On-Line File Storage Required (For I/O from a Running Job)

 TB and  Files

Anticipated Off-Line Archival Storage Required

 TB and  Files

4b. What changes to codes, mathematical methods and/or algorithms do you anticipate will be needed to achieve this project's scientific objectives over the next 5 years.

The mathematical methods and overall code base are proven. However, significant re-optimization is needed now that architectures have evolved to multi-core and users are increasingly focusing on multi-scale simulations to more accurately capture the full electron energy transport physics.

4c. Please list any known or anticipated architectural requirements (e.g., 2 GB memory/core, interconnect latency < 1 μs).

4d. Please list any new software, services, or infrastructure support you will need over the next 5 years.

Assistance with code profiling and performance tuning is more critical than ever. 

4e. It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores, perhaps GPUs or other accelerators. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.

We are moving toward factorizing GYRO into discrete components (computational kernels) that can be optimized or otherwise rewritten on a per-architecture basis. We have avoided this in the past because of the difficulty of maintaining multiple sources, but in the future it may be unavoidable.
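One way to express that factorization is a registry of per-architecture kernel implementations selected once at startup. This is a hypothetical sketch of the pattern, not GYRO's actual source layout; the kernel name `field_solve` and the architecture labels are invented for illustration.

```python
# Hypothetical registry: one entry per (kernel, architecture) pair.
KERNELS = {}

def register(name, arch):
    """Decorator that files an implementation under (kernel, architecture)."""
    def wrap(fn):
        KERNELS[(name, arch)] = fn
        return fn
    return wrap

@register("field_solve", "cpu")
def field_solve_cpu(grid):
    return f"field_solve on {len(grid)} points (portable CPU path)"

@register("field_solve", "gpu")
def field_solve_gpu(grid):
    return f"field_solve on {len(grid)} points (accelerator path)"

def get_kernel(name, arch):
    # Fall back to the portable CPU version if no tuned variant exists,
    # so a single maintained reference implementation always works.
    return KERNELS.get((name, arch), KERNELS[(name, "cpu")])

solve = get_kernel("field_solve", "gpu")
print(solve(range(180)))  # dispatches to the accelerator variant
```

The design goal is that only the small set of registered kernels needs per-architecture maintenance, while the bulk of the code base remains a single source.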

5. New Science With New Resources

To help us get a better understanding of the quantitative requirements we've asked for above, please tell us: What significant scientific progress could you achieve over the next 5 years with access to 50X the HPC resources you currently have access to at NERSC? What would be the benefits to your research field if you were given access to these kinds of resources?

Please explain what aspects of "expanded HPC resources" are important for your project (e.g., more CPU hours, more memory, more storage, more throughput for small jobs, ability to handle very large jobs).

The evolution of HPC resources has not been in an optimal direction from our perspective. Centers like the OLCF have imposed a paradigm in which codes are supposed to run on more and more cores in a fixed amount of time (2-12 hours). In reality, however, increasing spatial resolution can bring in shorter timescales, so that even if codes scale well with increased spatial resolution, the total number of timesteps, and thus the total wallclock time, must increase. Thus, rather than very large core counts for the standard 2-12 hours, we need modestly increased core counts for significantly longer times (24-28 hours). Generally speaking, what is good for science is almost always large ensembles of runs (hundreds) that explore parameter space, never a few runs that approach the full machine size. So, as always, we think dedicated access with short wait times to moderate core counts is the most useful service a computing center can provide. We find that compute capacity is more important (perhaps much more important) for the progress of science than ultimate compute capability.
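The timestep argument above can be made concrete with a back-of-the-envelope estimate. The numbers here are illustrative only, not GYRO measurements; the assumption is a CFL-like restriction in which the stable timestep shrinks in proportion to the grid refinement.

```python
def required_wallclock(base_wallclock_h, refine):
    """
    If the stable timestep scales like 1/refine (CFL-like), the step
    count to reach the same physical time grows by `refine`. Even with
    enough extra cores for perfect weak scaling (per-step wallclock held
    fixed), the total wallclock therefore still grows by `refine`.
    """
    return base_wallclock_h * refine

# A 12-hour run at base resolution needs ~24 hours at 2x resolution,
# regardless of how many additional cores are thrown at it.
print(required_wallclock(12, 2))
```

This is why fixed 2-12 hour queue windows work against resolution increases: added cores can hold the per-step cost constant, but not the number of steps.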