Preparing for Perlmutter: DGX-A100 Nodes Arrive at NERSC
October 5, 2020
By NERSC staff members Jack Deslippe and Doug Jacobsen
This contribution is the first in a new series of blog posts written directly by the engineers at NERSC who keep the center running. In this series, we hope to share our enthusiasm for HPC and scientific computing and reveal some of the insider details about what makes NERSC NERSC.
At the end of 2020, NERSC will be receiving the first phase of Perlmutter, a Cray/HPE system that will include more than 6,000 recently announced NVIDIA A100 (Ampere) GPUs. The A100 GPUs sport a number of novel features we think the scientific community will be able to harness for accelerating discovery. A few of these features include:
- Tensor core support for the FP64 and TF32 data types, providing 19.5 TF and 156 TF of peak compute performance, respectively
- Multi-Instance GPU (MIG) support, which partitions a single A100 GPU into as many as seven smaller GPUs from a system, user, and programmer perspective
- 40 GB of high-bandwidth memory (HBM), a significant jump from the 16 GB common on previous-generation V100s in HPC deployments
- 1.6 TB/s of HBM bandwidth, which is about 1.75x the previous generation
- Acceleration for certain AI-related and sparse-data operations
- Significant increases in cache sizes over the previous generation: a 40 MB L2 cache (7x the V100) and 192 KB of combined L1 cache/shared memory per SM (1.5x the V100)
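To give a flavor of the MIG feature above, an A100 can be carved into isolated GPU instances directly from the command line with NVIDIA's `nvidia-smi` tool. The sketch below assumes an A100-equipped node with a recent driver and administrative access; the available instance profiles depend on the GPU model:

```shell
# Enable MIG mode on GPU 0 (requires root; takes effect after a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this device supports
nvidia-smi mig -lgip

# Create two GPU instances with the 3g.20gb profile (roughly half the
# SMs and 20 GB of HBM each), plus their default compute instances (-C)
sudo nvidia-smi mig -cgi 3g.20gb,3g.20gb -C

# Confirm: each MIG instance now appears as its own schedulable device
nvidia-smi -L
```

From a user's perspective, each resulting instance looks like a smaller standalone GPU, which is what makes MIG attractive for sharing a node among jobs that cannot saturate a full A100.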
How are we getting ready to deploy a GPU system at NERSC and prepare science codes to use these new capabilities? Great question! Through the NERSC Exascale Science Applications Program (NESAP) for Perlmutter, we’re already working with approximately 25 different code teams to ready their applications for the Perlmutter architecture. In addition, last year we acquired and installed on Cori 144 NVIDIA V100 (Volta) GPUs across 18 nodes that make up the “Cori-GPU” partition. Since then, our staff, postdocs, and partners have been using these nodes to gain expertise in deploying, configuring, and managing GPU resources, along with expertise in developing, analyzing, and optimizing simulation, data, and learning applications on the V100s.
But what about all those new A100 features? While Perlmutter continues to bake for a couple more months, we’ve just deployed two NVIDIA DGX-A100 server nodes, which have been added to the Cori-GPU partition. Each node contains two AMD EPYC (Rome) processors and eight A100 GPUs connected via NVLink 3 (the same technology that will connect the GPUs in Perlmutter).
Deploying a couple of servers with A100 GPUs has allowed our staff to gain experience in delivering the hardware to users. The nodes were set up to match the rest of the Cori GPU nodes, including SLURM job scheduling. As much as possible, the nodes share the same configuration as the rest of the center, allowing the use of NERSC authentication and home directories, the Community and Cori Scratch file systems, and access to center resources like Spin.
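As a sketch of what using these nodes looks like from the user side, a batch job on a GPU partition can be submitted with a SLURM script along the following lines. The constraint name and application binary here are illustrative assumptions, not the exact Cori configuration:

```shell
#!/bin/bash
#SBATCH --constraint=gpu       # select the GPU nodes (constraint name is illustrative)
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8      # all eight A100s on a DGX-A100 node
#SBATCH --ntasks-per-node=8    # one MPI rank per GPU
#SBATCH --time=00:30:00

# Launch one rank per GPU; my_gpu_app is a placeholder for a user application
srun ./my_gpu_app
```

Because the DGX-A100 nodes sit behind the same scheduler and file systems as the rest of Cori, a script like this is often all that changes when a team moves a job from the V100 nodes to the A100 nodes.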
Fellow NERSC staffer Muaaz Awan is coordinating internal and NERSC partner access to the DGX-A100 nodes. NERSC staff have already begun running NESAP applications on the nodes using the latest NVIDIA SDK, which includes updates for the A100. Teams are already seeing speedups of up to 2x compared with the older V100 GPUs, with the improvement depending on which aspects of the GPU hardware an application stresses.
Awan described the opportunity to work with the DGX-A100 nodes: “It’s been a highlight being among the first few to get hands-on with the latest NVIDIA GPUs. While the hardware comes with significant upgrades, from a developer’s perspective little effort is needed for our applications to run on these devices. A lot of the performance improvements we are getting from the A100s don’t require extra effort. This is serving as a great launchpad toward Perlmutter.”
Being part of the NERSC team comes with a tremendous sense of wonder, and of responsibility to the scientific community we serve, which relies on this facility for world-changing discovery. Some days the job is awe-inspiring in itself, such as being among the first in the world to run on new technology like the DGX-A100.
We’re looking forward to working with our NESAP teams and other partners on the new DGX nodes and are excited to deploy A100 GPUs for the entire DOE Office of Science community when Perlmutter arrives.