NERSC-8 System: Cori
NERSC's next supercomputer system, named after American biochemist Gerty Cori, will be a Cray system based on a next-generation Intel® Many Integrated Core (MIC) Architecture. The system will have a sustained performance that is at least ten times that of the NERSC-6 "Hopper" system, based on a set of characteristic benchmarks. Some important characteristics (updated June 24, 2014) of the system include:
- Self-hosted architecture, not a co-processor, not an accelerator
- Next-generation Intel® Xeon Phi™ Knights Landing (KNL) product with improved single-thread performance, targeted at highly parallel computing
- Intel® "Silvermont" architecture enhanced for high performance computing
- Over 9,300 single-socket nodes in the system, with each node having > 3 teraflop/s theoretical peak performance
- Better performance per watt than previous generation Xeon Phi™ systems and 3X single-thread performance
- MPI + OpenMP programming model
- AVX-512 vector pipelines with a hardware vector length of 512 bits (eight double-precision elements)
- On-package, high-bandwidth memory, up to 16 GB capacity with bandwidth projected to be 5X that of DDR4 DRAM
- 64-128 GB of DRAM memory per node
- Greater than 60 cores per node with support for four hardware threads each; more cores than current generation Intel Xeon Phi™
- Cray Aries high speed "dragonfly" topology interconnect, cabinets, and cooling (same as in Edison)
- Lustre filesystem with > 430 GB/sec I/O bandwidth and 28 PB of disk capacity
- Intel, Cray, and GNU programming environments
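As a rough, back-of-the-envelope illustration, the figures in the list above can be combined into system-wide totals. The sketch below uses Python purely for the arithmetic; the variable names are illustrative, and because the node count, core count, and per-node peak are quoted only as lower bounds, the computed peak and thread counts are floors rather than specifications.

```python
# Back-of-the-envelope totals from the figures quoted in the list above.
# All names are illustrative; results are bounds, not system specifications.

NODES_MIN = 9_300             # "over 9,300 single-socket nodes"
PEAK_TFLOPS_PER_NODE = 3.0    # "> 3 teraflop/s" theoretical peak per node
CORES_PER_NODE_MIN = 60       # "greater than 60 cores per node"
HW_THREADS_PER_CORE = 4       # four hardware threads per core
VECTOR_BITS = 512             # hardware vector length
DOUBLE_BITS = 64              # one double-precision element
DRAM_GB_PER_NODE = (64, 128)  # quoted DRAM range per node

# Lower bound on aggregate theoretical peak, in petaflop/s.
system_peak_pflops = NODES_MIN * PEAK_TFLOPS_PER_NODE / 1_000

# Lower bound on hardware threads available on each node.
threads_per_node = CORES_PER_NODE_MIN * HW_THREADS_PER_CORE

# Double-precision elements processed by one vector instruction.
lanes = VECTOR_BITS // DOUBLE_BITS

# Aggregate DRAM (decimal TB) at the low and high end of the quoted range.
dram_tb = tuple(NODES_MIN * gb / 1_000 for gb in DRAM_GB_PER_NODE)

print(f"aggregate peak      : at least {system_peak_pflops:.1f} petaflop/s")
print(f"hardware threads    : at least {threads_per_node} per node")
print(f"double-precision lanes per vector: {lanes}")
print(f"aggregate DRAM      : {dram_tb[0]:.1f} to {dram_tb[1]:.1f} TB")
```

These totals follow directly from the listed minimums: more than 9,300 nodes at more than 3 teraflop/s each implies an aggregate theoretical peak above 27.9 petaflop/s, and more than 60 cores with four hardware threads each implies more than 240 hardware threads per node.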
Installation will be mid-2016 in the new CRT building in Berkeley.
Preparing NERSC Users for NERSC-8
A key element of NERSC-8 involves beginning to transition the broad NERSC computational workload to energy efficient, "manycore" architectures. We at NERSC fully recognize that this may be a challenge for some projects. Although it will be easy to get applications running on NERSC-8, it is expected that many applications will require code modification to achieve good performance.
NERSC has already begun a comprehensive "Application Readiness" effort to successfully meet this challenge (see below). However, it is important that users also prepare for the new machine on their own via two systems that exist at NERSC right now.
In a nutshell, the challenge of getting good performance on NERSC-8 involves three things:
- Finding more parallelism in your simulation code;
- Expressing different kinds of parallelism than what your code may have now, meaning at a different level of granularity and with a different programming model;
- Dealing with a deeper and less abundant memory hierarchy.
Using Edison to Help Prepare for NERSC-8
You can use Edison today to start preparing for Cori. Most codes on Hopper and Edison assign a single MPI rank to each processing element; however, on NERSC-8, this may not be possible, due to memory limitations. On NERSC-8 it will be necessary to use less MPI parallelism and more OpenMP parallelism. In many codes, OpenMP threads can be added incrementally, via pragmas or compiler directives. In other codes, more restructuring may be needed. You should try adding OpenMP to your code on Edison today. Think about how your problem can be decomposed using thread-level parallelism, meaning independent threads of execution within each MPI process. Doing this on Edison is not only a good preparation for NERSC-8; it may also improve performance on Edison and/or allow you to run larger simulations.
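The memory argument above can be made concrete with a little arithmetic. The following Python sketch is illustrative only: it assumes a node at the low end of the quoted NERSC-8 ranges (64 GB of DRAM, 60 cores, four hardware threads per core), assumes memory is split evenly among MPI ranks, and ignores operating system and MPI buffer overhead. The function names are hypothetical.

```python
# Illustrative only: DRAM available to each MPI rank on one node for different
# MPI-rank x OpenMP-thread layouts. The node figures are the low end of the
# ranges quoted for NERSC-8; the even split with no OS or MPI-buffer overhead
# is a simplifying assumption.

NODE_DRAM_GB = 64
CORES = 60
HW_THREADS_PER_CORE = 4
HW_THREADS = CORES * HW_THREADS_PER_CORE  # 240 hardware threads per node


def layouts(hw_threads):
    """Yield (ranks, threads_per_rank) pairs that use every hardware thread."""
    for ranks in range(1, hw_threads + 1):
        if hw_threads % ranks == 0:
            yield ranks, hw_threads // ranks


def dram_per_rank_gb(ranks, node_dram_gb=NODE_DRAM_GB):
    """DRAM per MPI rank, assuming an even split across ranks."""
    return node_dram_gb / ranks


for ranks, threads in layouts(HW_THREADS):
    print(f"{ranks:4d} ranks x {threads:3d} threads -> "
          f"{dram_per_rank_gb(ranks):6.2f} GB per rank")
```

Under these assumptions, one MPI rank per hardware thread leaves roughly 0.27 GB per rank, whereas four ranks of 60 OpenMP threads each leave 16 GB per rank. Any data that every rank replicates is paid for once per rank, which is why fewer ranks with more threads per rank relieves memory pressure.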
Both NERSC-8 and Edison utilize vector processing, but it will be particularly important on NERSC-8. Vector hardware improves speed and energy efficiency by issuing single instructions that carry out identical operations on different data. If you write Fortran DO (or C "for") loops, in many cases the compiler will generate vector instructions automatically. You can (and should) use Edison today to determine how much vectorization is present in your code. Many common programming constructs can prevent or obscure vectorization, meaning that loops are often written in a way that prevents the compiler from vectorizing them. Using Edison, you can have the compiler emit diagnostics that tell you whether loops are vectorized or whether a programming construct or data dependence prohibits it. Sometimes loops can easily be rewritten; sometimes entire algorithms will have to be restructured for good vector performance. Good vector performance means several things, among which are that all loops in the computationally intensive portions of your code vectorize and that the loop lengths are large.
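To illustrate why a data dependence can prohibit vectorization, the sketch below imitates vector execution in plain Python: each "vector instruction" loads a group of elements, operates on all of them, and only then stores the results. This is a toy model, not real vector hardware or compiler output; the 8-lane width is taken from the 512-bit figure quoted for NERSC-8, and all function names are illustrative.

```python
# Toy model of vector execution: every lane in a group reads its inputs
# before any lane writes its result. Not real hardware or compiler output.

WIDTH = 8  # 512 bits / 64-bit doubles (assumed lane count)


def axpy_scalar(a, b, s):
    """Independent iterations: out[i] depends only on a[i] and b[i]."""
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + s * b[i]
    return out


def axpy_lanes(a, b, s, width=WIDTH):
    """Same loop, processed one group of `width` elements at a time."""
    out = [0.0] * len(a)
    for start in range(0, len(a), width):
        stop = min(start + width, len(a))
        lanes_a = a[start:stop]  # "vector load"
        lanes_b = b[start:stop]
        out[start:stop] = [x + s * y for x, y in zip(lanes_a, lanes_b)]
    return out


def running_sum_scalar(a, b):
    """Loop-carried dependence: out[i] needs the freshly written out[i-1]."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + b[i]
    return out


def running_sum_lanes(a, b, width=WIDTH):
    """Naive lane-wise version: lanes read stale out[i-1] values."""
    out = list(a)
    for start in range(1, len(out), width):
        stop = min(start + width, len(out))
        previous = out[start - 1:stop - 1]  # loaded before any lane stores
        out[start:stop] = [p + y for p, y in zip(previous, b[start:stop])]
    return out


a = [float(i) for i in range(20)]
b = [2.0] * 20
ones = [1.0] * 20

print("independent loop matches:",
      axpy_scalar(a, b, 3.0) == axpy_lanes(a, b, 3.0))
print("dependent loop matches:  ",
      running_sum_scalar(ones, ones) == running_sum_lanes(ones, ones))
```

The independent loop gives the same answer either way, so a compiler is free to vectorize it. The running sum does not, because each iteration needs the value the previous iteration has just written; a compiler that detects this dependence will decline to vectorize the loop as written, and the loop or algorithm has to be restructured.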
NERSC has begun preparing extensive documentation about vectorization. We will continue to update this web page as we gain more experience.
Does vectorization matter on Edison? Yes! Your code could run twice as fast on Edison if it is properly vectorized. See the documentation above for examples.
Using the NERSC Babbage Test Bed to Help Prepare for NERSC-8
NERSC's Babbage test bed system contains Intel "Knights Corner" MIC architecture processors and it can be used to begin the process of transitioning to manycore computing because it contains 60 computational cores per chip. You should use this system in "native" mode because this is how NERSC-8 will be used. Babbage can be used to compile codes, to enhance vectorization and parallel threading, and to run in a relatively memory-poor environment. However, because the processor is a previous, early generation of the Intel manycore architecture, and because MPI performance between MIC cards across nodes on Babbage is not optimal (since communication has to go through the PCIe connection between host nodes and MIC cards), you should not assume that performance on Babbage is in any way representative of NERSC-8. You should, however, use Babbage to improve single-node performance of your code, especially vectorization and thread scalability. See the Babbage web pages for more information on this system.
The NERSC Application Readiness Effort
NERSC began its Application Readiness effort over a year ago to study thread parallelism and vectorization in key codes on local test beds. The first result of this effort is a compendium of case studies reflecting porting effort, performance results, best practices, and common issues observed using real codes. A key finding from this effort is that although the transition can be disruptive in terms of programming – codes must be modified in various ways to achieve good performance – a positive outcome is that the resulting codes can also perform better on traditional architectures.
Going forward, NERSC will employ a multipronged effort to assist users. The NERSC Exascale Science Application Program (NESAP) is a key part of this. We will also actively collaborate with and learn from other sites and various research efforts throughout the HPC community.
The NERSC-8 contract includes an option for a “Burst Buffer,” a layer of Non-Volatile Random-Access Memory (NVRAM) designed to accelerate I/O performance for some programs. NVRAM is the technology typically used in "Flash" memory. The precise architecture of the NERSC-8 Burst Buffer and the way in which users would access it, should the contract option be exercised, is still being worked out.
Questions About NERSC-8
What is a manycore architecture?
Although there's no precise definition, especially since architectures evolve over time, manycore processors generally have a large number of computational cores on a single die with a scalable on-die network. A key difference between "manycore" and "multicore" is that current manycore architectures have relatively simplified, lower-frequency computational cores offering somewhat less instruction-level parallelism, engineered for lower power consumption and heat dissipation. Multicore chips evolved somewhat slowly, from 2 to about 12 cores per chip, consuming roughly the same energy as a single-core chip did, whereas manycore chips represent more of a leap, to 50-100 cores per chip that consume the same or less energy per core. Again, a key distinction is that in manycore chips, core count essentially matters more than individual core performance, whereas in multicore chips, individual core performance is not sacrificed. Another key distinction is in programming, because multicore processors emphasize instruction-level parallelism that is obtained automatically by the hardware, whereas manycore processors emphasize thread-level and data-level parallelism that generally must be explicitly expressed by the programmer.
Why is NERSC acquiring a manycore architecture?
NERSC directly supports DOE’s science mission. We focus on the scientific impact of our users, but the only way to continue to provide compute speed improvements that meet user needs is to move to more energy-efficient architectures. Throughout the computing industry, because of unavoidable power constraints, processor speeds have stalled, despite continued transistor density increases. Instead of improving clock frequency, chip vendors have been increasing the number of cores on a chip and this trend is expected to continue. Vendors are also simplifying the hardware and instruction set architectures of those cores, as a way of reducing power consumption. These changes are directly related to those required to achieve Exascale computing levels. Exascale computing will provide the computational resources needed to solve science challenges in DOE’s mission and NERSC will align its acquisitions with Exascale-relevant architectures. But such changes will affect computing at all levels and it is vital to ensure that DOE applications continue to efficiently harness the potential of commercial hardware.
DOE realizes that extreme scale computing cannot be achieved by a "business-as-usual," evolutionary approach and that its missions pushing the frontiers of science and technology must be carried out using affordable power consumption. NERSC is attempting to make this transition only once. We made NERSC-7 (Edison) an x86‐based system because our broad user base wasn’t ready in 2013 for GPUs, accelerators or coprocessors. NERSC is expecting a smooth progression to Exascale from a user's point of view -- usable Exascale computing.
Who does this affect?
These changes affect computing at all scales and NERSC is not alone. NERSC will leverage best practices from the LCFs, ACES, and others in the HPC community in preparing its users for the NERSC-8 architecture. Since future gains in supercomputing will continue to be limited by power, increased parallelism is a key ingredient in reaching the Exascale level of computing.
How did NERSC select this system?
NERSC-8 was selected using the open competitive acquisition strategy that NERSC has now used for eight generations of supercomputers. The method is referred to as Best Value Source Selection (BVSS).
Why is vectorization important on NERSC-8?
Architectures such as the Xeon Phi are more energy efficient, in part, because they rely on vector processing, which can generate multiple operations on different data using a single instruction and which requires simpler processor hardware control logic. Vector performance depends on several program characteristics, including the fraction of computation that can use the vector hardware and the efficiency with which that vector hardware is used. The Edison vector pipelines are 256 bits wide (4 double-precision words) while the NERSC-8 pipelines are 512 bits wide, meaning that longer vectorizable loop lengths will be required to achieve good vector efficiency. The main (non-vector) computational core in NERSC-8 is simpler than that in Edison and so a bigger portion of achieved overall speed may depend on effective use of the vector units.
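The two factors named above, the fraction of computation that uses the vector hardware and how fully each vector is used, can be explored with a small idealized model. The Python sketch below is an assumption-laden illustration, not a performance prediction: it counts only lane occupancy for a loop of a given length and applies an Amdahl-style bound to the vectorizable fraction, ignoring memory bandwidth, alignment, and instruction latency. The function names are illustrative.

```python
# Idealized model of vector efficiency. Ignores memory bandwidth, alignment,
# and instruction latency; intended only to show the trend with vector width.

DOUBLE_BITS = 64


def lanes(vector_bits):
    """Double-precision elements held by one vector register."""
    return vector_bits // DOUBLE_BITS


def lane_utilization(loop_length, vector_bits):
    """Fraction of lanes doing useful work over a loop of positive length."""
    n = lanes(vector_bits)
    vector_iterations = (loop_length + n - 1) // n  # ceiling division
    return loop_length / (vector_iterations * n)


def ideal_speedup(vector_fraction, vector_bits):
    """Amdahl-style upper bound when only part of the work vectorizes."""
    n = lanes(vector_bits)
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / n)


for length in (4, 10, 100, 1000):
    print(f"loop length {length:4d}: "
          f"256-bit {lane_utilization(length, 256):.3f}, "
          f"512-bit {lane_utilization(length, 512):.3f}")

for fraction in (0.5, 0.9, 0.99):
    print(f"vector fraction {fraction:.2f}: "
          f"256-bit {ideal_speedup(fraction, 256):.2f}x, "
          f"512-bit {ideal_speedup(fraction, 512):.2f}x")
```

In this model a four-element loop fills a 256-bit vector completely but only half of a 512-bit vector, which is the sense in which wider pipelines need longer loops. Likewise, a code in which only half of the work vectorizes cannot exceed a 2x speedup no matter how wide the vectors are.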
Who is Cori?
Gerty Cori (1896 – 1957) was an American biochemist who became the first American woman, and the third woman overall, to be awarded a Nobel Prize in science. She shared the 1947 Nobel Prize in Physiology or Medicine with her husband (and one other), with whom she had published nearly 80 papers detailing landmark discoveries related to enzyme and carbohydrate chemistry. The Cori Cycle, a fundamental process used to create energy in cells, and the Cori Ester are named after them. They were the first to ever isolate an allosteric enzyme and explain its function. She was also the first to show that a defect in an enzyme can be the cause of a human genetic disease.