NERSC-8 System: Cori
NERSC's next supercomputer system, named after American biochemist Gerty Cori, will be a Cray system based on the second generation of Intel® Xeon Phi™ Product Family, called Knights Landing (KNL) Many Integrated Core (MIC) Architecture. The system will have a sustained performance that is at least ten times that of the NERSC-6 "Hopper" system, based on a set of characteristic benchmarks. Some important characteristics of the system include:
- Knights Landing is a self-hosted architecture, not a co-processor, not an accelerator. "Self-hosted" means that it is a standalone bootable processor (running host OS)
- Next-generation Intel® Xeon Phi™ Knights Landing (KNL) product with improved single thread performance targeted for highly parallel computing. The KNL will have over 8 billion transistors per die based on Intel’s 14 nanometer manufacturing technology.
- Intel® "Silvermont" architecture enhanced for high performance computing; will feature 2X the Out-of-Order Buffer Depth of current Silvermont, Gather/scatter in hardware, Advanced Branch Prediction, 32KB Icache and Dcache, 2 x 64B load ports in Dcache, and 46/48 Physical/Virtual Address bits to match Xeon
- Over 9,300 single-socket nodes in the system, with each node having a theoretical peak performance greater than 3 teraflop/s
- Better performance per watt than previous generation Xeon Phi™ systems and 3X single-thread performance
- MPI + OpenMP programming model
- AVX512 Vector pipelines with a hardware vector length of 512 bits (eight double-precision elements)
- On-package, high-bandwidth memory, up to 16GB capacity with bandwidth projected to be 5X the bandwidth of DDR4 DRAM memory (>400 GB/sec); over 5X the energy efficiency of GDDR5; over 3X the density of GDDR5. This MCDRAM memory will have flexible memory modes, including cache and flat.
- Greater than 60 cores per node with support for four hardware threads each; more cores than current generation Intel Xeon Phi™
- Processor cores will be connected in a 2D mesh network, with 2 cores per tile, a 1-MB cache-coherent L2 cache shared between the 2 cores in a tile, and two vector processing units per core
- KNL will support multiple NUMA domains per socket
- 96 GB DDR4 memory per node using 6 channels
- Intel, Cray, and GNU programming environments
- Cray Aries high speed "dragonfly" topology interconnect, cabinets, and cooling (same as in Edison)
- Lustre filesystem with > 700 GB/sec I/O bandwidth and 28 PB of disk capacity
- System will include a 'burst buffer', a layer of NVRAM between memory and disk intended to accelerate certain I/O operations. The Burst Buffer will have >1.5PB of capacity and provide >1.5TB/sec I/O bandwidth
Installation will be mid-2016 in the new CRT building in Berkeley.
Preparing NERSC Users for Cori
A key element of Cori involves beginning to transition the broad NERSC computational workload to energy efficient, "manycore" architectures. We at NERSC fully recognize that this may be a challenge for some projects. Although it should be easy to get applications running on Cori, we expect that many applications will require code modification to achieve good performance. NERSC launched the NERSC Exascale Science Application Program (NESAP) in the fall of 2014 to closely partner with selected application, library, and tools teams. The lessons learned through NESAP will be used to launch training programs and document case studies that will be available to our broad user community. We will also actively collaborate with and learn from other sites and various research efforts throughout the HPC community.
NERSC began its Application Readiness effort over a year ago to study thread parallelism and vectorization in key codes on local test beds. The result of this effort is a compendium of case studies reflecting porting effort, performance results, best practices, and common issues observed using real codes. A key finding from this effort is that although the transition can be disruptive in terms of programming (codes must be modified in various ways to achieve good performance), a positive outcome is that the resulting codes can also perform better on traditional architectures.
In a nutshell, the challenge of getting good performance on Cori involves three things:
- Finding more parallelism in your simulation code;
- Expressing different kinds of parallelism than what your code may have now, meaning at a different level of granularity and with a different programming model;
- Managing multiple layers of memory and lower amounts of memory per core.
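The memory constraint in the last point can be made concrete with a little arithmetic. The sketch below uses the 96 GB of DDR4 per node listed above; the function name `mem_per_rank_gb` is hypothetical, and the core count of 60 used in the note that follows is an illustrative assumption, since this page states only "greater than 60 cores". The estimate ignores memory consumed by the operating system and system software, so real values would be lower.

```c
#include <assert.h>

/* Memory available to each MPI rank if a node's memory is divided
 * evenly among the ranks placed on that node.
 * Illustrative helper only; it ignores memory used by the OS. */
double mem_per_rank_gb(double node_mem_gb, int ranks_per_node)
{
    assert(node_mem_gb > 0.0 && ranks_per_node > 0);
    return node_mem_gb / ranks_per_node;
}
```

Under these assumptions, `mem_per_rank_gb(96.0, 60)` gives about 1.6 GB for one rank per core, and `mem_per_rank_gb(96.0, 240)` gives about 0.4 GB for one rank per hardware thread. Running 4 ranks per node, each with many OpenMP threads, would leave 24 GB per rank.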
Using Edison to Help Prepare for Cori
You can use Edison today to start preparing for Cori. Most codes on Edison assign a single MPI rank to each processing element; however, on Cori, this may not be possible, due to memory limitations. On Cori it will be necessary to use less MPI parallelism and more OpenMP parallelism. In many codes, OpenMP threads can be added incrementally, via pragmas or compiler directives. In other codes, more restructuring may be needed. You should try adding OpenMP to your code on Edison today. NERSC has collected some useful information and tutorials in the OpenMP Resources page. Think about how your problem can be decomposed using thread-level parallelism, meaning independent threads of execution within each MPI process. Doing this on Edison is not only a good preparation for Cori; it may also improve performance on Edison and/or allow you to run larger simulations.
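As a minimal sketch of what adding OpenMP incrementally can look like, the C function below parallelizes a single loop with one directive. The function name `sum_of_squares` is hypothetical and the loop is deliberately trivial; real codes usually need more care with shared data.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squares of x[0..n-1].
 * The pragma asks OpenMP to divide the loop iterations among threads
 * and to combine each thread's partial sum at the end (a reduction).
 * When built without OpenMP support, the pragma is ignored and the
 * loop simply runs serially, so the code remains valid either way. */
double sum_of_squares(const double *x, long n)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        sum += x[i] * x[i];
    }
    return sum;
}
```

With GCC, OpenMP is enabled by the `-fopenmp` flag; other compilers use their own options. In a hybrid code, a function like this would be called from within each MPI process, and the number of threads is typically set through the `OMP_NUM_THREADS` environment variable.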
Both Cori and Edison utilize vector processing, but it will be particularly important on Cori. Vector hardware improves speed and energy efficiency by issuing instructions that carry out identical operations on different data. If you write Fortran DO (or C "for") loops, in many cases the compiler will generate vector instructions automatically. You can (and should) use Edison today to determine how much vectorization is present in your code. Many common programming constructs can prevent or obscure vectorization, meaning that loops are often written in a way that prevents the compiler from vectorizing them. Using Edison you can have the compiler emit diagnostics to tell you if loops are vectorized or if a programming construct or data dependence prohibits it. Sometimes, loops can easily be rewritten; sometimes, entire algorithms will have to be restructured for good vector performance. Good vector performance requires several things, among them that all loops in the computationally intensive portions of your code vectorize and that the loop lengths are large.
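To illustrate the difference between a loop the compiler can usually vectorize and one it cannot, here is a small C sketch. Both function names are hypothetical, and whether a given compiler vectorizes the first loop depends on its version and optimization flags.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: each y[i] depends only on x[i] and y[i].
 * "restrict" promises the compiler that x and y do not overlap, which
 * removes a common obstacle to automatic vectorization. */
void scaled_add(long n, double a, const double *restrict x, double *restrict y)
{
    assert(n >= 0);
    for (long i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Loop-carried dependence: iteration i reads the value written by
 * iteration i-1, so the iterations cannot simply run side by side in
 * vector lanes. Compilers typically report this loop as not vectorized. */
void running_sum(long n, double *y)
{
    assert(n >= 0);
    for (long i = 1; i < n; i++) {
        y[i] = y[i] + y[i - 1];
    }
}
```

With GCC, for example, compiling at `-O3` with `-fopt-info-vec` lists the loops that were vectorized, and `-fopt-info-vec-missed` explains why others were not. The Intel and Cray compilers have their own reporting options.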
NERSC has begun preparing extensive documentation about vectorization. We will continue to update this web page as we gain more experience.
Does vectorization matter on Edison? Yes! Your code could run twice as fast on Edison if it is properly vectorized. See the documentation above for examples.
Using the NERSC Babbage Test Bed to Help Prepare for Cori
NERSC's Babbage test bed system contains Intel "Knights Corner" MIC architecture processors and it can be used to begin the process of transitioning to manycore computing because it contains 60 computational cores per chip. You should use this system in "native" mode, mimicking a self-hosted processor, because this will be the most similar mode to the Cori architecture. Babbage can be used to compile codes, to enhance vectorization and parallel threading, and to run in a relatively memory-poor environment. However, because the processor is a previous, early generation of the Intel manycore architecture, and because MPI performance between MIC cards across nodes on Babbage is not optimal (since communication has to go through PCIe connection between host nodes and MIC cards), you should not assume that performance on Babbage is in any way representative of Cori. You should, however, use Babbage to improve single-node performance of your code, especially vectorization and thread scalability. See the Babbage web pages for more information on this system.
| Milestone | Date |
|---|---|
| Mission Need Approved | Nov 2012 |
| Cori RFP Released | Aug 2013 |
| Cori contract awarded to Cray and public system announcement | Apr 2014 |
| NERSC Exascale Science Application Program (NESAP) Call for Proposals | June 2014 |
| NESAP kickoff meeting | Sep 2014 |
| Knights Landing Whitebox testbeds available for NESAP users | Jan 2016 |
| Cori system delivery (expected) | June 2016 |
| NESAP Early Users on Cori | Sep 2016 |
Questions About Cori
What is a manycore architecture?
Although there's no precise definition, especially since architectures evolve over time, manycore processors generally have a large number of computational cores on a single die with a scalable on-die network. A key difference between "manycore" and "multicore" is that current manycore architectures have relatively simplified, lower-frequency computational cores offering somewhat less instruction-level parallelism, engineered for lower power consumption and heat dissipation. Multicore chips evolved somewhat slowly, from 2 to about 12 cores per chip, consuming roughly the same energy as a single-core chip did, whereas manycore chips represent more of a leap, to 50-100 cores per chip that consume the same or less energy per core. Again, a key distinction is that in manycore chips, core count matters more than individual core performance, whereas in multicore chips, individual core performance is not sacrificed. Another key distinction is in programming: multicore processors emphasize instruction-level parallelism that is obtained automatically by the hardware, whereas manycore processors emphasize thread-level and data-level parallelism that generally must be explicitly expressed by the programmer.
Why is NERSC acquiring a manycore architecture?
NERSC directly supports DOE’s science mission. We focus on the scientific impact of our users, but the only way to continue to provide compute speed improvements that meet user need is to move to more energy-efficient architectures. Throughout the computing industry, processor speeds have stalled because of unavoidable power constraints, despite continued increases in transistor density. Instead of improving clock frequency, chip vendors have been increasing the number of cores on a chip, and this trend is expected to continue. Vendors are also simplifying the hardware and instruction set architectures of those cores as a way of reducing power consumption. These changes are directly related to those required to achieve Exascale computing levels. Exascale computing will provide the computational resources needed to solve science challenges in DOE’s mission, and NERSC will align its acquisitions with Exascale-relevant architectures. But such changes will affect computing at all levels, and it is vital to ensure that DOE applications continue to efficiently harness the potential of commercial hardware.
DOE realizes that extreme scale computing cannot be achieved by a "business-as-usual," evolutionary approach and that its missions pushing the frontiers of science and technology must be carried out using affordable power consumption. NERSC is attempting to make this transition only once. We made NERSC-7 (Edison) an x86‐based system because our broad user base wasn’t ready in 2013 for GPUs, accelerators or coprocessors. NERSC is expecting a smooth progression to Exascale from a user's point of view -- usable Exascale computing.
Who does this affect?
These changes affect computing at all scales and NERSC is not alone. NERSC will leverage best practices from the LCFs, ACES, and others in the HPC community in preparing its users for the Cori architecture. Since future gains in supercomputing will continue to be limited by power, increased parallelism is a key ingredient in reaching the Exascale level of computing.
How did NERSC select this system?
The Cori system, whose procurement was called "NERSC-8", was selected using the open competitive acquisition strategy that NERSC has now used for eight generations of supercomputers. The method is referred to as Best Value Source Selection (BVSS).
Why is vectorization important on Cori?
Architectures such as the Knights Landing Xeon Phi are more energy efficient, in part, because they rely on vector processing, which performs multiple operations on different data using a single instruction and which requires simpler processor control logic. Vector performance depends on several program characteristics, including the fraction of computation that can use the vector hardware and the efficiency with which that vector hardware is used. The Edison vector pipelines are 256 bits wide (4 double-precision words), while the Knights Landing vector pipelines in Cori are 512 bits wide (8 double-precision words), meaning that longer vectorizable loops will be required to achieve good vector efficiency. The main (non-vector) computational core in Cori is simpler than that in Edison, so a bigger portion of achieved overall speed may depend on effective use of the vector units.
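The effect of wider pipelines on short loops can be estimated with a simple model. The sketch below is an illustration under stated assumptions, not a performance predictor: the function names are hypothetical, and the model assumes only that leftover iterations occupy one partially filled vector. It ignores alignment, masking cost, and loop startup overhead.

```c
#include <assert.h>

/* Number of double-precision (64-bit) elements per vector register. */
int lanes_for_width(int vector_bits)
{
    assert(vector_bits > 0 && vector_bits % 64 == 0);
    return vector_bits / 64;
}

/* Fraction of vector lanes doing useful work for a loop of n
 * iterations, assuming the remainder iterations occupy one partially
 * filled vector. A simplified model for illustration only. */
double lane_utilization(long n, int vector_bits)
{
    assert(n > 0);
    long lanes = lanes_for_width(vector_bits);
    long vector_ops = (n + lanes - 1) / lanes;  /* ceiling division */
    return (double)n / (double)(vector_ops * lanes);
}
```

Under this model a loop of 4 iterations fills a 256-bit vector completely (utilization 1.0) but only half of a 512-bit vector (0.5), and a loop of 10 iterations uses 10 of 16 lanes (0.625) at 512 bits. A loop of 1000 iterations fills every lane at 512 bits, which is why long loop lengths help.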
Who is Cori?
Gerty Cori (1896 – 1957) was an American biochemist who became the first American woman, and the third woman overall, to be awarded a Nobel Prize in science. She shared the 1947 Prize in Physiology or Medicine with her husband (and one other), with whom she had published nearly 80 papers detailing landmark discoveries related to enzyme and carbohydrate chemistry. The Cori Cycle, a fundamental process used to create energy in cells, and the Cori Ester are named after them. They were the first to isolate an allosteric enzyme and explain its function. She was also the first to show that a defect in an enzyme can be the cause of a human genetic disease.