NERSC and the HPC Community Bid Farewell to Cori Supercomputer
The Cray XC40 system ushered in a new era of supercomputing for science
May 17, 2023
By Kathy Kincade
After nearly seven years of service, thousands of user projects, and tens of billions of compute hours, the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC) will be retired at the end of May. With its first cabinets installed in 2015 and the system fully deployed by 2016, Cori has been in service longer than any supercomputer in NERSC’s 49-year history and enabled more than 10,000 scientific publications. And its technological innovations reflect the dynamic evolution of high performance computing (HPC) over the past decade, paving the way for the next generation of scientific computing.
Cori was developed through a partnership with Intel, Cray (now HPE), and Los Alamos and Sandia National Laboratories. The Cray XC40 system was named in honor of biochemist Gerty Cori, the first American woman to win a Nobel Prize in science and the first woman to win a Nobel Prize for Physiology or Medicine. It comprises 2,388 Intel Xeon Haswell processor nodes, 9,688 Intel Xeon Phi Knights Landing (KNL) nodes, and a 1.8 PB Cray DataWarp Burst Buffer and has a peak performance of ~30 petaflops; when it debuted in 2017, it ranked fifth on the TOP500 list.
Cori was also the first supercomputer to be installed from the ground up in Lawrence Berkeley National Laboratory’s (Berkeley Lab’s) then-new Shyh Wang Hall, influencing the building’s infrastructure and prompting the implementation of numerous energy-efficiency innovations on the system and throughout the facility. In addition, the introduction of Cori’s manycore KNL architecture changed the way NERSC interacts with users, leading to the implementation of the NERSC Exascale Science Applications Program (NESAP) and the NESAP post-doc program, both of which are still running strong as NERSC moves into the Perlmutter GPU era.
“Cori has been an exciting system for a number of reasons, including the fact that it was the first energy-efficient architecture that NERSC deployed,” said NERSC Director Sudip Dosanjh. “It was clear to us with the advent of exascale computing that to get more computational power we needed to go to an energy-efficient, manycore architecture.”
“Cori has been a workhorse for our center and a very productive environment for the user community,” added Katie Antypas, NERSC division deputy who has been involved with multiple procurements at NERSC, including Cori and Perlmutter. “It is also where we developed our data strategy, did Jupyter at scale for the first time, and were able to prototype and test a lot of the capabilities that are now on Perlmutter.”
The Impact of KNL, and More
Other innovative features introduced on Cori that have influenced current and next-generation architectures include the Burst Buffer (which laid the foundation for Perlmutter’s all-flash file system), high-bandwidth memory, increased vector capability, real-time queues, deep learning library support, Globus sharing connections, and workflow service nodes.
“With Cori, we deployed the manycore KNL, a more specialized processor that could yield higher performance,” said Jack Deslippe, who leads the Application Performance Group at NERSC, was involved with the procurement and development of Cori, and oversees the NESAP program. “It gave users the opportunity to use the high-bandwidth memory that the KNL processors have right on their chip and the manycore aspect of the chip, which had 68 cores – significantly higher than anything before that.”
The design and stability of the Cray Dragonfly interconnect and the machine’s cooling system, along with the modernization of the software stack, also enhanced its utility for users, noted Tina Declerck, division deputy for operations and project lead on the Cori installation and deployment.
“In terms of interconnectedness, the design team did a really good job of finding the right path through the system,” she said. “Cori has handled all kinds of network failures effectively, which allowed the science to continue without interruption, with fewer hops and faster node-to-node communication.”
Shyh Wang Hall, the building that has housed Cori and now Perlmutter, was also integral to this success, added Jeff Broughton, former deputy of operations at NERSC, who retired in 2022 after 13 years at NERSC.
“In 2015, we were building the new building and looking at what we had to do to get from the Oakland Scientific Facility (where NERSC had been located since 2001) to the Berkeley Lab campus with minimum impact on our users,” he said. “So we decided to acquire Cori Phase 1 (the Haswell partition) and install it in the new building so we could then turn off Edison and move it from Oakland to Berkeley Lab with essentially no disruption to service.”
The design of the building was uniquely influenced by the needs of the Cori system, Broughton added. “Normally, when we do site prep for a machine, all the electrical and plumbing work, etc., is done as part of the project. But in this case, it was done as part of the building construction.”
One key result of this was that the building was designed to run without any mechanical refrigeration for Cori and NERSC’s future supercomputing systems.
“Running big compressors to produce 50-degree F water is what consumes a huge amount of energy in a data center, so we decided early on to do the building without compressors, which is what allows NERSC to get its extraordinarily high energy efficiency,” Broughton said. “Serendipitously, Cray enabled us to do that by delivering a machine capable of running at 80 degrees F. Had they not been able to do that, we might have had to put refrigeration into the building or find an alternative.”
A New Design Approach
Cori’s impact on HPC and scientific computing goes well beyond its technological innovations. It also changed the way NERSC and others in the HPC community began to think about how supercomputers could be configured in ways that would better serve the user community. For example, prior to Cori, NERSC typically had a main supercomputer with several smaller clusters around it that served specific user communities, noted Pete Ungaro, former CEO of Cray, who continues to work in HPC as an industry consultant.
“Cori was where we worked to aggregate a lot of those unique capabilities in those unique systems into the main supercomputer itself,” he said. “NERSC had done this really interesting study for the Department of Energy about the cost of computing, and it showed that the small clusters around the supercomputer were more expensive to run and maintain than the main supercomputer platform. So we had a lot of discussions around how to bring these capabilities into the main supercomputer and make it more flexible and less monolithic. As a result, Cori made a dent in doing some unique things that people just weren’t able to do on supercomputers at the time.”
David Trebotich – a staff scientist in the Applied Numerical Algorithms Group at Berkeley Lab – has been one such user, and a prolific one at that. Cori has been instrumental in enabling him to scale up his research in entirely new ways, yielding previously unattainable results with his Chombo-Crunch software and beyond. His team has used Cori for numerous projects, including subsurface flow and transport, paper manufacturing, and water desalination. He is also the principal investigator (PI) on the Chombo-Crunch project and the application code development and performance portability lead on the ECP Subsurface project.
“I found Cori to be way more productive than I thought it was going to be early on, and the NESAP program had a lot to do with getting us up to production capability,” said Trebotich, whose simulations from some of his work are colorfully displayed in the Cori system panel art. “Among other things, we’ve been able to achieve really great performance with reduced memory footprint for high-resolution simulations of subsurface reactive transport processes and, in general, simulations of flow and transport in heterogeneous materials.”
“Cori gave the scientific community access to much bigger capability and resource,” Ungaro said. “Instead of having to use a smaller specialty cluster, they now could leverage this huge supercomputer to do bigger datasets with much higher throughput and try many more different experimentations.”
Users Take a Deeper Dive
Over the past decade, as the HPC and scientific communities began to move toward exascale and energy-efficient architectures, NERSC wanted to make sure the scientific community wasn’t left behind and could effectively use these next-generation systems, Deslippe noted. “A lot of what we have done with Cori, and now Perlmutter, has paved the way for the HPC ecosystem to continue transitioning.”
NESAP has been a key component of this evolution, and Cori was the catalyst for this program, Deslippe and others emphasized. Through NESAP, NERSC initially partnered with code teams and library and tool developers to prepare their codes for Cori’s manycore architecture. More recently, the program has done the same to help users optimize their applications for Perlmutter’s GPU architecture.
“Cori was the first time that NERSC worked with users to optimize for a new platform,” said Rebecca Hartman-Baker, who leads NERSC’s User Engagement Group. “The KNL manycore architecture was unique and innovative at the time, and people needed some help in understanding how to use it. Once they did, it really took off.”
This dynamic has had a lasting impact on how NERSC engages with its users, Deslippe added. “It caused us to rethink how we engage with the user community and the application developer community and to work with them at a much deeper level than we had before,” he said. “We’d always had a consulting team to answer user questions and help them compile and build applications. But with Cori, we formed a new team to work with users directly on their codes and go into the trenches with them as they were preparing their applications for the new architecture.”
Part of this effort involved partnering with colleagues in Berkeley Lab’s Computational Research Division (now two divisions: Applied Math and Computational Research, and Scientific Data) to design tools and performance models, such as Roofline, that users could apply to understand performance in an absolute sense and to determine which optimization directions would be most profitable on the new processors, Deslippe added.
“Our users are typically scientists first and not necessarily computer architects, so what we wanted to do and were successful in doing was coming up with tools that would lower the barrier of entry and make it easier for users to understand performance on the system,” he said.
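The Roofline model mentioned above can be summarized in one line: a kernel’s attainable performance is capped either by the machine’s peak compute rate or by memory bandwidth multiplied by the kernel’s arithmetic intensity (FLOPs performed per byte moved). A minimal sketch, with illustrative hardware numbers that are assumptions rather than official KNL specifications:

```python
# Minimal sketch of the Roofline performance model.
# Attainable GFLOP/s = min(compute ceiling, bandwidth * arithmetic intensity).

def roofline(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s for a kernel with the given arithmetic
    intensity, measured in FLOPs per byte moved from memory."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

# Illustrative node parameters (assumed values, not vendor specs):
PEAK_GFLOPS = 3000.0   # compute ceiling, GFLOP/s
BANDWIDTH = 450.0      # high-bandwidth on-package memory, GB/s

for ai in (0.5, 2.0, 10.0):
    gflops = roofline(ai, PEAK_GFLOPS, BANDWIDTH)
    print(f"AI = {ai:5.1f} FLOP/byte -> {gflops:7.1f} GFLOP/s attainable")
```

A kernel with low arithmetic intensity sits on the bandwidth-limited slope of the roofline, so the profitable optimization is reducing data movement; only once intensity is high enough does vectorization toward the compute ceiling pay off — which is the kind of guidance the model gave Cori users.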
Cori also changed the way NERSC interacts with vendors, particularly from a design perspective, noted Nick Wright, who leads the Advanced Technologies Group at NERSC and was chief architect on Cori (NERSC-8), Perlmutter (NERSC-9), and now on NERSC-10 (in development).
“Cori was the first machine where we partnered with the vendors rather than simply buying the machine from them,” he said. “It was also the first system we procured jointly with LANL, Sandia, and Cray, and the first one where we did a non-recurring engineering (NRE) project (the Burst Buffer). The experience with the NRE taught us the value of strong and deep co-design with vendors.”
Wright sees many of these trends continuing for current and future system procurements, and he considers the Cori procurement pivotal to NERSC’s progression to exascale and beyond. “It is really clear, looking out into the future, that tighter co-design partnerships with vendors will be even more necessary,” he said.
In the long run, Cori laid the groundwork for a new generation of HPC architectures and served as a testing ground for many features that are now on Perlmutter and other supercomputing systems. It also enabled numerous ground-breaking scientific achievements, from environmental, chemical, energy, materials science, applied physics, and nuclear physics research to climate, biology, cosmology, and quantum computing simulations.
“Cori was a very exciting system to bring in,” Deslippe said. “It was an all-hands-on-deck activity to make Cori a productive system for users, and it took every bit of NERSC’s expertise. For NERSC staff, there was a lot of excitement around the challenge of deploying a first-of-its-kind system like this.”
“I look at Cori as another step in the evolution of HPC,” Broughton added. “Basically, it shows that NERSC continues to be on the leading edge of deploying new and novel systems that will help our users maintain their advantage in scientific computing.”
NERSC is a DOE Office of Science user facility.
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 7,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.