For more than 45 years the National Energy Research Scientific Computing Center (NERSC) has been providing leading-edge computational and data systems to support scientific discovery for the DOE Office of Science. Launched in 1974 as a computing resource for fusion energy research at Lawrence Livermore National Laboratory, NERSC quickly expanded its role to include users from all SC program offices, and in 1996 moved to Lawrence Berkeley National Laboratory. Today its 8,000 users make NERSC perhaps the most scientifically productive high performance computing (HPC) center in the world, with NERSC users producing more than 2,500 refereed publications annually. In addition, over the years the center has been associated with six Nobel Prize-winning scientists or their teams.
NERSC places a premium on delivering cutting-edge technology at scale and making it highly usable and productive in its role as the mission HPC center for the DOE Office of Science. As a unique resource, the center has been a leader in fielding next-generation supercomputing systems like the Cray T3E-900, the IBM SP POWER 3, the IBM POWER 5+, and the Cray XT4. NERSC’s recent supercomputers were the first of their kind: Edison, a Cray XC30, included the first two cabinets of a line developed by Cray with DARPA support; and Cori, a Cray XC40, is the first and largest system with Intel Xeon Phi “Knights Landing” processors, debuting as the 5th most powerful supercomputer in the world in 2016. In addition to its computing capabilities, Cori was the first large system to deploy an all-flash “burst buffer” that provided a world-best 1.7 TB/sec of file system bandwidth when it was introduced. The first Cray/HPE Shasta system will arrive in late 2020 as the NERSC Perlmutter supercomputer.
NERSC is also a large-scale data analysis and storage center: there is a net flow of data into the center. Its tape archive, currently based on the High Performance Storage System (HPSS) technology, holds 200 PB of data, some of which dates back to the first days of the center in the 1980s. NERSC is also unique among DOE National Laboratory computing centers in that it has never deleted scientific data from its archive. The center currently hosts datasets of more than 5 PB each in climate science, genomics, nuclear physics, light source science, cosmic microwave background studies, and a number of neutrino and high-energy physics experiments. To help scientists address their growing data needs, NERSC created a Data Department in 2015 and has developed a vigorous program supporting deep learning for science. The center maintains state-of-the-art external networking capabilities – provided by ESnet – to help scientists move data to the center for analysis and archiving. NERSC has a long history of supporting data-driven science; many high-energy and nuclear physics teams have used NERSC data systems and the PDSF cluster for their analyses. NERSC also maintains a close collaboration with DOE’s Joint Genome Institute, which conducts high-throughput DNA sequencing, synthesis and analysis in support of BER’s bioenergy and environmental missions.
To ensure that scientists, whose HPC expertise ranges from world-class to that of a first-year graduate student, are as productive as possible, NERSC invests heavily in creating systems that are highly available while offering expert consulting and performance-optimization support from a team largely drawn from the scientific community itself. Among its innovations are the use of web technologies to expose job-level detail to users in the early 2000s, pioneering science gateways, developing the “Shifter” container technology for HPC, and, more recently, enabling machine learning, data analytics software, and workflow frameworks at scale. In 1978 NERSC developed CTSS, the Cray Time Sharing System, to provide a remote user interface to its Cray-1 supercomputer; the center was the first to checkpoint a full distributed-memory supercomputer (its Cray T3E) in 1997; and it launched the DOE INCITE program in 2003.
World-Class Computing for Nearly 50 Years
The world’s first unclassified supercomputing center had its origins in 1973, when Dr. Alvin Trivelpiece, then deputy director of the Controlled Thermonuclear Research (CTR) program of the Atomic Energy Commission, solicited proposals for a computing center that would aid the pursuit of fusion power, giving the magnetic fusion program under CTR access to computing power similar to that of the defense programs. Lawrence Livermore National Laboratory was chosen as the site for the new center, which would be called the CTR Computer Center (CTRCC), later renamed the National Magnetic Fusion Energy Computer Center (NMFECC), and eventually NERSC. Starting with a cast-off CDC 6600, within a year of its inception the center added a new CDC 7600 and provided, for the first time, a remote access system that allowed fusion energy scientists at Oak Ridge National Laboratory and Los Alamos National Laboratory (LANL), as well as the General Atomics research center in southern California, to communicate with the centralized computers.
The center continued to deploy leading-edge systems, and in 1978 NMFECC developed the Cray Time Sharing System (CTSS), which allowed remote users to interface with its Cray-1 supercomputer. At the time, computers were essentially custom machines, delivered without software, leaving centers to develop their own. Due to its success, CTSS was eventually adopted by multiple computing centers, including the National Science Foundation (NSF) centers established in the mid-1980s in Chicago, Illinois and San Diego, California. In 1985, when ORNL deployed a Cray X-MP vector processing system, that system also ran CTSS. NERSC next deployed the first four-processor system, the 1.9-gigaflop Cray-2, which replaced the Cray X-MP as the fastest in the world. Having been prepared for multitasking, CTSS allowed users to run on the Cray-2 just one month after delivery.
In 1983, the NMFECC opened its systems to users in other science disciplines, allocating five percent of system time to the other science offices in DOE’s Office of Energy Research, paving the way for a broader role of computation across the research community. By 1990, the center was allocating computer time to such a wide range of projects from all of the Office of Energy Research offices that the name was changed to NERSC.
The growing number of users and increased demand for computing resources led Trivelpiece, then head of DOE’s Office of Energy Research, to make another decision that mapped out a path for making those resources more widely accessible. He recommended that DOE’s Magnetic Fusion Energy network (MFEnet) be combined with the High Energy Physics network (HEPnet), to become ESnet (Energy Sciences Network) in 1986. ESnet’s roots stretch back to the mid-1970s, when staff at the CTRCC installed four acoustic modems on the center’s CDC 6600 computer.
As part of the High Performance Parallel Processing project with LANL, NERSC deployed a 128-processor Cray T3D machine, the first large-scale, parallel system from Cray Research, in 1994. The machine was used in a national laboratory-industry partnership to advance the development of parallel codes and upgraded to 256 processors within a year.
In 1996 NERSC moved from Livermore to Lawrence Berkeley National Laboratory and acquired the Cray T3E-600 system, its first massively parallel processing machine, which was upgraded to a T3E-900 the following year. The system brought with it a fundamental change in the computing environment, making it possible for scientists to perform larger and more accurate simulations. It also had the largest I/O system built to date, with 1.5 terabytes of disk storage and a read/write capability of 800 megabytes per second. Ranked No. 5 on the TOP500 list, this system, named MCurie, was the most powerful computer for open science in the U.S. NERSC’s upgraded T3E-900 provided the training platform for a materials science project led by ORNL’s Malcolm Stocks, whose code was the first application to reach a sustained performance of 1 teraflop.
By 2003, NERSC was supporting more than 4,000 users from all the Office of Science program offices, and requests for time on its systems were three times what was available. At the direction of Office of Science Director Raymond Orbach, NERSC launched the INCITE (Innovative & Novel Computational Impact on Theory & Experiment) program, which created a system for scientists to apply for and receive large allocations of time on NERSC computing resources. INCITE was expanded to include the leadership computing facilities (LCFs) in 2006, and the program is now supported by the ANL and ORNL facilities.
In November 2015, Berkeley Lab opened Shyh Wang Hall, a 149,000-square-foot facility housing NERSC, ESnet, and researchers in the laboratory’s Computational Research Division. The facility is one of the most energy-efficient computing centers anywhere, tapping into the San Francisco Bay’s mild climate to cool NERSC’s supercomputers and eliminate the need for mechanical cooling. The building earned a Gold LEED (Leadership in Energy and Environmental Design) certification from the U.S. Green Building Council.
The facility soon became home to NERSC’s next system, Cori, a 30-petaflop/s Cray system with Intel Xeon Phi (Knights Landing) processors. With 68 low-power cores and 272 hardware threads per node, Cori was the first system to deliver an energy-efficient, pre-exascale architecture for the entire Office of Science HPC workload. In 2019, NERSC began preparing for the installation of its next-generation, pre-exascale Perlmutter system, a Cray Shasta machine which will be a heterogeneous system comprising both CPU-only and GPU-accelerated cabinets.
Text in this section was derived from [email protected].
Advancing High Performance Computing and Data
As a leading HPC center, NERSC is developing technologies that advance the state of the art.
In the emerging field of deep learning for science, training at 15 petaflops was demonstrated on the Cori supercomputer in 2017, giving climate scientists the ability to use machine learning to identify extreme weather events in the output of huge climate simulations. Analyzing these datasets is challenging, so researchers from NERSC, Intel, the Montreal Institute for Learning Algorithms, and Microsoft Research teamed up to create a novel, semi-supervised convolutional deep neural network (DNN). Predictive accuracies ranging from 89.4% to as high as 99.1% showed that DNNs can identify weather fronts, tropical cyclones, and atmospheric rivers.
In 2015 NERSC developed and released “Shifter,” a container technology that allows users to bring their custom compute environments to NERSC’s supercomputers. Shifter is based on the Docker container technology, extending its use to HPC systems. Shifter was originally inspired by the need to improve the flexibility and usability of HPC systems for data-intensive workloads, but its use cases are expanding to include general HPC workloads. Soon after its initial deployment, numerous experimental facilities and academic institutions found that Shifter made it much easier for them to run their data-centric workloads in an HPC environment. In 2016 the supercomputing company Cray adopted Shifter as an official product, and that same year NERSC demonstrated that Shifter can be used to run complex, scientific, Python-based codes in parallel on more than 9,000 nodes on Cori using its Intel Xeon Phi processors. Shifter is currently a finalist for a 2018 R&D 100 Award.
"The supercomputing community continues to evolve in our shared quest for discovery and scientific breakthroughs," said Ryan Waite, Cray's senior vice president of products. "We are seeing an increasing number of developers using new technologies to solve their problems. We are delighted to have partnered with NERSC in the development of this important technology."
When a Cray system based on power-efficient Intel Xeon Phi “Knights Landing” (KNL) processors was selected for its “NERSC-8” (aka Cori) procurement, NERSC knew its large user base would need help porting their codes to run efficiently on that architecture. In 2014 it started the NERSC Exascale Science Application Program (NESAP) to enable leading teams and their codes to run efficiently at scale on Cori. NERSC worked collaboratively with 20 application teams, Cray, and Intel to prepare key applications for Cori. By the time the system went into production in 2017, NESAP applications had improved their performance on KNL by 350%. At the SC14 conference in New Orleans, NERSC received HPCWire’s 2014 Editors’ Choice Award for Best HPC Collaboration Between Government & Industry, recognizing NERSC’s partnership with Intel and Cray in preparation for Cori.
NERSC introduced the world’s first HPC all-flash file system, or “burst buffer,” on Cori in 2015. Based on Cray’s DataWarp technology, Cori’s burst buffer achieved a world-best 1.7 TB/second of peak I/O performance with 28 million I/O operations per second and about 1.8 PB of storage. The burst buffer greatly improves I/O performance, particularly for codes that are I/O heavy but cannot use streaming large-block techniques. Many data analytics applications fall into this category, as their data can often be highly complex and unstructured. The paper “Accelerating Science with the NERSC Burst Buffer” won the Best Paper award at the 2016 Cray User Group meeting.
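From a user’s perspective, the burst buffer is requested through DataWarp directives in a Slurm batch script: the job asks for a per-job flash allocation, optionally stages data in, and then addresses the fast tier through an environment variable that Slurm sets. The capacity, paths, and application name below are illustrative assumptions, not values from a real job.

```shell
#!/bin/bash
# Illustrative Slurm batch script requesting a DataWarp burst-buffer
# allocation on a Cori-like system; sizes and paths are hypothetical.
#SBATCH --nodes=4
#SBATCH --time=01:00:00

# Request a 10 TB per-job scratch allocation, striped across flash nodes.
#DW jobdw capacity=10TB access_mode=striped type=scratch

# Stage the input directory into the burst buffer before the job starts.
#DW stage_in source=/global/cscratch1/sd/user/input destination=$DW_JOB_STRIPED/input type=directory

# Run against the flash tier; $DW_JOB_STRIPED points at the job's allocation.
srun -n 128 ./my_io_heavy_app $DW_JOB_STRIPED/input
```

Because the allocation exists only for the job’s lifetime, results that must persist are staged back out (or written directly) to the parallel file system, while the latency-sensitive small I/O lands on flash.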