NERSC Partners with Cray, ESnet to Bring Software Defined Networking to Cori
Highly capable networking 'a critical component' of the science discovery workflow
November 1, 2016
Detectors and imaging facilities coming online in the next three to five years are expected to produce data in excess of 1 terabyte per second. These data-intensive scientific workloads require systems that can ingest and process data from scientific instruments and sensor networks. A major challenge, however, is enabling high performance computing systems like the Cori supercomputer at Berkeley Lab’s National Energy Research Scientific Computing Center (NERSC) to ingest data from these instruments effectively.
“Facilities are increasingly looking to supercomputer centers to meet their computing needs, but step one is getting the data into the supercomputer,” said Shane Canon, Project Engineer in the Data and Analytics Services Group at NERSC. “Unfortunately, current approaches aren’t sufficient.”
This dilemma prompted NERSC, in partnership with Cray, to explore new ways to move data in and out of Cori, a Cray XC40, more efficiently. One potential solution is to integrate software defined networking (SDN) capabilities to allow scientists at experimental facilities such as the Linac Coherent Light Source (LCLS) at SLAC to co-schedule networking bandwidth, compute resources and Burst Buffer bandwidth.
Software defined networking encompasses several kinds of network technologies to make the network as agile and flexible as the virtualized server and storage infrastructure found in the modern data center. The goal of SDN is to allow network engineers and administrators to respond quickly to changing user needs.
“We need to take advantage of a network guru’s design for moving data for a specific experiment but have SDN do all of the bookkeeping for which compute nodes need to be connected to what networks,” said Brent Draney, group lead for the Networking, Servers and Security Group at NERSC. “I would rather see our network engineers analyze the data flow and how to meet the need instead of having to manually reconfigure the network for the demands of each job.”
Incorporating SDN into the NERSC workflow requires dynamic scheduling and provisioning of networking end points that map to compute resources in Cori. NERSC engineers are initially focusing on basic capabilities to do high-bandwidth network address translation from the private address space used inside Cori to routable public addresses. Software-based routers running Brocade's vRouter will be connected to an Ethernet network and will be responsible for routing IP traffic to external networks. These routers are also associated with an SDN controller based on the OpenDaylight Controller. Eventually this controller will be integrated with the SLURM scheduler to handle dynamic provisioning.
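The address translation step described above can be illustrated with a minimal Python sketch. It is not NERSC's or Brocade's implementation: the `NatTable` class, its method names, and the address ranges are hypothetical, chosen only to show how a private compute-node address might be bound one-to-one to a routable address drawn from a pool.

```python
import ipaddress


class NatTable:
    """Toy one-to-one NAT table (hypothetical; for illustration only)."""

    def __init__(self, public_pool: str):
        # Assumed pool of routable addresses; the example below uses a
        # range reserved for documentation, not a real NERSC range.
        self._free = list(ipaddress.ip_network(public_pool).hosts())
        self._bindings = {}

    def bind(self, private_addr: str):
        """Return the public address bound to a private address,
        allocating one from the pool on first use."""
        private = ipaddress.ip_address(private_addr)
        if not private.is_private:
            raise ValueError(f"{private} is not a private address")
        if private not in self._bindings:
            if not self._free:
                raise RuntimeError("public address pool exhausted")
            self._bindings[private] = self._free.pop(0)
        return self._bindings[private]

    def release(self, private_addr: str) -> None:
        """Drop a binding and return its public address to the pool."""
        public = self._bindings.pop(ipaddress.ip_address(private_addr), None)
        if public is not None:
            self._free.append(public)


nat = NatTable("198.51.100.0/30")      # two usable addresses
first = nat.bind("10.0.0.5")           # allocates from the pool
again = nat.bind("10.0.0.5")           # reuses the existing binding
second = nat.bind("10.0.0.6")
```

In a dynamically provisioned setup, a scheduler would presumably call something like `bind` when a job starts and `release` when it ends, so the public pool is shared across jobs over time.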
“The SDN gateway nodes are going in now as part of the Cori Phase 1 and 2 integration, which features the Intel Xeon Phi ‘Knights Landing’ manycore processors,” said Jason Lee, deputy of networking and security at NERSC. “Our goal is to allow the network to become a schedulable resource, which would enable jobs and devices to schedule time on the computer and bandwidth on the network at the same time, and then run in the allocated time-slot. This would free up network engineers from having to manually set up the network for experiments.”
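The idea of treating the network as a schedulable resource can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SLURM or OpenDaylight integration: the `Resource` class, the `co_schedule` function, the hourly time slots, and all capacities are hypothetical. It finds the earliest time slot in which both the requested compute nodes and the requested bandwidth are free, then reserves both together.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A pool with fixed capacity and (start, end, amount) reservations."""
    capacity: int
    reservations: list = field(default_factory=list)

    def free_at(self, t: int) -> int:
        # Capacity left in time slot t after existing reservations.
        used = sum(a for s, e, a in self.reservations if s <= t < e)
        return self.capacity - used

    def fits(self, start: int, duration: int, amount: int) -> bool:
        return all(self.free_at(t) >= amount
                   for t in range(start, start + duration))

    def reserve(self, start: int, duration: int, amount: int) -> None:
        self.reservations.append((start, start + duration, amount))


def co_schedule(compute, network, duration, nodes, gbps, horizon=24):
    """Reserve nodes and bandwidth in the same slot; return its start,
    or None if no common slot exists within the horizon."""
    for start in range(horizon - duration + 1):
        if (compute.fits(start, duration, nodes)
                and network.fits(start, duration, gbps)):
            compute.reserve(start, duration, nodes)
            network.reserve(start, duration, gbps)
            return start
    return None


compute = Resource(capacity=100)   # hypothetical node count
network = Resource(capacity=40)    # hypothetical Gbps of bandwidth
network.reserve(0, 2, 30)          # another experiment holds 30 Gbps early on
slot_a = co_schedule(compute, network, duration=3, nodes=50, gbps=20)
slot_b = co_schedule(compute, network, duration=3, nodes=60, gbps=10)
```

The key design point is that a job is placed only where both resources are available at once, so neither the nodes nor the bandwidth sit reserved while waiting for the other.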
Lee will spend time next summer at the Swiss National Supercomputing Centre (CSCS) to share his SDN expertise and NERSC’s experience working with it on Cori.
NERSC is also partnering with ESnet to explore how this capability can be integrated and extended across the WAN. The ultimate goal is to enable end users to provision end-to-end connectivity and bandwidth from facilities like the LCLS that extend into systems like Cori to enable real-time analysis.
“Highly capable networking has become a critical component of the science discovery workflow that integrates geographically distributed experimental and computational facilities into a ‘superfacility,’” said Inder Monga, director of ESnet. “Building an end-to-end, near-real-time interaction between science applications, workflows, local and wide-area network and leveraging SDN has the potential to dramatically increase the efficiency of scientific discovery. We look forward to collaborating with NERSC and other DOE facilities to experiment and enable this model.”
ESnet has been leading the experimentation, development and deployment of SDN within the research and education community and collaborating with multiple DOE ASCR-funded research projects to apply this technology for the benefit of science, he added.
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 6,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. DOE Office of Science.