NERSC: Powering Scientific Discovery Since 1974

Burst Buffer Architecture and Software Roadmap

NERSC has been working with Cray to bring Burst Buffer technology to the users of Cori. The NERSC Burst Buffer is based on Cray DataWarp, which uses flash or SSD (solid-state drive) technology to significantly increase I/O performance on Cori.

Motivation

To meet users' requests for better I/O performance, NERSC has installed a Burst Buffer. The Burst Buffer can improve two components of I/O performance:

  1. The total bandwidth available to an application. The higher the bandwidth, the faster a well-optimized application can read or write a large amount of data.
  2. The IOPS (Input/Output Operations per Second) of a file system. Many applications perform a large number of small I/O operations; in this case, IOPS becomes the limiting factor for performance.
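
The distinction between the two limits can be illustrated with a simple model. The following is a minimal sketch, not a performance predictor: it assumes that I/O time is bounded by whichever limit (bandwidth or IOPS) is slower, and the file system figures used in the example (1 GB/s, 10,000 IOPS) are hypothetical values chosen for illustration, not measurements of any NERSC system.

```python
def io_time_limits(total_bytes, op_size_bytes, bandwidth_bytes_per_s, iops):
    """Return (time_if_bandwidth_bound, time_if_iops_bound) in seconds."""
    # Number of I/O operations needed, rounded up to a whole operation.
    n_ops = -(-total_bytes // op_size_bytes)
    time_bandwidth = total_bytes / bandwidth_bytes_per_s
    time_iops = n_ops / iops
    return time_bandwidth, time_iops


def limiting_factor(total_bytes, op_size_bytes, bandwidth_bytes_per_s, iops):
    """Name the limit that dominates under this simplified model."""
    time_bandwidth, time_iops = io_time_limits(
        total_bytes, op_size_bytes, bandwidth_bytes_per_s, iops
    )
    return "bandwidth" if time_bandwidth >= time_iops else "iops"


if __name__ == "__main__":
    one_gib = 2**30
    # Hypothetical file system: 1 GB/s of bandwidth and 10,000 IOPS.
    for op_size in (4096, 8 * 2**20):
        factor = limiting_factor(one_gib, op_size, 1e9, 10_000)
        print(f"{op_size} byte operations: limited by {factor}")
```

In this model, writing 1 GiB in 4 KiB operations is limited by IOPS, while writing the same data in 8 MiB operations is limited by bandwidth.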

Improved I/O performance can help science applications in many ways:

    • Improved application reliability (through faster checkpoint-restart)
    • Accelerated application I/O performance for small block size transfers and analysis files
    • Fast temporary space for out-of-core applications
    • A staging area for jobs requiring large input files or persistent fast storage between coupled simulations
    • Post-processing analysis of large simulation data
    • In-transit visualization and analysis

Architecture

The image below illustrates a conceptual architecture of a Burst Buffer. The Burst Buffer resides on specialized nodes that bridge the internal interconnect of the compute system (Aries HSN) and the SAN fabric of the storage system through the I/O nodes. Fast SSDs are installed in the Burst Buffer nodes and are made available to computing jobs through the Scheduler and the DataWarp software stack.



 

The flash memory for Cray DataWarp is attached to Burst Buffer nodes that are packaged two nodes to a blade. Each Burst Buffer node contains a Xeon processor, 64 GB of DDR3 memory, and two 3.2 TB NAND flash SSD modules attached over two PCIe gen3 x8 interfaces. Each Burst Buffer node is attached to a Cray Aries network interconnect over a PCIe gen3 x16 interface. Each Burst Buffer node provides approximately 6.4 TB of usable capacity and a peak of approximately 5.7 GB/sec of sequential read and write bandwidth. The following graph illustrates the node-level architecture.


This architecture provides a number of features that fit well with the scientific workload at NERSC, including:

Scheduler Integration. Access to the Burst Buffer resource is integrated with the Scheduler of the system (SLURM in NERSC's case). The Scheduler can provision the Burst Buffer resource to be shared by a set of users or jobs. It can also handle automatic migration of data to and from the flash storage.

Caching Mode. The Burst Buffer can also provide a caching mode, where the flash resource is used as a caching layer for the large Lustre file system. This mode is transparent to user codes and provides high-performance I/O without the need for code modification. (Not yet available.)

In-transit Analysis. Data can be processed and filtered on the Burst Buffer node, a model for exascale processing. (Not yet available.)

Software Schedule 

The Burst Buffer software stack is expected to be delivered in four stages, depicted below. Stage 1 of the Burst Buffer software was delivered with Phase I of the Cori system in Fall 2015. A call for proposals for the Burst Buffer Early Access program was completed in August 2015, and the successful applications can be found here.

 

 

Burst Buffer availability on Cori

On the Cori system, the Burst Buffer provides approximately 1.7 TB/second of peak I/O performance with 28M IOPS, and about 1.8 PB of storage. The Burst Buffer is available to all Cori users, from both the Haswell and the KNL partitions.
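
As a rough consistency check, the system-level figures can be compared with the per-node figures quoted in the Architecture section. The sketch below is back-of-envelope arithmetic only: it assumes the system-level capacity and bandwidth refer to the Burst Buffer as a whole, that all quoted numbers use decimal units, and that aggregate performance scales linearly with node count. The resulting node counts are implied estimates, not figures stated on this page.

```python
# Per-node figures quoted in the Architecture section.
NODE_CAPACITY_TB = 6.4       # usable capacity per Burst Buffer node
NODE_BANDWIDTH_GB_S = 5.7    # peak sequential bandwidth per node

# System-level figures quoted for Cori.
SYSTEM_CAPACITY_PB = 1.8
SYSTEM_BANDWIDTH_TB_S = 1.7


def implied_node_count(system_total, per_node):
    """Nodes needed to reach system_total if every node contributes per_node."""
    return system_total / per_node


# Convert PB -> TB and TB/s -> GB/s (decimal units assumed).
nodes_by_capacity = implied_node_count(SYSTEM_CAPACITY_PB * 1000, NODE_CAPACITY_TB)
nodes_by_bandwidth = implied_node_count(SYSTEM_BANDWIDTH_TB_S * 1000, NODE_BANDWIDTH_GB_S)

if __name__ == "__main__":
    print(f"Implied node count from capacity:  {nodes_by_capacity:.0f}")
    print(f"Implied node count from bandwidth: {nodes_by_bandwidth:.0f}")
```

Both estimates land in the range of roughly 280 to 300 nodes. The two do not agree exactly, which is expected given that every input is quoted as approximate.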
 

Using the Burst Buffer

Stage 1 of the DataWarp software provides an API for use of the Burst Buffer. Users will normally interface with this API via the batch system (SLURM) to define the Burst Buffer allocation, including its size and access mode (striping), and to specify whether the reservation should be persistent. The SLURM interface can also stage data into and out of the reservation. Full details and examples of suitable job scripts for the SLURM batch system (as the system is currently configured) are available here.
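
To give a sense of what such a job script looks like, the sketch below assembles one as text. It is a hedged illustration: the `#DW jobdw`, `#DW stage_in`, and `#DW stage_out` lines follow the DataWarp directive syntax as commonly documented, but the helper function, the paths, the application name, and the SLURM options are hypothetical. Site-specific options (partition, node type, account) are omitted, so the linked job script documentation remains the authoritative reference for the current configuration.

```python
def burst_buffer_job_script(capacity, stage_in=None, stage_out=None,
                            access_mode="striped",
                            command="srun ./my_app"):
    """Build the text of a batch script that requests a per-job Burst Buffer.

    stage_in and stage_out are optional (source, destination) pairs.
    All names and paths are illustrative placeholders.
    """
    lines = [
        "#!/bin/bash",
        "#SBATCH --nodes=1",
        "#SBATCH --time=00:10:00",
        # Request a scratch allocation that lasts for the duration of the job.
        f"#DW jobdw capacity={capacity} access_mode={access_mode} type=scratch",
    ]
    if stage_in is not None:
        source, destination = stage_in
        # Copy input data into the allocation before the job starts.
        lines.append(
            f"#DW stage_in source={source} destination={destination} type=file"
        )
    if stage_out is not None:
        source, destination = stage_out
        # Copy results back to the parallel file system after the job ends.
        lines.append(
            f"#DW stage_out source={source} destination={destination} type=file"
        )
    lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = burst_buffer_job_script(
        "10GB",
        # Hypothetical paths; $DW_JOB_STRIPED is expanded at run time, not here.
        stage_in=("/path/to/input.dat", "$DW_JOB_STRIPED/input.dat"),
        stage_out=("$DW_JOB_STRIPED/output.dat", "/path/to/output.dat"),
    )
    print(script)
```

Generating the script from a function keeps the capacity and staging paths in one place when many similar jobs are submitted. A hand-written script containing the same directive lines works equally well.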

A number of tutorials for the Burst Buffer are available. See, for example, the NUG Monthly Telecon tutorial and the Data Day tutorial (video, slides).