
Designing an All-Flash File System

NERSC's Perlmutter system will feature a 30 PB all-NVMe Lustre file system capable of over 4 TB/sec write bandwidth.  Because this file system will be the first all-NVMe file system deployed at this scale, extensive quantitative analysis was undertaken by NERSC to determine

  1. Will 30 PB of capacity be enough for a system of Perlmutter's capability?
  2. What is the best SSD endurance rating to balance cost against system longevity over its five-year service life? (A back-of-the-envelope sketch follows this list.)
  3. How should advanced Lustre features such as Data-on-MDT be incorporated into the system design?
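
To make question 2 concrete, the required endurance rating follows from simple arithmetic once an average daily write volume is assumed.  The Python sketch below is a minimal back-of-the-envelope calculation; the daily traffic and write amplification figures are illustrative placeholders, not results from the NERSC analysis.

```python
# Back-of-the-envelope SSD endurance estimate.  All workload numbers
# below are illustrative placeholders, NOT figures from the NERSC study.

CAPACITY_PB = 30.0           # usable capacity of the file system
SERVICE_LIFE_YEARS = 5.0     # Perlmutter's planned service life

DAILY_WRITES_PB = 1.0        # hypothetical average daily write traffic
WRITE_AMPLIFICATION = 1.5    # hypothetical allowance for FS/FTL overheads

# Drive Writes Per Day (DWPD): how many times the full capacity is
# written each day, the unit in which SSD endurance is usually rated.
required_dwpd = WRITE_AMPLIFICATION * DAILY_WRITES_PB / CAPACITY_PB

# Total bytes the flash must absorb over the service life.
lifetime_writes_pb = (WRITE_AMPLIFICATION * DAILY_WRITES_PB
                      * 365.25 * SERVICE_LIFE_YEARS)

print(f"Required endurance: {required_dwpd:.3f} DWPD")
print(f"Lifetime writes:    {lifetime_writes_pb:,.0f} PB")
```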

This analysis relied heavily upon workload data collected over the service life of Cori, and the pytokio library was used to connect the different data sources required for this study.  The analytical methods used and the final system architecture were presented at several venues in 2019.
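
The kind of cross-referencing pytokio enables can be illustrated with a short pandas sketch that aligns two daily-resolution data sources on a common date index.  The file and column names below are hypothetical stand-ins for the published data (in the actual study, pytokio's connector modules ingest the underlying sources), so treat this as a sketch of the approach rather than the analysis code itself.

```python
import pandas as pd

# Align two daily-resolution data sources on a shared date index.  The
# file and column names here are hypothetical stand-ins for the
# published CSVs, not the actual artifact names.
traffic = pd.read_csv("daily_read_write_traffic.csv",
                      parse_dates=["date"], index_col="date")
fullness = pd.read_csv("most_full_ost_daily.csv",
                       parse_dates=["date"], index_col="date")

# Inner join so each row pairs one day's I/O totals with the fullness
# of the most-full OST observed that same day.
combined = traffic.join(fullness, how="inner")

# Example derived quantity: cumulative writes, the input to any
# SSD-wear projection (assumes a "write_pb" column in the traffic CSV).
combined["cumulative_write_pb"] = combined["write_pb"].cumsum()

print(combined.describe())
```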

The final version of the analysis, presented at the HPC-IODC Workshop in Frankfurt, was accompanied by all of the source data and Jupyter Notebooks (doi: 10.5281/zenodo.3261815).  The dataset includes

  • Total read and write traffic to Cori's 30 PB Lustre file system over three years, resolved at 24-hour intervals
  • Statistics about the most-full OST in Cori's 30 PB Lustre file system, sampled daily over one year
  • Distribution of inode sizes measured on Cori's Lustre file system at two points in time (see the sketch after this list)
  • Distribution of directory sizes measured on Cori's Lustre file system at two points in time (not presented in the above analyses)
  • SMART data from all SSDs in Cori's burst buffer after approximately four years in service
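
As an example of how the inode size distribution feeds into question 3 above, the sketch below estimates what fraction of files (and bytes) a candidate Data-on-MDT threshold would place on the MDT.  The sizes here are synthetic lognormal draws used purely for illustration; substituting the published distribution is straightforward.

```python
import numpy as np

# Estimate the fraction of files (and bytes) that a candidate
# Data-on-MDT threshold would place on the MDT.  The sizes below are
# synthetic lognormal draws, NOT the measured Cori distribution.
rng = np.random.default_rng(0)
file_sizes = rng.lognormal(mean=10.0, sigma=3.0, size=100_000)  # bytes

for threshold in (64 * 2**10, 256 * 2**10, 2**20):  # 64 KiB, 256 KiB, 1 MiB
    under = file_sizes <= threshold
    frac_files = under.mean()
    frac_bytes = file_sizes[under].sum() / file_sizes.sum()
    print(f"DoM threshold {threshold // 2**10:>5d} KiB: "
          f"{frac_files:6.1%} of files, {frac_bytes:7.3%} of bytes")
```

Sweeping the threshold this way exposes the underlying trade-off: a larger threshold diverts more small-file I/O away from the OSTs, but consumes correspondingly more MDT capacity.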

Any updates to this dataset and analysis can also be tracked on the Lustre Design Analysis GitHub repository.  Additional data exists within NERSC and may be published to this repository at a later date.