
SC20 Tiered Storage Panel Recap: How Many Tiers Do We Really Need?

March 19, 2021


By Glenn Lockwood, NERSC Advanced Technologies Group

This year at SC, I had the great honor of being invited to lend my perspective to The Diverse Approaches to Tiering HPC Storage panel alongside six luminaries of the storage industry. Although tiered storage is not new, there still isn't agreement on the right approach to take in designing the storage hierarchy within an HPC facility. Bob Murphy and Addison Snell, the masterminds behind this event, did a great job of bringing together panelists to argue many of these different philosophies so attendees could figure out for themselves what approaches made the most sense.

John Russell at HPCwire wrote an excellent summary of the perspectives for those who couldn't attend; in brief, the panelists presented the range of approaches their respective technology solutions take towards tiering. At one end of the spectrum was Wayne Sawdon's view that tiering is how we bridge the divide between traditional on-premise and emerging cloud-based HPC infrastructure. At the other end was Jeff Denworth's position that tiering is old news and it's possible to have a single, flat, all-flash tier at an economical price point by using the latest advances in storage media, network, algorithms, and protocols in innovative new ways.

Coincidentally, I was the only panelist representing a consumer of technology rather than a producer, so I had the luxury of being able to talk about the storage challenges we see at NERSC without having all the answers. Seeing as how my career at NERSC began with work on the prototype for what became Cori's burst buffer, I've lived and breathed the challenges of tiered storage for the last five years, and I took this panel as an opportunity to speak to our challenges as an HPC facility in the hopes that the smart people developing tiering technologies can tailor their designs to suit our needs.

I gave a five-minute position presentation as part of the panel in which I tried to make it clear that NERSC is not a one-size-fits-all sort of HPC facility; there's no singular HPC application that encompasses "the NERSC workload." This contrasts sharply with my experience doing HPC in industry; whereas an oil and gas company, animation studio, or biotech startup may have only a dozen critical workflows to support each year, NERSC often has to support hundreds of different workflows each month.

I like to show this slide when speaking to industry audiences because it quantifies this point:


NERSC's diversity and heavy I/O workload prevent any one-size-fits-all storage solution from solving our users' problems.

 

Because NERSC's workload is so diverse, we cannot work with every single user to adapt their scientific workflow to the storage infrastructure we deploy; instead, we need to deploy storage that allows users to be productive with minimal code modifications while also pushing the envelope on advancing technology. We have the added challenge that NERSC's users perform huge volumes of I/O in ways that defy traditional notions of what HPC workloads look like.

For example, there's a long-held belief that HPC workloads are write-heavy and HPC storage systems are optimized for writes. This is simply untrue at NERSC. The I/O volumes in the slide I showed reflect the total I/O observed on each storage system in 2018; what I didn't show are the other dimensions of storage design:

| Data Source/Sink | Capacity | Performance | R/W Ratio | Annual I/O Volume |
|---|---|---|---|---|
| Burst buffer (NVMe) | 1.8 PB | 1,500 GB/s | 40% reads | 140 PB |
| Lustre (HDD) | 30 PB | 700 GB/s | 60% reads | 850 PB |
| Archive (tape) | > 200 PB | 50 GB/s | 20% reads | 60 PB |
| ESnet (WAN) | infinite | 25 GB/s | 40% egress | 10 PB |
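To make the "not write-heavy" point concrete, here is a minimal sketch in plain Python that splits each storage tier's 2018 annual I/O volume into reads and writes. The volumes and read fractions come straight from the table above; the split itself is just my own arithmetic, shown for illustration.

```python
# Split each tier's 2018 annual I/O volume (from the table above) into reads
# and writes using its reported read fraction.  Simple arithmetic, shown only
# to make the "not write-heavy" point concrete.

tiers = {
    # name: (annual I/O volume in PB, fraction of that volume that was reads)
    "Burst buffer (NVMe)": (140, 0.40),
    "Lustre (HDD)":        (850, 0.60),
    "Archive (tape)":      (60,  0.20),
}

for name, (volume_pb, read_fraction) in tiers.items():
    reads = volume_pb * read_fraction
    writes = volume_pb - reads
    print(f"{name:20s}  reads: {reads:5.0f} PB   writes: {writes:5.0f} PB")
```

In this breakdown, Lustre, our workhorse scratch tier, moved roughly 510 PB of reads against 340 PB of writes in 2018, which is exactly why the "HPC storage is write-dominated" assumption doesn't hold at NERSC.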

This means that NERSC's storage infrastructure needs to (1) be usable without any major workflow changes and (2) accommodate workflows that span the full range of read and write performance and capacity.

Economics prevents us from buying an infinitely fast and capacious storage system, so the net result is NERSC's current storage pyramid, which deploys different tiers with different balances of performance and capacity:


NERSC's four-tier storage hierarchy in 2019-2020. While economical (each tier above costs the same order of magnitude), it requires users to manage their data and metadata much more actively.

 

As I said in my panel presentation, this pretty picture is really the opposite of what we want from a usability standpoint; its existence requires that users track where in this hierarchy their data is over the lifetime of their scientific project. Moving data between tiers does not directly yield scientific insight, yet the hierarchy requires it to continue on that path to insight. In a sense, every minute a user spends thinking about data placement or waiting for it to move between tiers is a minute they're not advancing scientific discovery.

Put simply, tiers decelerate scientific discovery--the exact opposite of what HPC should be doing! This isn't hyperbole either; a study we conducted in 2019 of how data flows throughout NERSC found that an estimated 15-30% of all the I/O transiting our storage systems results from users moving data between storage tiers. That is to say, 15-30% of the I/O at NERSC is completely unnecessary in that it doesn't enable the computations that advance science.
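For a sense of scale, here is a quick back-of-the-envelope sketch that applies that 15-30% range to the annual volumes in the table earlier in this post. The percentages come from the 2019 study and the volumes from the 2018 table, so combining them across years is my own rough approximation, good only to an order of magnitude.

```python
# Rough scale of tier-to-tier data movement: apply the study's 15-30% range to
# the ~1,060 PB/year of total traffic summed from the table above.  This mixes
# a 2019 estimate with 2018 volumes, so treat the result as order-of-magnitude.

total_annual_pb = 140 + 850 + 60 + 10   # burst buffer + Lustre + archive + ESnet
for fraction in (0.15, 0.30):
    print(f"{fraction:.0%} of {total_annual_pb} PB is about {total_annual_pb * fraction:.0f} PB/year")
```

That works out to somewhere between roughly 160 and 320 PB of data movement per year that exists only to shuffle data between tiers.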

This realization that continuing to add new storage tiers would work against our mission to advance science led us to develop a ten-year strategic plan around storage and tiering that we ultimately published as the Storage 2020 report. We concluded that storage technology and economics in 2020 and 2025 should allow us to successively collapse our storage tiers (thereby simplifying users' experiences and improving scientific productivity) without having to blow our storage budget:


NERSC's Storage 2020 vision, along with the 2020 milestone in its execution. The big goal is to provide fewer tiers so our users shed fewer tears--a guiding principle some of us in the tiered storage community call "Lang's Law."

 

And as I showed in the above slide, we're on track with our 2020 plan with two major storage deployments underway this year.

First, NERSC's Community File System was put into production back in January, replacing the old project file system with a new capacity-optimized file system designed for reliability, ease of use, manageability, and data accessibility.  By following our Storage 2020 plan and focusing squarely on capacity and long-term growth of this tier, we were able to stretch our dollars by not paying for any performance we didn't need. We bought as few storage servers (the components that turn hard drive performance into file system performance) as possible and used the money we saved on more hard drives. This allowed us to bring online 64 PB of community storage in 2020, and we have another 64 PB being commissioned as I write this.

Second, we are installing the Perlmutter 35 PB all-NVMe file system, which will fill the role previously occupied by both the scratch and burst buffer tiers and thereby achieve our goal of eliminating a tier by 2020. When I talk about this file system to the public, the comment I often get is something along the lines of "only a DOE Lab could afford that" or "how much did that cost?" And indeed, that was at the crux of one of the questions asked at the panel -- how much more of the total system budget is our goal of collapsing tiers going to cost NERSC? Put differently, how many racks of compute do you have to give up to cover the cost of going all-flash?

Because we had an informed position on where flash prices would be in 2020 from our Storage 2020 effort, we didn't think about going all-flash as a matter of spending more money.  Instead, we asked ourselves if 2020 would be the right time to go all-flash given the constantly falling prices of NVMe, assuming no change in our relative storage budget. To answer this, we developed a simple analytical model that projected how much hot-tier capacity would be "enough" capacity to satisfy the needs of our users.

This work, which was presented during ISC'19 and published later that year, found that 30 PB was our required capacity given the size of Perlmutter, the aggregate I/O demands of NERSC's workloads, and NERSC's purge policy. Using the economic forecast from Storage 2020, we could also estimate what this much flash would cost in 2020, and when we realized it would cost the same (10-15% of our total system budget) that our platform storage has cost for previous generations of NERSC systems, the choice to go all-flash was a no-brainer. It fit our users' capacity needs, eliminated a tier, and it would not force us to give up compute racks. The fact that it will be a screaming-fast file system would be icing on the cake.
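For readers curious about the general shape of that kind of analysis, here is a deliberately simplified sketch. It is not the model we published at ISC'19; the function and all of the example numbers below are hypothetical placeholders, included only to show how ingest rate, purge window, and headroom interact when sizing a hot tier.

```python
# Toy hot-tier capacity model -- NOT the model published at ISC'19, just an
# illustration of the shape of the reasoning.  All numbers are hypothetical
# placeholders, not NERSC measurements.

def required_hot_capacity_pb(daily_write_pb, purge_window_days, fill_target=0.75):
    """Capacity needed so that data written within the purge window fits below
    a target fill level, leaving headroom for hot spots and overhead."""
    return daily_write_pb * purge_window_days / fill_target

# Example: 0.5 PB/day of retained data and a 28-day purge window
print(f"{required_hot_capacity_pb(0.5, 28):.1f} PB")   # ~18.7 PB in this toy case
```

The real analysis folds in the size of Perlmutter, measured I/O demands, and NERSC's actual purge policy, which is how it arrives at the 30 PB figure above.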

This still leaves us with three tiers though, and as our Storage 2020 plan (along with independent work by our colleagues at Los Alamos and Seagate) concluded, we can and should do better.  If you distill storage down into the fundamental ways in which NERSC users interact with it, there are really two different categories of data that users generate and access:


The ideal storage hierarchy has two tiers: one for data you are actively using, and one for data you want to keep around just in case it's useful later. Independent work by our colleagues at Los Alamos National Laboratory and Seagate arrived at the same conclusion.

 

It follows that there really only need to be two tiers of storage (a minimal sketch of this two-tier split follows the list below):

  1. Fast, high-performance storage that contains data that users are actively using for computation. This tier will contain all the data that a computational job may need to access (starting conditions, checkpoint files, plot files) and must support a standard file interface in addition to any higher-performance object or key-value interfaces, and its capacity is a function of how long users' scientific projects take to complete.
  2. High-capacity, searchable, and shareable storage where users can keep data just in case they need it in the future. This tier will contain all the data that they are no longer actively using for computation but may have high value to science in the future--think irreplaceable data sets (e.g., observations of rare phenomena, like supernovae) or reference copies that benefit many communities (the results of an expensive, high-resolution climate model). Because data is not being actively accessed by applications from this tier, an object interface (putting and getting giant blobs of data instead of reading out parts of files) works well here.
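As a purely illustrative sketch, and not any actual NERSC policy engine, the two-tier model described above can be captured in a few lines. The tier attributes paraphrase the list items; the placement function and its argument are hypothetical placeholders.

```python
# Minimal sketch of the two-tier model described in the list above.
# Purely illustrative; the placement rule is a hypothetical placeholder.

from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    interface: str   # primary access interface
    holds: str       # what kind of data lives here

HOT  = Tier("hot",  "POSIX file (plus object/key-value)",
            "data that jobs are actively computing on")
COLD = Tier("cold", "object (put/get whole blobs)",
            "data kept just in case it proves valuable later")

def place(actively_used_by_jobs: bool) -> Tier:
    """Hypothetical placement rule: active data goes hot, everything else cold."""
    return HOT if actively_used_by_jobs else COLD

print(place(True).name, "/", place(False).name)   # hot / cold
```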

These logical uses of storage don't necessarily line up with the most economical physical storage media available in 2025 though, so this doesn't mean there will be a single "all-flash" tier and an "all-tape" tier. Instead, we'll have to rely increasingly on intelligence baked into software to combine multiple media types into a single logical storage system that meets the requirements of the tier.

For the cold tier, this is actually very straightforward; hierarchical storage management systems like HPSS have been doing this for decades by mixing an extreme-capacity tape library with a hard disk frontend in ratios that typically vary between 20:1 and 50:1. The hot tier is quite a bit more challenging because performance (the top criterion for this tier) will be sensitive to what media your data actually lives on. Fortunately, we're beginning to see hybrid hot tiers that combine a small amount of extreme-performance media (like persistent memory) with a huge quantity of less-expensive nonvolatile storage (like QLC flash) such that the aggregate performance of both media types comes out the same.
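To put those cold-tier ratios in context, here is a quick sketch that applies the 20:1 and 50:1 tape-to-disk ratios mentioned above to an archive around the size of NERSC's (> 200 PB). The ratios and archive size come from this post; treating them as a simple division is my own illustration.

```python
# Sizing a disk frontend for a ~200 PB tape archive at the tape:disk capacity
# ratios mentioned above (typically 20:1 to 50:1).  Simple division, shown only
# to give a feel for the scale of disk cache an HSM system would carry.

archive_pb = 200
for ratio in (20, 50):
    print(f"{ratio}:1  ->  ~{archive_pb / ratio:.0f} PB of disk in front of the tape library")
```

In other words, an archive at that scale implies somewhere between roughly 4 and 10 PB of spinning disk sitting in front of the tape, managed transparently by the HSM software rather than by users.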

Time will tell if we will be able to hit our 2025 target of having only two tiers as we have with our 2020 target. The good news is that solid-state storage media prices continue to fall while the commercial AI industry is putting a lot more momentum behind solving the same tiered storage challenges as we are. However, because our vision for 2025 relies on smarter storage, it's clear that we will have to increase our investment in storage software and data management techniques; advancements in the storage hardware and software technologies we use will get us halfway there, but as I said before -- one-size-fits-all solutions don't work at NERSC. So we expect exciting times ahead as we continue to balance all the factors needed to get us to this simpler and better storage hierarchy.

Glenn Lockwood is a storage architect in NERSC’s Advanced Technologies Group who specializes in I/O performance analysis, extreme-scale storage architectures, and emerging I/O technologies. NERSC is a DOE Office of Science user facility.