Big Data = Big Storage Challenges
Among DOE supercomputing facilities, NERSC is at the forefront of data management and storage innovations
October 11, 2014
Contact: Kathy Kincade, +1 510 495 2124
100 petabytes. That’s how much data is currently stored at the Department of Energy’s (DOE) National Energy Research Scientific Computing Center (NERSC). And it’s increasing by 40 percent to 60 percent annually.
That’s a far cry from NERSC’s early days. When NERSC first opened its doors in 1974, files were typically measured in megabytes. In 1976, the center—then located at Lawrence Livermore National Laboratory (LLNL) and still known as the National Magnetic Fusion Energy Computer Center (NMFECC)—could store a whopping 19,200 megabytes of data, primarily on online disks and nine-track tapes.
But back then, users’ computing needs, and the hardware and software available to meet them, were quite different too. For example, managing data required staffers to move around behind the scenes. When a user filed a request, an operator would retrieve the tape from a rack and load it, then notify the user that the data were available. In some cases, the tapes were stored in a separate building at LLNL and were often picked up by a staffer riding a bicycle! Fortunately, delivery of the Automated Tape Library in 1979 changed this practice by allowing hands-off access.
The software used to manage the data back then offered its own set of challenges, according to Keith Fitzgerald, who first worked at the center as a contractor helping to maintain the first supercomputers at NERSC (then NMFECC) and went on to lead NERSC’s Mass Storage Group.
“When NERSC was first installed, there were no off-the-shelf storage systems, not even backup systems like are common today,” Fitzgerald said. “Jean Schuler, who still works at LLNL, wrote the original storage system, called PackRat. It was a primitive system, more like a utility that would archive something and be able to give it back, but it had online disks and tapes and backup.”
PackRat quickly evolved into a homegrown system called FILEM that was used for storing codes, data or any information the user wanted saved longer than 24 hours. FILEM allowed either permanent or temporary storage, privacy, the ability to share files with other users and the ability to group files under directories.
Initially all files were stored on disk. A user would make a “write” request from the Control Data Corp. 7600 supercomputer and the FILEM program would acquire disk space to store the file in, move the data, verify it moved correctly and update the directory so the file could be retrieved later. Files were migrated to tapes from disk as increased user demands led to insufficient disk space.
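The write sequence described above (acquire space, move the data, verify the move, update the directory) can be sketched in modern terms. This is only an illustration in Python, not FILEM's actual code: the function name store_file, the in-memory directory and the use of a SHA-256 checksum for verification are all assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def store_file(source: Path, archive_dir: Path, directory: dict) -> Path:
    """Copy source into archive_dir, verify the copy, then record it.

    Hypothetical helper: the directory entry is written last, so a file
    only becomes retrievable after its copy has been verified.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)   # "acquire" space
    target = archive_dir / source.name
    shutil.copyfile(source, target)                  # move the data
    if checksum(source) != checksum(target):         # verify the move
        target.unlink()
        raise IOError(f"verification failed for {source.name}")
    directory[source.name] = target                  # update the directory
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "results.dat"
        src.write_bytes(b"simulation output")
        catalog = {}
        stored = store_file(src, Path(tmp) / "archive", catalog)
        print(stored.read_bytes() == src.read_bytes())
```

Updating the directory only after verification means a failed copy never appears as a retrievable file.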
“One of the first things we did at LLNL was the allocation system,” Fitzgerald said. “Users got allocations for both computing and storage, and they could move the allocation from one to the other as they decided whether they wanted to compute or store results. It was their choice, not management’s, and they had to make a conscious decision every time they asked for an allocation.”
While FILEM offered some challenges in terms of stability, NERSC relied on it for about a decade, according to Fitzgerald.
“Eventually we looked for a more off-the-shelf solution and converted to the common file system (CFS) storage system that Los Alamos National Laboratory was using,” he said. “It was IBM-based, cost effective, powerful and supported by somebody else.”
National Storage Laboratory
In the early 1990s, as the amount of data being processed and archived at NERSC continued to grow, NERSC and LLNL were part of the DOE’s National Storage Laboratory (NSL) testbed project. NSL was based on a product called UniTree, according to Fitzgerald, and as the project progressed it began running alongside CFS. It eventually evolved into the High Performance Storage System (HPSS) and migrated with NERSC when the center moved to Berkeley Lab in 1996. Today HPSS continues to serve as the center’s largest data repository.
HPSS is a hierarchical storage management (HSM) software system that enables all user data to be ingested onto high performance disk arrays and automatically migrated to a very large enterprise tape subsystem for long-term retention. The disk cache in HPSS is designed to retain five days’ worth of new data, while the tape subsystem is designed to provide the most cost-effective, long-term, scalable data storage available.
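The disk-then-tape flow described above can be illustrated with a toy model. The sketch below is not HPSS code or its API. The class TieredStore and its methods are hypothetical, and the rule that a cache copy is purged only once a tape copy exists is an assumption made for the example. Only the five-day retention figure comes from the article.

```python
from dataclasses import dataclass, field

CACHE_RETENTION_DAYS = 5  # from the article: cache holds about five days of new data


@dataclass
class TieredStore:
    """Toy two-tier store: a disk cache in front of a tape archive."""
    disk: dict = field(default_factory=dict)   # name -> day written
    tape: set = field(default_factory=set)     # names with a tape copy

    def ingest(self, name: str, day: int) -> None:
        """New data always lands on the disk cache first."""
        self.disk[name] = day

    def migrate(self) -> None:
        """Copy every cached file to tape (the cache copy is kept)."""
        self.tape.update(self.disk)

    def purge(self, today: int) -> None:
        """Drop cache copies older than the retention window,
        but only if a tape copy already exists (an assumed safety rule)."""
        for name, day in list(self.disk.items()):
            if today - day >= CACHE_RETENTION_DAYS and name in self.tape:
                del self.disk[name]

    def locate(self, name: str) -> str:
        """Report the fastest tier that currently holds the file."""
        if name in self.disk:
            return "disk"
        if name in self.tape:
            return "tape"
        raise KeyError(name)


store = TieredStore()
store.ingest("run1.h5", day=0)
store.ingest("run2.h5", day=4)
store.migrate()
store.purge(today=6)
print(store.locate("run1.h5"))  # tape
print(store.locate("run2.h5"))  # disk
```

In this model, recent files are served from disk while older ones remain available from tape, which is the basic trade-off a hierarchical storage manager makes between speed and cost.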
At present, HPSS on tape (including backup and archive systems) totals over 68 petabytes of data and is growing at about 60 percent annually, according to Jason Hick, who leads NERSC’s Storage Systems Group. Other storage resources at NERSC include the NERSC Global Filesystem (NGF), which totals over 13 petabytes on disk and is growing at about 40 percent annually; and local scratch, which totals over 9 petabytes on disk but doesn’t grow because it is regularly purged.
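To put those growth rates in perspective, the short Python sketch below compounds the quoted figures forward. It assumes the annual rates stay constant, which the article does not claim, so the results are rough illustrations rather than NERSC forecasts. The function names are invented for the example.

```python
import math


def project(size_pb: float, annual_growth: float, years: int) -> float:
    """Size after `years` of compound growth at a constant annual rate."""
    return size_pb * (1 + annual_growth) ** years


def doubling_time(annual_growth: float) -> float:
    """Years needed for the stored volume to double at a constant rate."""
    return math.log(2) / math.log(1 + annual_growth)


# Starting sizes (petabytes) and rates as quoted in the article.
systems = {"HPSS tape": (68, 0.60), "NGF disk": (13, 0.40)}

for name, (size_pb, rate) in systems.items():
    print(f"{name}: doubles in about {doubling_time(rate):.1f} years, "
          f"roughly {project(size_pb, rate, 5):.0f} PB after 5 years")
```

At 60 percent annual growth the archive would double in about 1.5 years. At 40 percent, doubling takes about 2.1 years.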
Hick spends much of his time thinking about how NERSC will continue to stay abreast of users’ data storage needs over the next five to 10 years.
“I think we have the hardware, bandwidth and capacity part of this nailed,” Hick said. “We can project the hardware demands really well. But software is the weakest link. It’s not keeping up with user needs and demands, and it never has. And users are pressing us on usability. Looking ahead, that is our biggest challenge: usability of storage.”
More Data from More Places
For most of its first 40 years, NERSC was an exporter of data as scientists ran large-scale simulations and then moved that data to other sites for analysis. But with the growth of experimental data coming from facilities all over the U.S. and other countries, NERSC has now become a net data importer, taking in a petabyte of data each month for storage, analysis and sharing in fields ranging from bioscience and environmental studies to cosmology and high-energy physics (HEP).
While HEP research has historically accounted for the majority of storage needs at NERSC, the last few years have seen a shift toward other research fields, according to Hick.
“In the last five years, the climate community has in some cases surpassed HEP in terms of storage need,” he said. “The climate guys suddenly started taking 50 percent of our storage and this wasn’t forecasted well. We were like ‘wow, these folks really do have a need.’”
That need included helping NERSC’s users in the climate community overcome some unique workflow challenges, Hick added.
“With HEP, it was mostly a hardware challenge; they wrote their own programs and we just needed to provide the hardware,” Hick said. “But the climate guys had a workflow problem. They are a worldwide community and immediately presented the challenge of sharing data across continents and facilities that don’t normally like talking to each other because of security and firewalls. Large amounts of data were being exchanged as a matter of routine. That wasn’t the case with HEP—they were more about local bandwidth and scale and capacity.”
The influx of new experimental facilities, such as the Advanced Light Source and Joint Genome Institute at Berkeley Lab and the Linac Coherent Light Source at SLAC, is also driving this “data deluge.”
“With the experimental facilities, not only do they want to ingest data and move it from one place to another, they want to do it in real time,” Hick said. “They are no longer just analyzing a data set, they are using that data in real time to adjust their experiments as they go.”
This is where NERSC’s expertise comes into play, he added. NERSC provides some of the largest open computing and storage systems available to the global scientific community and continually evolves its systems to ensure that users are never presented with an entirely new system at any one time.
“Instead, we provide constant stewardship and expansion of our systems,” Hick said.
Data Management Strategy
In addition, NERSC has long had a data management policy that helps ensure it is well prepared to meet users’ anticipated needs as they evolve. And its policy is unique among DOE supercomputing centers.
“We recognize that to delete data or cause a user to remove that data is disruptive,” Hick said. “We are a user facility, so if anyone should be trying to keep up with demand it ought to be us. It’s complicated because it drives what we do year to year—how we design our system, spend our budget, make budget recommendations, provisions for storage, the number of devices, the quality of the devices, all kinds of things. And we try to design business practices around the policy to keep up.”
For example, NERSC has developed a sponsored storage model that allows people to take advantage of the existing solution and buy into it in order to gain access to larger amounts of storage space. It will be introduced in the upcoming fiscal year. NERSC is also working on new storage allocation options, he added.
“A storage resource is what we are after,” he said. “The idea is to make it ‘fungible,’ to offer flexibility in the types of storage users can have. A user might want disk storage, tape storage, flash storage, scratch storage or some combination thereof. And we are going to allow them to make tradeoffs. We will give them a price list and say ok, you’ve got x amount of ‘NERSC dollars’ to spend on storage, how do you want to spend it? It’s a more capitalistic approach that better helps us design the systems and scale them to what our users need.”
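The price-list idea Hick describes can be made concrete with a small example. The Python sketch below is purely illustrative. The article gives no actual prices, so the per-terabyte figures, the budget and the function names are all placeholders.

```python
# Hypothetical price list in "NERSC dollars" per terabyte; the article
# gives no actual prices, so these numbers are placeholders.
PRICE_PER_TB = {"tape": 1, "disk": 4, "scratch": 6, "flash": 20}


def cost(request: dict) -> int:
    """Total cost of a request mapping storage type -> terabytes."""
    return sum(PRICE_PER_TB[kind] * tb for kind, tb in request.items())


def fits_budget(request: dict, budget: int) -> bool:
    """True if the requested mix can be bought with the given budget."""
    return cost(request) <= budget


budget = 1000
archive_heavy = {"tape": 600, "disk": 100}   # 600*1 + 100*4 = 1000
speed_heavy = {"flash": 40, "disk": 100}     # 40*20 + 100*4 = 1200

print(fits_budget(archive_heavy, budget))  # True
print(fits_budget(speed_heavy, budget))    # False
```

With the same budget, a user could buy a large amount of slow storage or a small amount of fast storage, which is the kind of tradeoff the quote describes.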
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 7,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.