NERSC Develops Archiving Strategies for Genome Researchers

October 1, 2005

When researchers at the Production Genome Facility at DOE’s Joint Genome Institute found they were generating data faster than they could find somewhere to store the files, let alone make them easily accessible for analysis, a collaboration with NERSC’s Mass Storage Group developed strategies for improving the reliability of storing the data while also making retrieval easier.

DOE’s Joint Genome Institute (JGI) is one of the world’s leading facilities in the scientific quest to unravel the genetic data that make up living things. With advances in automatically sequencing genomic information, scientists at the JGI’s Production Genome Facility (PGF) found themselves overrun with sequence data, as their production capacity had grown so rapidly that data had overflowed the existing storage capacity. Since the resulting data are used by researchers around the world, reliable archiving and easy retrieval of the data are key issues.

As one of the world’s largest public DNA sequencing facilities, the PGF produces 2 million files per month of trace data (25 to 100 KB each), 100 assembled projects per month (50 MB to 250 MB), and several very large assembled projects per year (~50 GB). In aggregate, this averages about 2000 GB per month.

In addition to the amount of data, a major challenge is the way the data are produced. Data from the sequencing of many different organisms are produced in parallel each day, such that a daily “archive” spreads the data for a particular organism over many tapes.

DNA sequences are considered the fundamental building blocks for the rapidly expanding field of genomics. Constructing a genomic sequence is an iterative process. The trace fragments are assembled, and then the sequence is refined by comparing it with other sequences to confirm the assembly. Once the sequence is assembled, information about its function is gleaned by comparing and contrasting the sequence with other sequences from both the same organism and other organisms. Current sequencing methods generate a large volume of trace files that have to be managed — typically 100,000 files or more. And to check for errors in the sequence or make detailed comparisons with other sequences, researchers often need to refer back to these traces. Unfortunately, these traces are usually provided as a group of files with no information as to where the traces occur in the sequence, making the researcher’s job more difficult.

This problem was compounded by the PGF’s lack of sufficient online storage, which made organization (and subsequent retrieval) of the data difficult and led to unnecessary replication of files. This situation required significant staff time to move files and reorganize file systems to find sufficient space for ongoing production needs; and it required auxiliary tape storage that was not particularly reliable.

Enter NERSC’s Archiving Expertise

Staff from NERSC’s Mass Storage Group and the PGF agreed to work together to address two key issues facing the genome researchers. The most immediate goal was for NERSC’s High Performance Storage System to become the archive for the JGI data, replacing the less-reliable local tape operation and freeing up disk space at the PGF for more immediate production needs. The second goal was to collaborate with JGI to improve the data handling capabilities of the genome sequencing and data distribution processes.

NERSC storage systems are robust and available 24 hours a day, seven days a week, as well as highly scalable and configurable. NERSC has high-quality, high-bandwidth connectivity to the other DOE laboratories and major universities provided by ESnet.

Most of the low-level data produced by the PGF are now routinely archived at NERSC, with ~50 GB of raw trace data being transferred from JGI to NERSC each night.

The techniques used in developing the archiving system allow it to be scaled up over time as the amount of data continues to increase; up to billions of files can be handled with these techniques. The data have been aggregated into larger collections, each of which holds tens of thousands of files in a single file in the NERSC storage system. These data can now be accessed as one large file, or each individual file can be accessed without retrieving the whole aggregate.
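The article does not say which aggregate format or tools were used, so the following Python sketch is only an illustration of the general idea: many small files are packed into one blob with an index of offsets, so that a single member can be read with a seek instead of retrieving the whole aggregate. The layout (an 8-byte index length, a JSON index, then the payloads), the function names, and the trace file names are all hypothetical.

```python
import io
import json
import struct


def build_aggregate(members):
    """Pack {name: bytes} into one blob.

    Hypothetical layout: [8-byte big-endian index length][JSON index][payloads].
    The index maps each name to (offset, length) within the payload area.
    """
    index = {}
    payload = io.BytesIO()
    for name, data in members.items():
        index[name] = (payload.tell(), len(data))
        payload.write(data)
    index_bytes = json.dumps(index).encode("utf-8")
    return struct.pack(">Q", len(index_bytes)) + index_bytes + payload.getvalue()


def read_index(stream):
    """Return (index, payload_start) by reading only the header of the aggregate."""
    stream.seek(0)
    (index_len,) = struct.unpack(">Q", stream.read(8))
    index = json.loads(stream.read(index_len).decode("utf-8"))
    return index, 8 + index_len


def read_member(stream, name):
    """Read one member by seeking to its offset; other payloads are not read."""
    index, payload_start = read_index(stream)
    offset, length = index[name]
    stream.seek(payload_start + offset)
    return stream.read(length)


# Illustrative data only: made-up trace names and contents.
traces = {
    "trace_0001.scf": b"ACGT" * 4,
    "trace_0002.scf": b"TTGA" * 2,
}
archive = io.BytesIO(build_aggregate(traces))
second_trace = read_member(archive, "trace_0002.scf")
```

With an index like this, the aggregate can still be moved or stored as one large file, while a researcher who needs a single trace reads only the header and that member's bytes.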

The new techniques will not only handle future data; they have already proved their worth. When PGF staff discovered that raw data had previously been processed by software that had an undetected “bug,” they were able to retrieve the raw data from NERSC and reprocess it in about 1½ months, rather than go back to the sequencing machines and produce the data all over again, which would have taken about six months. In addition to saving time, this also saved money: a rough estimate is that the original data collection comprised up to 100,000 files/day at a cost of $1 per file, which added up to $1.2 million for processing six months’ worth of data. Comparing this figure to the cost of a month and a half of staff time, the estimated savings are about $1 million, and the end result is a more reliable archive.

About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, NERSC serves almost 10,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.