NERSC and JGI Join Forces to Tackle Genomics HPC
April 19, 2010
A torrent of data has been flowing from the advanced sequencing platforms at the Department of Energy Joint Genome Institute (JGI), among the world's leading generators of DNA sequence information for bioenergy and environmental applications. Last year, JGI generated over one trillion nucleotide letters of genetic code for its various user programs, an eight-fold increase in productivity from 2008. This year JGI expects to sequence five times more data than the previous year, producing more than a petabyte of data.
To ensure that there is a robust computational infrastructure for managing, storing and gleaning scientific insights from this ever-growing flood of data, JGI is joining forces with the National Energy Research Scientific Computing Division (NERSC) at the Lawrence Berkeley National Laboratory (Berkeley Lab), which serves more than 3,500 science users annually who are researching problems in a variety of disciplines from combustion to climate.
The NERSC Division will perform this work on a cost-recovery basis similar to the way it manages and supports the Parallel Distributed Systems Facility (PDSF) for High Energy and Nuclear Physics research. Computing systems will be split between JGI's campus in Walnut Creek, Calif., and NERSC's Oakland Scientific Facility, which are 20 miles apart. The NERSC Division will also manage JGI's six-person systems staff to integrate, operate and support these systems.
"We evaluated a wide variety of options and after a thorough review, it made perfect sense to partner with NERSC," said Vito Mangiardi, JGI's Deputy Director for Business Operations & Production. "With a successful track-record in providing these kinds of services they don't have the steep learning curve to climb."
"This is a great partnership because data-centric computing is an important future direction for NERSC, and genomics is seeing exponential increases in computation and storage," says NERSC Division Director Kathy Yelick. "The computing requirements for the genomics community are quite different from NERSC's more traditional workload. The science is heavy in data analytics and JGI runs web portals providing access to genomic information."
"These are critically important assets that we can now bring to bear on some of the most complex questions in biology," said JGI Director Eddy Rubin. "We really need the massive computation 'horsepower' and supporting infrastructure that NERSC offers to help us advance our understanding of the carbon cycle and many of the other biogeochemical processes in which microbes play a starring role and that the JGI is characterizing in a massively parallel fashion."
Data is currently flowing between JGI and NERSC over the Energy Sciences Network's Science Data Network (ESnet's SDN). The SDN provides circuit-oriented services to enable the rapid movement of massive scientific data sets, and to support innovative computing architectures such as those being deployed at NERSC in support of JGI science. SDN provides native access by the computational resources at NERSC to data stored at JGI, such as those generated by the Great Prairie Soil Metagenomes Project that JGI is piloting for the DOE Grand Challenge Program.
History of Genomics Computing Collaborations
Although the partnership between the NERSC Division and JGI became official only recently, the two centers, both managed by Berkeley Lab, have increasingly collaborated to develop solutions to the genomics community's growing demand for computational power and storage.
Five years ago, when researchers at the Production Genome Facility at JGI found that they were generating data so fast they couldn't find anywhere to store the files, they collaborated with mass storage systems staff at NERSC to archive genomics data on the center's High Performance Storage System (HPSS). The NERSC staff also helped the researchers develop strategies for making the retrieval of this data easier.
In the summer of 2009, Shane Canon, who heads the Technology Integration Group at NERSC, began engaging in discussions with Victor Markowitz, who heads the Berkeley Lab's Biological Data Management and Technology Center (BDMT), and several JGI staff members to identify the near- and future-term computing challenges facing the genomics community and to evaluate potential solutions.
"The amount of data being generated by the advanced sequencing platforms at JGI was quickly outpacing their computing capabilities, so we wanted to see if NERSC's supercomputers could be leveraged to meet their needs. One of the first challenges we took on was getting BLAST, a popular and computationally intensive bioinformatics tool, to run on NERSC';s Cray XT4 'Franklin' system," says Canon, who notes that researchers have run genomics codes on NERSC systems before with limited success.
"Early in our discussions, we learned that genomics researchers tend to run a lot of serial jobs, but systems like Franklin are optimized to run large parallel jobs, so we worked with them to develop a framework that ran several serial jobs within a parallel job," he adds.
Preliminary runs showed that the framework scaled BLAST-based IMG/M jobs to the entire Franklin system, containing over 32,000 cores, with essentially linear speedup. Since then, the framework has also been adapted to run other genomics applications with similar success.
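The article does not describe how the framework was built, so the following is only a minimal sketch of the general idea of packing many independent serial tasks into one parallel job. It assumes a fixed number of workers and a simple round-robin split of the work. A thread pool stands in for the compute cores of a real parallel allocation, and `run_serial_task`, `partition`, `worker` and `run_as_one_parallel_job` are hypothetical names for illustration, not part of the NERSC framework or of BLAST.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(tasks, n_workers):
    """Deal tasks round-robin so each worker gets a near-equal share."""
    return [tasks[i::n_workers] for i in range(n_workers)]


def run_serial_task(query):
    # Hypothetical stand-in for one serial run (for example, one BLAST query).
    # Here it just returns the query together with its length.
    return (query, len(query))


def worker(share):
    # Each worker handles its share one task at a time, as a serial job would.
    return [run_serial_task(q) for q in share]


def run_as_one_parallel_job(tasks, n_workers):
    """Run all serial tasks inside a single pool of workers."""
    shares = partition(tasks, n_workers)
    # Assumption: a thread pool stands in for the cores of one parallel job.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_worker = list(pool.map(worker, shares))
    # Flatten the per-worker result lists into one list.
    return [result for results in per_worker for result in results]


queries = ["ACGT", "GATTACA", "TTAGGC", "CCG", "AAGCTTGG"]
results = run_as_one_parallel_job(queries, n_workers=3)
```

Because the tasks are independent, no communication is needed between workers while they run, which is the property that lets this kind of workload scale close to linearly as workers are added.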
In March 2010, JGI also became an early user of NERSC's Magellan Cloud Computing System when the center had a sudden need for increased computing resources. In less than three days, NERSC and JGI staff provisioned and configured hundreds of processor cores on Magellan to match the computing environment available on JGI's local compute clusters. At the same time, staff at both centers collaborated with ESnet network engineers to deploy a dedicated 9 Gbps virtual circuit between Magellan and JGI over SDN within 24 hours.
This strategy gives JGI researchers around the world increased computational capacity without any change to their software or workflow. JGI users still log on to the Institute's network and submit scientific computing jobs to its batch queues, which are managed by hardware located in Walnut Creek. Once the jobs reach the front of the queue, the information travels 20 miles on reserved SDN bandwidth, directly to NERSC's Magellan system in Oakland. After a job has finished, the results are sent back to Walnut Creek on the SDN within milliseconds to be saved on filesystems at JGI.
"What makes this use of cloud computing so attractive is that JGI users do not notice a difference between computing on Magellan, which is 20 miles away at NERSC, or on JGI's computing environment in Walnut Creek," says Jeff Broughton, who heads NERSC's Systems Department.
For more information about NERSC and JGI collaborations, please read "NERSC's Archiving Strategies Help JGI Genome Researchers Store and Sort Billions of Data Files."
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 7,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy. Learn more about computing sciences at Berkeley Lab.