NERSC: Powering Scientific Discovery Since 1974

Joint Genome Institute (JGI)

Key Challenges: Providing a robust computational infrastructure for managing, storing, and gleaning scientific insights from the torrent of data that flows constantly from the advanced sequencing platforms at the Department of Energy Joint Genome Institute (JGI). JGI sequencing capacity exceeds 40 billion DNA base pairs per year and is growing faster than computer hardware is improving, so the computation and storage required are increasing exponentially. JGI will generate about 1 petabyte of data in its first year as a NERSC partner; this is expected to double each year.
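
To make the doubling estimate concrete, the following is a minimal sketch of the arithmetic, not a JGI or NERSC planning tool. It assumes that "double each year" refers to the volume generated per year, starting from 1 petabyte in year one; the function name, the five-year horizon, and the cumulative total are illustrative assumptions and do not come from the text above.

```python
def projected_volume_pb(years, first_year_pb=1.0, growth=2.0):
    """Project yearly and cumulative data volume in petabytes.

    Assumption (not stated in the source): the amount generated per year
    is multiplied by `growth` every year, starting from `first_year_pb`.
    Returns a list of (year, generated_that_year_pb, cumulative_pb) tuples.
    """
    rows = []
    cumulative = 0.0
    annual = first_year_pb
    for year in range(1, years + 1):
        cumulative += annual              # add this year's output to the total
        rows.append((year, annual, cumulative))
        annual *= growth                  # next year's output under the growth assumption
    return rows


if __name__ == "__main__":
    # Hypothetical five-year horizon, chosen only for illustration.
    for year, annual, total in projected_volume_pb(5):
        print(f"year {year}: {annual:g} PB generated, {total:g} PB cumulative")
```

Under these assumptions, the cumulative volume after n years is 2^n - 1 petabytes, so the storage requirement roughly doubles each year along with the yearly output.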

Why it Matters: JGI is the primary production sequencing facility for the DOE. By revealing the genetic blueprint and fundamental principles that control biological systems, JGI's work will help scientists throughout the DOE community better carry out DOE missions in energy security, climate protection, and environmental remediation. A significant portion of JGI's projects are related to bioenergy and focus on three areas: developing plant feedstocks; using microbes to break down cellulose in plant cell walls; and fermenting sugars into biofuels. NERSC systems are used to assemble and comparatively analyze genomes from plants, fungi, microbes, and metagenome communities.

Accomplishments: NERSC resources were critical to completing two quarterly releases of the Integrated Microbial Genomes (IMG) system. The metagenome pipeline has been ported to run on Franklin, Carver, and Hopper, and NERSC systems were used to develop frameworks for running additional pipelines and analyses. (This increased reliance on NERSC resources is similar to that of other major experimental user facilities such as RHIC and ATLAS.) Genomics is also playing an important role in NERSC's cloud research: scientists deployed a MapReduce cluster running Hadoop that allowed JGI to remove errors from 5 billion reads of “next generation” sequence data (the Rumen HiSeq dataset).

Investigators: Edward Rubin, Victor Markowitz, Shane Canon, Jeremy Brand (LBNL)

More Information: See the JGI web site