NERSC: Powering Scientific Discovery Since 1974

Victor Markowitz

Case Study Worksheet

Project Information - Microbial Genome and Metagenome Data Analysis

Document Prepared By Victor Markowitz
Project Title Microbial Genome and Metagenome Data Analysis
Principal Investigator Victor Markowitz
Participating Organizations Joint Genome Institute, Lawrence Berkeley National Lab
Science Category Climate Environmental Science Biological Sciences
Funding Agencies DOE SC DOE NSA NSF NOAA NIH Other:

Project Summary (Scientific Objectives)

Please give a brief description of your project and its scientific objectives for the next 3-5 years.

Maintenance of the IMG (microbial genomes) and IMG/M (microbial metagenomes) data management and analysis systems. The main metric for system size is the number of putative genes: IMG currently includes 5.5 million genes, while IMG/M includes 13 million genes. The life cycle for both systems involves adding newly published genomes and metagenomes every 4 months, and adding new unpublished genomes and metagenomes for ongoing studies every month. System updates involve computing pairwise gene similarities and classifying (clustering) genes based on sequence similarities and/or functional annotations.
 
The main challenge for the next 3 years is the rapid growth of the number of metagenome datasets and the size (in terms of genes or gene fragments) of these datasets. The latter is expected to grow from an average of tens of thousands of genes to tens of millions of genes. 

Current HPC Usage and Methods

Facilities Used NERSC NCCS ACLF NSF Centers
  • Other: JGI
Architectures Used Cray XT IBM Power BlueGene
  • Linux Cluster
Other:
Total Computational Hours Used per Year 300,000 Core-Hours
NERSC Hours Used per Year 0 Core-Hours
Number of Cores Used in Typical Production Run 232
Wallclock Hours of Single Typical Production Run 340
Total Memory Used per Run 64 GB
Minimum Memory Required per Core 4 GB
Total Data Read & Written per Run 500 GB
Size of Checkpoint File(s) GB
Amount of Data Moved In/Out of NERSC GB
How Often
On-Line File Storage Required (Directly Accessible from a Running Job) 20 GB 65,000,000 Files
Off-Line Archival Storage Required GB Files

Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.

Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)

Running BLAST, both all vs. all and against specific databases, accounts for most of the computation. HMM protein searches are the next major use of the Linux cluster. Clustering is done with the Markov Clustering Algorithm (MCL), CD-HIT, and single-linkage algorithms. The latter is limited more by memory than by CPU. Running BLAST rapidly and efficiently is the major challenge.
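
As an illustration of the single-linkage step mentioned above, the following is a minimal sketch and not the project's actual code. It assumes the pairwise similarities arrive as (gene, gene, percent identity) tuples and uses a union-find structure; the function name, the tuple layout, and the threshold are hypothetical. The per-gene table it keeps in memory hints at why this kind of clustering tends to be bound by memory rather than CPU.

```python
from collections import defaultdict


def single_linkage_clusters(hits, min_identity):
    """Group genes linked by any chain of hits at or above min_identity.

    hits: iterable of (gene_a, gene_b, percent_identity) tuples
          (a hypothetical layout chosen for this sketch).
    """
    parent = {}  # one entry per gene, held in memory for the whole run

    def find(gene):
        parent.setdefault(gene, gene)
        # Path halving keeps the trees shallow.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for gene_a, gene_b, identity in hits:
        # find() registers both genes, even when the hit is below threshold.
        root_a, root_b = find(gene_a), find(gene_b)
        if identity >= min_identity and root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for gene in list(parent):
        clusters[find(gene)].add(gene)
    return sorted(sorted(members) for members in clusters.values())


example_hits = [
    ("g1", "g2", 95.0),
    ("g2", "g3", 91.0),
    ("g4", "g5", 40.0),  # below threshold: g4 and g5 stay separate
]
print(single_linkage_clusters(example_hits, min_identity=90.0))
```

With the example hits, g1, g2, and g3 end up in one cluster because g2 links the other two, while g4 and g5 remain singletons.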
 
Other major computations involve managing large amounts of stored file data, including reformatting and reordering them.

Please list the known limitations, obstacles, and bottlenecks of currently available HPC systems, and in particular, those at NERSC.

Disk I/O on networked shared file systems is currently the bottleneck in our data processing.

HPC Usage and Methods for the Next 3-5 Years

Anticipated changes to codes, mathematical methods and/or algorithms needed to achieve this project's scientific objectives.

We anticipate greater use of HMM searches, possibly with the HMMER v3.0 package, which is faster and more heuristic than previous versions. BLAST and manipulation of huge amounts of file data will continue to dominate our computations.
 

Computational Hours Required per Year
Anticipated Number of Cores to be Used in a Typical Production Run 900
Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above 340
Anticipated Total Memory Used per Run 256 GB
Anticipated Minimum Memory Required per Core 16 GB
Anticipated total data read & written per run 1500 GB
Anticipated size of checkpoint file(s) GB
Anticipated On-Line File Storage Required (Directly Accessible from a Running Job) 40 GB 1,000,000,000 Files
Anticipated Off-Line Archival Storage Required 40 GB 1,000,000,000 Files
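
The "Computational Hours Required per Year" field is left blank in this worksheet. As a rough check only, the core-hours of a single production run can be derived from the core counts and wallclock hours reported here, assuming a run holds all of its cores for the full wallclock time. That assumption, and the helper name below, are not from the worksheet.

```python
def core_hours_per_run(cores, wallclock_hours):
    """Core-hours for one run, assuming every core is held for the full run."""
    return cores * wallclock_hours


# Figures reported in this worksheet.
current_run = core_hours_per_run(cores=232, wallclock_hours=340)      # 78,880
anticipated_run = core_hours_per_run(cores=900, wallclock_hours=340)  # 306,000

# The current yearly total (300,000 core-hours) corresponds to roughly
# 3.8 runs of the current size.
current_runs_per_year = 300_000 / current_run

print(current_run, anticipated_run, round(current_runs_per_year, 1))
```

Under this assumption, one anticipated run (306,000 core-hours) costs about as much as the entire current yearly total.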

Known or Anticipated architectural requirements (e.g., 2 GB memory/core).

Please list any additional required or important software, services, or infrastructure beyond those listed in the previous section.

We need an experimental test bed for alternative large distributed file systems, such as the one used by Hadoop.

It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.

We would like to test the existing multi-threaded BLAST on a single host with a large amount of RAM (say 256 GB), so that a RAM disk can hold the BLAST databases, and with a large number of cores (say 1,000) for fast "on the fly" runtime BLAST for our UI applications.
Speeding up runtime BLAST, and thereby avoiding large amounts of precomputation and the subsequent disk data management problems, is one strategy we would like to test.
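
A minimal sketch of how such a test could be set up, assuming the NCBI BLAST+ command-line tools (the worksheet does not say which BLAST distribution is used). The sketch only builds the command line and does not run it. The RAM-disk path, query file name, database name, and thread count are hypothetical placeholders; `-num_threads`, `-db`, `-query`, `-evalue`, and `-outfmt` are standard BLAST+ options.

```python
import shlex


def blast_command(query, db, threads, program="blastp", evalue=1e-5):
    """Build a BLAST+ command line for a multi-threaded search on one host."""
    return [
        program,
        "-query", query,
        "-db", db,                     # database placed on a RAM disk
        "-num_threads", str(threads),  # thread count to vary during testing
        "-evalue", str(evalue),
        "-outfmt", "6",                # tabular output
    ]


# Hypothetical paths: /dev/shm is a common RAM-backed mount on Linux.
cmd = blast_command(
    query="query_gene.faa",
    db="/dev/shm/blastdb/img_proteins",
    threads=16,
)
print(shlex.join(cmd))
```

The command could then be handed to `subprocess.run(cmd)` on a host where BLAST+ is installed. How well the search scales with the thread count is exactly what the proposed test would need to measure.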

What Do You Need from NERSC?

Please tell us what you need from NERSC to meet your project's computing needs over the next 3-5 years. Also please feel free to make any general comments.