NERSC: Powering Scientific Discovery Since 1974

Lee Ann McCue

Case Study Worksheet

Project Information - Mathematical and computational models of transcription regulation

Document Prepared By Lee Ann McCue
Project Title Mathematical and computational models of transcription regulation
Principal Investigator Lee Ann McCue
Participating Organizations PNNL, Brown University, Wadsworth Center
Science Category Climate Environmental Science Biological Sciences
Funding Agencies DOE SC DOE NSA NSF NOAA NIH Other:

Project Summary (Scientific Objectives)

Please give a brief description of your project and its scientific objectives for the next 3-5 years.

The transcription regulatory network exerts the most fundamental control over the abundance of virtually all of a cell’s functional macromolecules. Comparative genomics has proven to be a powerful bioinformatics method with which to study transcription regulation. Our long-term goals are to extend these computational approaches with the inclusion of functional genomic and proteomic data, to identify the ensemble of solutions that have been explored by bacteria to regulate key metabolic processes, to extend our phylogenetic Gibbs sampling algorithms to reconstruct the joint posterior space of the ancestral states of regulatory motifs, and to develop point estimates and confidence limits for these discrete high-dimensional objects.

Current HPC Usage and Methods

Facilities Used NERSC NCCS ALCF NSF Centers
  • Other: Brown University Center for Computational Visualization
Architectures Used Cray XT IBM Power BlueGene
  • Linux Cluster
Other:
Total Computational Hours Used per Year Core-Hours
NERSC Hours Used per Year 0 Core-Hours
Number of Cores Used in Typical Production Run (10-140 nodes) X 2000 genes
Wallclock Hours of Single Typical Production Run 1000
Total Memory Used per Run 2/node GB
Minimum Memory Required per Core 2 GB
Total Data Read & Written per Run 0.1 GB
Size of Checkpoint File(s) NA GB
Amount of Data Moved In/Out of NERSC GB
How Often per month
On-Line File Storage Required (Directly Accessible from a Running Job) 0.01 GB 10-140, 1 file per random restart Files
Off-Line Archival Storage Required 0.5 GB 30,000 Files

Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.

Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)

Gibbs Centroid Sampler: a Gibbs sampling algorithm that typically runs 50 independent chains, although the number of chains may increase with problem size to several thousand. Each random restart requires approximately 10,000 iterations. Data sets vary in size from a few hundred nucleotides to several hundred thousand. Typical production runs consist of approximately 2000 genes (3-5000 bytes each), each running 50 independent chains of 10,000 iterations.
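
To make the "independent chains" structure concrete, here is a minimal sketch of a textbook Gibbs site sampler for motif finding, restarted from several random seeds. It is not the Gibbs Centroid Sampler itself: the function names (`motif_counts`, `run_chain`, `run_restarts`), the uniform 0.25 background, the pseudocount of 0.5, and the one-site-per-sequence assumption are all illustrative choices that do not come from this worksheet.

```python
import random

ALPHABET = "ACGT"

def motif_counts(seqs, starts, width, skip, pseudo=0.5):
    """Per-column base counts from every sequence's current site except `skip`."""
    counts = [{base: pseudo for base in ALPHABET} for _ in range(width)]
    for i, (seq, start) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for col in range(width):
            counts[col][seq[start + col]] += 1
    return counts

def run_chain(seqs, width, iterations, seed):
    """One independent chain: resample each sequence's site given all the others."""
    rng = random.Random(seed)  # a private generator keeps chains independent
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            counts = motif_counts(seqs, starts, width, skip=i)
            total = sum(counts[0].values())  # identical for every column
            weights = []
            for pos in range(len(seq) - width + 1):
                w = 1.0
                for col in range(width):
                    # motif probability over an assumed uniform background (0.25)
                    w *= (counts[col][seq[pos + col]] / total) / 0.25
                weights.append(w)
            # draw the new site with probability proportional to its weight
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

def run_restarts(seqs, width, n_chains, iterations):
    """Run `n_chains` random restarts; each one could run on a separate core."""
    return [run_chain(seqs, width, iterations, seed) for seed in range(n_chains)]
```

Because each call to `run_chain` depends only on its own seed, the restarts share no state. With the figures quoted above (about 2000 genes, 50 chains each, 10,000 iterations per chain), a production run amounts to roughly 100,000 independent chains and 10^9 sampling iterations in total.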
 
Bayesian Segmentation algorithm: an mN^2 algorithm, where N is the average length of a data sequence and m is the number of sequences.
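
The N^2 factor can be illustrated with a generic forward recursion that sums over every possible segmentation of one sequence. The sketch below is an assumption-laden stand-in, not the project's code: it treats the sequence as binary (for example GC versus AT), scores each segment with a Beta-Bernoulli marginal likelihood, and places an independent break probability `p_break` at each interior position. The names `segment_log_evidence`, `p_break`, `a`, and `b` are hypothetical.

```python
import math

def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def segment_log_evidence(bits, p_break=0.5, a=1.0, b=1.0):
    """Log marginal likelihood of `bits`, summed over all segmentations.

    Returns (log_evidence, number_of_segment_scores_evaluated).
    """
    n = len(bits)
    prefix = [0]  # prefix[k] = number of ones among the first k symbols
    for bit in bits:
        prefix.append(prefix[-1] + bit)
    log_f = [0.0]  # log_f[j] = log evidence of the first j symbols
    evaluations = 0
    for j in range(1, n + 1):
        terms = []
        for i in range(j):  # candidate last segment covers symbols i+1 .. j
            ones = prefix[j] - prefix[i]
            zeros = (j - i) - ones
            seg = log_beta(a + ones, b + zeros) - log_beta(a, b)
            prior = (j - i - 1) * math.log(1.0 - p_break)  # no break inside
            if i > 0:
                prior += math.log(p_break)  # a break just before the segment
            terms.append(log_f[i] + prior + seg)
            evaluations += 1
        log_f.append(logsumexp(terms))
    return log_f[n], evaluations
```

The double loop scores N(N+1)/2 candidate segments for a sequence of length N, so repeating it over m sequences gives work that grows as mN^2, consistent with the scaling stated above.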
 
Parallelism is expressed using MPI.

Please list the known limitations/obstacles/bottleneck of resources currently available HPC systems, and in particular, those at NERSC.

Node memory size for larger problems

HPC Usage and Methods for the Next 3-5 Years

Anticipated changes to codes, mathematical methods and/or algorithms needed to achieve this project's scientific objectives.

Proposed enhancements to the Gibbs sampling program (Bayesian estimation of phylogeny, etc.) will probably increase the computer time needed by a factor of up to 5.

Computational Hours Required per Year
Anticipated Number of Cores to be Used in a Typical Production Run 10-140
Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above 150
Anticipated Total Memory Used per Run 8/node GB
Anticipated Minimum Memory Required per Core 8 GB
Anticipated total data read & written per run 0.1 GB
Anticipated size of checkpoint file(s) NA GB
Anticipated On-Line File Storage Required (Directly Accessible from a Running Job) 0.1 GB 100-150 Files
Anticipated Off-Line Archival Storage Required 1 GB 100,000 Files

Known or Anticipated architectural requirements (e.g., 2 GB memory/core).

4-8 GB/node

Please list any additional required or important software, services, or infrastructure beyond those listed in the previous section.

It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.

Our algorithms tend to be inherently parallel (many independent genes in parallel, independent Gibbs sampling chains). They scale well with increasing cores/nodes.
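
One way to picture this decomposition is a static round-robin assignment of (gene, chain) pairs to workers, which needs no communication between workers. The sketch below is illustrative only: `tasks_for_rank` is a hypothetical helper, and in an MPI code `rank` and `n_ranks` would come from the communicator rather than being passed in by hand.

```python
def tasks_for_rank(rank, n_ranks, n_genes, n_chains):
    """Return the (gene, chain) pairs assigned to one worker, round-robin."""
    total = n_genes * n_chains
    # task t maps to gene t // n_chains and chain t % n_chains
    return [divmod(t, n_chains) for t in range(rank, total, n_ranks)]

# Figures taken from the worksheet: about 2000 genes and 50 chains per gene,
# spread over 140 workers (the upper end of the 10-140 range).
loads = [len(tasks_for_rank(r, 140, 2000, 50)) for r in range(140)]
print(min(loads), max(loads))
```

With those figures there are 100,000 independent tasks, so every worker receives 714 or 715 of them and the load stays balanced as the worker count changes.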

What Do You Need from NERSC?

Please tell us what you need from NERSC to meet your project's computing needs over the next 3-5 years. Also please feel free to make any general comments.