Lee Ann McCue
Case Study Worksheet
Project Information - Mathematical and computational models of transcription regulation
| Document Prepared By | Lee Ann McCue |
|---|---|
| Project Title | Mathematical and computational models of transcription regulation |
| Principal Investigator | Lee Ann McCue |
| Participating Organizations | PNNL, Brown University, Wadsworth Center |
| Science Category | Climate Environmental Science Biological Sciences |
| Funding Agencies | DOE SC DOE NSA NSF NOAA NIH Other: |
Project Summary (Scientific Objectives)
Please give a brief description of your project and its scientific objectives for the next 3-5 years.
The transcription regulatory network exerts the most fundamental control over the abundance of virtually all of a cell’s functional macromolecules. Comparative genomics has proven to be a powerful bioinformatics method with which to study transcription regulation. Our long-term goals are to extend these computational approaches by including functional genomic and proteomic data, to identify the ensemble of solutions that bacteria have explored to regulate key metabolic processes, to extend our phylogenetic Gibbs sampling algorithms to reconstruct the joint posterior space of the ancestral states of regulatory motifs, and to develop point estimates and confidence limits for these discrete high-dimensional objects.
Current HPC Usage and Methods
| Facilities Used | NERSC | NCCS | ACLF | NSF Centers | |
|---|---|---|---|---|---|
| Architectures Used | Cray XT | IBM Power | BlueGene | Other: | |
| Total Computational Hours Used per Year | Core-Hours | NERSC Hours Used per Year | 0 Core-Hours | ||
| Number of Cores Used in Typical Production Run | (10-140 nodes) X 2000 genes | Wallclock Hours of Single Typical Production Run | 1000 | ||
| Total Memory Used per Run | 2 GB/node | Minimum Memory Required per Core | 2 GB | ||
| Total Data Read & Written per Run | 0.1 GB | Size of Checkpoint File(s) | N/A | ||
| Amount of Data Moved In/Out of NERSC | GB | How Often | per month | ||
| On-Line File Storage Required (Directly Accessible from a Running Job) | 0.01 GB | 10-140 Files (1 file per random restart) | |||
| Off-Line Archival Storage Required | 0.5 GB | 30,000 Files | |||
Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.
Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)
Gibbs Centroid Sampler - a Gibbs sampling algorithm, typically run as 50 independent chains, although the number of chains may increase with problem size up to several thousand. Each random restart requires approximately 10,000 iterations. Data sets vary in size from a few hundred nucleotides to several hundred thousand. A typical production run covers approximately 2000 genes (3-5000 bytes each), with 50 independent chains of 10,000 iterations per gene.
Bayesian Segmentation algorithm - an mN^2 algorithm, where N is the average length of a data sequence and m is the number of sequences.
Parallelism is expressed using MPI.
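To make the scale of a typical production run concrete, the figures quoted above (2000 genes, 50 chains per gene, 10,000 iterations per chain) can be multiplied out. The sketch below also shows one possible way to divide genes among parallel workers. The round-robin split and the function names are assumptions made for illustration; the worksheet does not describe how the MPI code actually assigns work.

```python
# Figures quoted in the worksheet for a typical production run.
GENES = 2000
CHAINS_PER_GENE = 50
ITERATIONS_PER_CHAIN = 10_000


def total_iterations(genes=GENES, chains=CHAINS_PER_GENE,
                     iters=ITERATIONS_PER_CHAIN):
    """Total sampler iterations summed over all genes and chains."""
    return genes * chains * iters


def genes_for_rank(rank, n_ranks, n_genes=GENES):
    """Hypothetical round-robin split: rank r takes genes r, r + n_ranks, ...

    This is an illustrative assignment only, not the project's scheme.
    """
    return range(rank, n_genes, n_ranks)


if __name__ == "__main__":
    print("total iterations:", total_iterations())
    # The worksheet quotes 10-140 nodes; treat each as one worker here.
    for n_ranks in (10, 140):
        sizes = [len(genes_for_rank(r, n_ranks)) for r in range(n_ranks)]
        print(n_ranks, "workers:", min(sizes), "to", max(sizes), "genes each")
```

Under these assumptions the run totals one billion sampler iterations, and at 140 workers each worker handles 14 or 15 genes.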
Please list the known limitations/obstacles/bottlenecks of resources on currently available HPC systems, and in particular, those at NERSC.
Node memory size for larger problems
HPC Usage and Methods for the Next 3-5 Years
Anticipated changes to codes, mathematical methods and/or algorithms needed to achieve this project's scientific objectives.
Proposed enhancements to the Gibbs sampling program (Bayesian estimation of phylogeny, etc.) will probably increase the computer time needed by a factor of up to 5.
| Computational Hours Required per Year | ||
|---|---|---|
| Anticipated Number of Cores to be Used in a Typical Production Run | 10-140 | |
| Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above | 150 | |
| Anticipated Total Memory Used per Run | 8 GB/node | |
| Anticipated Minimum Memory Required per Core | 8 GB | |
| Anticipated total data read & written per run | 0.1 GB | |
| Anticipated size of checkpoint file(s) | N/A | |
| Anticipated On-Line File Storage Required (Directly Accessible from a Running Job) | 0.1 GB | 100-150 Files |
| Anticipated Off-Line Archival Storage Required | 1 GB | 100,000 Files |
Known or Anticipated architectural requirements (e.g., 2 GB memory/core).
4-8 GB/node
Please list any additional required or important software, services, or infrastructure beyond those listed in the previous section.
It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.
Our algorithms tend to be inherently parallel (many independent genes in parallel, independent Gibbs sampling chains). They scale well with increasing cores/nodes.
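The independence described here (separate genes, separate chains) means each (gene, chain) pair can be handed to any available worker without communication. The following is a minimal sketch of that pattern using Python's standard library. The `run_chain` body is a placeholder random walk, not the Gibbs sampler, and the task layout, seeding scheme, and function names are all assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_chain(task):
    """Stand-in for one independent chain; NOT the actual Gibbs sampler."""
    gene_id, chain_id, n_iter = task
    # Each task gets its own reproducible random stream (assumed scheme).
    rng = random.Random(f"{gene_id}:{chain_id}")
    state = 0.0
    for _ in range(n_iter):
        state += rng.gauss(0.0, 1.0)  # placeholder update step
    return gene_id, chain_id, state


def run_all(n_genes, n_chains, n_iter, workers=4):
    """Dispatch every (gene, chain) pair to a pool of workers."""
    tasks = [(g, c, n_iter) for g in range(n_genes) for c in range(n_chains)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in task order, whichever worker ran them.
        return list(pool.map(run_chain, tasks))


if __name__ == "__main__":
    for gene_id, chain_id, state in run_all(n_genes=3, n_chains=2, n_iter=1000):
        print(gene_id, chain_id, round(state, 3))
```

Because no task reads another task's state, the result does not depend on the number of workers. A thread pool keeps the sketch short; in CPython it would not speed up CPU-bound chains, so a real run would use processes or MPI ranks instead.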
What Do You Need from NERSC?
Please tell us what you need from NERSC to meet your project's computing needs over the next 3-5 years. Also please feel free to make any general comments.


